Reference
https://stackoverflow.com/questions/2718196/find-all-chinese-text-in-a-string-using-python-and-regex
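In Python 3, every str is a Unicode string.
Most commonly used Chinese characters fall in the range 0x4e00 ~ 0x9fff (the CJK Unified Ideographs block); rarer characters live in extension blocks outside this range.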
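So if you want to pull the Chinese characters out of a string, you can write:

import re

# Written as a method here; drop `self` to use it as a plain function.
def get_chinese(self, text):
    # Each match is a run of consecutive characters in U+4E00..U+9FFF.
    result = re.findall(r'[\u4e00-\u9fff]+', text)
    output = "".join(result)
    return output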
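
The range above misses characters in the CJK extension blocks. Here is a minimal sketch, assuming you also want Extension A (U+3400 ~ U+4DBF) and Extension B (U+20000 ~ U+2A6DF). The names `CJK_PATTERN` and `get_chinese_extended` are made up for illustration and do not come from the referenced answer.

```python
import re

# Hypothetical pattern: the basic block plus Extension A and Extension B.
# Code points above U+FFFF need the 8-digit \U escape.
CJK_PATTERN = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002a6df]+')

def get_chinese_extended(text):
    """Return all characters of text that fall in the ranges above, joined together."""
    return "".join(CJK_PATTERN.findall(text))

result = get_chinese_extended("abc中文123㐀𠀀def")
# result == "中文㐀𠀀"
```

Whether you need the extensions depends on your data; for everyday text the basic block is usually enough.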
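Addendum
Suppose you want to write this string to a cp950-encoded file,
but some of its characters cannot be represented in cp950. What should you do?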
with open("cp950.txt", mode="w", encoding="cp950") as f:
    f.write(text)
If you do nothing about it, you will get:
UnicodeEncodeError: 'cp950' codec can't encode character '\uff6d' in position 98: illegal multibyte sequence
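In fact, all you need to do is the following. Note that "ignore" silently drops every character that cp950 cannot encode:

# Encode to cp950, discarding anything that does not fit.
encode_text = text.encode("cp950", "ignore")
# Decode back to str; the result now contains only cp950-safe characters.
decode_text = encode_text.decode("cp950", "ignore")
with open("cp950.txt", mode="w", encoding="cp950") as f:
    f.write(decode_text)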
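
The same effect can probably be had without the round trip, since the built-in `open()` accepts an `errors` argument. The sketch below assumes that is acceptable for your use case; `unencodable_chars` and `write_cp950` are hypothetical helper names, not part of any library. The first helper lets you see what would be lost before writing.

```python
def unencodable_chars(text, encoding="cp950"):
    """Return the distinct characters of text that the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

def write_cp950(path, text, errors="ignore"):
    # errors="ignore" drops unencodable characters;
    # errors="replace" writes "?" in their place instead.
    with open(path, mode="w", encoding="cp950", errors=errors) as f:
        f.write(text)
```

For example, `unencodable_chars(text)` on the string from the error above would include '\uff6d', and `write_cp950("cp950.txt", text)` writes the rest of the text without raising.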