At work I often need to convert fullwidth characters to their halfwidth equivalents. I never handled this well in the past, but recently I put together a method that works reasonably well.
First of all, I was not about to do the fullwidth-to-halfwidth conversion by hand: far too tedious. So I had to find a tool, and unicodedata is a very good one. Its normalize function provides normalization operations on characters, such as normalizing Japanese halfwidth katakana to their fullwidth forms, or decomposing accented Latin letters into a base letter plus combining marks, and so on. After reading the docs, I decided to normalize my strings with the NFKC form.
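To get a feel for what NFKC actually does, here is a minimal sketch (the expected outputs in the comments reflect standard CPython behavior):

import unicodedata

# Fullwidth ASCII letters fold to their halfwidth forms.
print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ"))  # -> Hello
# Halfwidth katakana are composed back into fullwidth katakana.
print(unicodedata.normalize("NFKC", "ｶﾞ"))  # -> ガ
# Compatibility ligatures are expanded as well.
print(unicodedata.normalize("NFKC", "ﬁne"))  # -> fine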
I thought that would be the end of it, but in practice many cases still did not behave as I wanted; for example, the Chinese full stop (。) is not normalized to an English period.
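A quick check confirms the gap: CJK punctuation has no compatibility decomposition, so NFKC passes it through untouched.

import unicodedata

# The ideographic full stop, em dashes, and curly quotes all survive NFKC.
print(unicodedata.normalize("NFKC", "。—“”"))  # -> 。—“”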
So for the parts that cannot be handled automatically, I had to fend for myself: collect the cases I have run into into a single mapping and resolve them all in one go with str.translate.
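In case str.translate is unfamiliar: it takes a mapping from Unicode ordinals to ordinals, strings, or None (None deletes the character), which is exactly what we need here. A tiny illustration:

# Replace an em dash with a hyphen and delete a non-breaking space.
print("a—b\u00a0c".translate({0x2014: ord("-"), 0xA0: None}))  # -> a-bc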
So the code looks roughly like this:
import unicodedata

FULL_HALF_MAP = {
    # Range U+FF01–FF5E reproduces the characters of ASCII 0x21 to 0x7E as
    # fullwidth forms, that is, the fixed-width forms used in CJK computing.
    # This is useful for typesetting Latin characters in a CJK environment.
    # U+FF00 does not correspond to a fullwidth ASCII 0x20 (space character),
    # since that role is already fulfilled by U+3000 "ideographic space".
    **{i + 0xFEE0: i for i in range(0x21, 0x7F)},
    0x3000: 0x20,
}

FULL_HALF_ENHANCE_MAP = {
    **FULL_HALF_MAP,
    # Some custom mappings. str.translate values may be ordinals, strings,
    # or None, so deletions ("") and replacements can be mixed freely.
    0xA0: "",                      # non-breaking space: delete
    ord("。"): ord("."),            # ideographic full stop
    0xFFFD: "",                    # U+FFFD replacement character: delete
    **{i: ord("-") for i in (0x2010, 0x2013, 0x2014, 0x2212)},  # hyphen, dashes, minus sign
    0x2022: ord("·"),              # bullet to middle dot
    **{ord(c): ord('"') for c in "“”"},  # curly double quotes
}

def full_to_half(txt: str, enhance: bool = False) -> str:
    # Normalize with NFKC first; this already folds U+FF01–FF5E and U+3000,
    # so FULL_HALF_MAP is more of a safety net than a necessity.
    # NOTE: NFKC has side effects beyond width folding (e.g. "ﬁ" -> "fi").
    # Ref: https://en.wikipedia.org/wiki/Unicode_equivalence
    txt = unicodedata.normalize("NFKC", txt)
    return txt.translate(FULL_HALF_ENHANCE_MAP if enhance else FULL_HALF_MAP)
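To see both paths in action, a couple of sample calls (the input strings are made up for illustration):

print(full_to_half("Ｈｅｌｌｏ，　ｗｏｒｌｄ！"))
# -> Hello, world!   (NFKC alone already covers fullwidth ASCII and U+3000)
print(full_to_half("第１条——引用“原文”。", enhance=True))
# -> 第1条--引用"原文".   (dashes, curly quotes, and 。 need the enhanced map)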