Converting Full-Width Characters to Half-Width Characters

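At work I often need to convert full-width characters to half-width ones. For a long time I never handled this well. Recently I came up with a method that works reasonably well.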

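First of all, I was never going to do the full-width to half-width conversion by hand: far too tedious... So I had to find a tool, and unicodedata is a good one. Its normalize function performs Unicode normalization on strings, for example normalizing Japanese half-width kana to full-width kana, or decomposing accented letters in European languages into a base letter plus combining marks, and so on. After reading the docs I decided to normalize strings with the NFKC form.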
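
As a small sketch of what normalize does with different forms (the sample strings are my own picks for illustration):

```python
import unicodedata

# NFKC folds compatibility characters into their ordinary forms.
fullwidth = unicodedata.normalize("NFKC", "ＡＢＣ１２３")  # full-width Latin letters and digits
katakana = unicodedata.normalize("NFKC", "\uff76")         # half-width katakana KA

# NFD splits an accented letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")        # e with acute accent

print(fullwidth)                           # ABC123
print(katakana == "\u30ab")                # True (full-width katakana KA)
print([hex(ord(c)) for c in decomposed])   # ['0x65', '0x301']
```

Note that the decomposition example uses NFD; NFKC itself recomposes such pairs where a precomposed character exists.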

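I thought that settled it, but in practice a lot still did not come out the way I wanted. For example, the Chinese full stop (。) is not normalized to the English period.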
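
A quick check makes the gap visible. This is only a sketch over a few characters I chose; it is not an exhaustive list of what NFKC leaves alone:

```python
import unicodedata

# Compare each character with its NFKC form.
results = {}
for ch in "！。“”":
    out = unicodedata.normalize("NFKC", ch)
    results[ch] = out
    status = "changed" if out != ch else "unchanged"
    print(f"U+{ord(ch):04X} -> U+{ord(out):04X} ({status})")
```

The full-width exclamation mark becomes an ASCII "!", while the ideographic full stop and the curly quotation marks pass through untouched, because they have no compatibility mapping to ASCII.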

So whatever cannot be handled automatically has to be handled by hand: gather the cases I have run into and fix them all in one go with str.translate.
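
For reference, a minimal sketch of how str.translate takes a mapping (the table and sample text here are made up for illustration). Keys are code points; values can be a code point, a string, or None to delete the character:

```python
table = {
    ord("。"): ord("."),  # code point -> code point
    ord("“"): '"',        # code point -> string
    0xFFFD: None,         # None deletes the character
}

text = "“OK\ufffd。"
result = text.translate(table)
print(result)  # "OK.
```

Characters that are not in the table are left as they are, so the table only needs to list the exceptions.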


So the code looks roughly like this:

import unicodedata

FULL_HALF_MAP = {
    # Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms,
    # that is, a fixed width form used in CJK computing. This is useful for typesetting
    # Latin characters in a CJK environment. U+FF00 does not correspond to a fullwidth
    # ASCII 20 (space character), since that role is already fulfilled by U+3000
    # "ideographic space."
    **{i + 0xFEE0: i for i in range(0x21, 0x7F)},
    **{0x3000: 0x20}
}
FULL_HALF_ENHANCE_MAP = {
    **FULL_HALF_MAP,
    # Some custom mappings
    0xa0: "",  # The non-breaking space
    ord("。"): ord("."),  # The full stop mark
    65533: "",  # The replacement of invalid character
    **{i: 45 for i in [8208, 8211, 8212, 8722]},  # Some dash marks
    **{i: 183 for i in [8226]},  # Some dot marks
    **{ord(i): ord("\"") for i in "“”"},  # The quotation marks
}


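def full_to_half(txt: str, enhance: bool = False) -> str:
    """Convert full-width characters in txt to half-width; enhance=True adds the custom mappings."""
    # Normalize the text with NFKC first.
    # NOTICE: NFKC is not limited to full-width forms. It folds every compatibility
    # character, e.g. the ligature "ﬁ" becomes "fi" and U+00A0 becomes a plain space.
    # Ref: https://en.wikipedia.org/wiki/Unicode_equivalence
    txt = unicodedata.normalize('NFKC', txt)
    return txt.translate(FULL_HALF_ENHANCE_MAP if enhance else FULL_HALF_MAP)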
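
To see what each stage contributes, here is a self-contained sketch. It uses a trimmed-down table (EXTRA) with only the full stop, dashes, and quotation marks rather than the full enhanced map above, and the name to_half and the sample string are just for this illustration:

```python
import unicodedata

# Trimmed-down illustration table; not the full FULL_HALF_ENHANCE_MAP.
EXTRA = {
    ord("。"): ord("."),
    **{i: ord("-") for i in (0x2010, 0x2013, 0x2014, 0x2212)},
    **{ord(c): ord('"') for c in "“”"},
}


def to_half(txt: str, enhance: bool = False) -> str:
    # Stage 1: NFKC handles full-width ASCII forms and the ideographic space.
    txt = unicodedata.normalize("NFKC", txt)
    # Stage 2 (optional): hand-collected mappings for what NFKC leaves alone.
    return txt.translate(EXTRA) if enhance else txt


sample = "Ｈｅｌｌｏ，ｗｏｒｌｄ！\u3000“２０２４”。"
print(to_half(sample))                # Hello,world! “2024”。
print(to_half(sample, enhance=True))  # Hello,world! "2024".
```

The first call shows that NFKC alone already covers the letters, digits, comma, exclamation mark, and ideographic space; only the quotation marks and the full stop need the extra table.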
