Using the Python unicodedata module

2022-06-23 14:00:54

unicodedata.lookup(name)
unicodedata.name(chr[,default])
unicodedata.decimal(chr[, default])
unicodedata.digit(chr[, default])
unicodedata.numeric(chr[, default])
unicodedata.category(chr)
unicodedata.bidirectional(chr)
unicodedata.combining(chr)
unicodedata.east_asian_width(chr)
unicodedata.mirrored(chr)
unicodedata.decomposition(chr)
unicodedata.normalize(form, unistr)
unicodedata.unidata_version

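Introduction to the UCD

UCD stands for Unicode Character Database.

The UCD consists of plain-text and HTML files that describe the properties of Unicode characters and the relationships between them.

Most of the text files in the UCD contain Unicode data meant to be parsed by programs. The HTML files explain how the database is organized and what the data formats and fields mean.

By far the largest file in the UCD is Unihan.txt, which describes the properties of Han characters.

In UCD 5.0.0, Unihan.txt is 28,221 KB. It contains many useful indexes, such as radical, stroke count, pinyin, usage frequency, and four-corner code ordering. These indexes are based on authoritative dictionaries, but most of them cover only a subset of the Han characters.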

unicodedata.lookup(name)

Looks up a character by name. If a character with that name exists, it is returned; otherwise KeyError is raised.

>>> import unicodedata
>>> print(unicodedata.lookup('LEFT CURLY BRACKET'))
{
>>> print(unicodedata.lookup('LEFT'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'LEFT'"
>>>
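As a minimal sketch of how the KeyError might be handled in practice, the hypothetical helper `char_by_name` below (it is not part of `unicodedata`) returns a fallback value instead of letting the exception propagate. It also illustrates that names are matched case-insensitively and that the `\N{...}` string escape resolves the same names.

```python
import unicodedata

def char_by_name(name, fallback=None):
    """Hypothetical helper: return the character called `name`, or `fallback`."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() raises KeyError for unknown or partial names
        return fallback

brace = char_by_name('LEFT CURLY BRACKET')
missing = char_by_name('LEFT', fallback='?')
lower = char_by_name('left curly bracket')   # lookup ignores case
escaped = '\N{LEFT CURLY BRACKET}'           # same name, resolved by the string escape
print(brace, missing, lower, escaped)
```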

unicodedata.name(chr[,default])

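Looks up the name of a character. Returns the name on success; if the character has no name, default is returned if given, otherwise ValueError is raised. The argument must be a single character, so a longer string raises TypeError, as in the last example below.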

>>> import unicodedata
>>> print(unicodedata.name('{'))
LEFT CURLY BRACKET
>>> print(unicodedata.name('@'))
COMMERCIAL AT
>>> print(unicodedata.name('{{'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: name() argument 1 must be a unicode character, not str
>>>
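The optional default argument is not exercised above. The sketch below assumes a hypothetical helper `describe` and a placeholder string `'<unnamed>'` chosen only for illustration; control characters such as the newline have no name, so they receive the placeholder instead of raising ValueError.

```python
import unicodedata

def describe(text):
    """Hypothetical helper: list (character, name) pairs for every character in text."""
    # The second argument to name() is returned when the character has no name.
    return [(ch, unicodedata.name(ch, '<unnamed>')) for ch in text]

pairs = describe('a@\n')
for ch, name in pairs:
    print(repr(ch), name)

# name() and lookup() are inverses for characters that have a name
round_trip = unicodedata.lookup(unicodedata.name('@'))
```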

unicodedata.decimal(chr[, default])

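Returns the decimal value assigned to the character, as an integer. If the character has no decimal value, default is returned if given; otherwise ValueError is raised.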

>>> import unicodedata
>>> print(unicodedata.decimal('7'))
7
>>> print(unicodedata.decimal('7a'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decimal() argument 1 must be a unicode character, not str
>>>

unicodedata.digit(chr[, default])

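Returns the digit value assigned to the character, as an integer; for example, the characters 0 to 9 map to the corresponding values. If the character has no digit value, default is returned if given; otherwise ValueError is raised.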

>>> import unicodedata
>>> print(unicodedata.digit('9', None))
9
>>> print(unicodedata.digit('9a', None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: digit() argument 1 must be a unicode character, not str
>>>

unicodedata.numeric(chr[, default])

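Returns the numeric value assigned to the character, as a float. For example, both '8' and '四' have numeric values. Unlike digit(), it accepts any character that represents a number, not only the digits 0 to 9. If the character has no numeric value, default is returned if given; otherwise ValueError is raised.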

>>> import unicodedata
>>> print(unicodedata.numeric('四', None))
4.0
>>> print(unicodedata.numeric('8', None))
8.0
>>> print(unicodedata.numeric('8a', None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: numeric() argument 1 must be a unicode character, not str
>>>
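To make the difference between decimal(), digit() and numeric() concrete, the following sketch calls all three on the same characters. `numeric_views` is a hypothetical helper name, and None is passed as the default so that an undefined value shows up as None rather than as a ValueError.

```python
import unicodedata

def numeric_views(ch):
    """Hypothetical helper: return (decimal, digit, numeric) for one character."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

samples = [
    '7',        # DIGIT SEVEN
    '\u00b2',   # SUPERSCRIPT TWO
    '\u00bd',   # VULGAR FRACTION ONE HALF
    '四',       # CJK ideograph for four
    'a',        # not a number at all
]
for ch in samples:
    print(repr(ch), numeric_views(ch))
```

The superscript two has a digit value but no decimal value, while the fraction and the ideograph only have a numeric value.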

unicodedata.category(chr)

Returns the general category assigned to the character in Unicode, as a string. The categories are:

Code Description

[Cc] Other, Control

[Cf] Other, Format

[Cn] Other, Not Assigned (no characters in the file have this property)

[Co] Other, Private Use

[Cs] Other, Surrogate

[LC] Letter, Cased

[Ll] Letter, Lowercase

[Lm] Letter, Modifier

[Lo] Letter, Other

[Lt] Letter, Titlecase

[Lu] Letter, Uppercase

[Mc] Mark, Spacing Combining

[Me] Mark, Enclosing

[Mn] Mark, Nonspacing

[Nd] Number, Decimal Digit

[Nl] Number, Letter

[No] Number, Other

[Pc] Punctuation, Connector

[Pd] Punctuation, Dash

[Pe] Punctuation, Close

[Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)

[Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)

[Po] Punctuation, Other

[Ps] Punctuation, Open

[Sc] Symbol, Currency

[Sk] Symbol, Modifier

[Sm] Symbol, Math

[So] Symbol, Other

[Zl] Separator, Line

[Zp] Separator, Paragraph

[Zs] Separator, Space

>>> import unicodedata
>>> print(unicodedata.category('四'))
Lo
>>> print(unicodedata.category('8'))
Nd
>>> print(unicodedata.category('a'))
Ll
>>>
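One common use of category() is to classify or filter text by the first letter of the code (L for letters, N for numbers, P for punctuation, and so on). The sketch below is only an illustration; `category_counts` and `strip_punctuation` are hypothetical helper names.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Hypothetical helper: count how many characters fall in each general category."""
    return Counter(unicodedata.category(ch) for ch in text)

def strip_punctuation(text):
    """Hypothetical helper: drop every character whose category starts with 'P'."""
    return ''.join(ch for ch in text if not unicodedata.category(ch).startswith('P'))

sample = 'Hello, 世界 42!'
counts = category_counts(sample)
cleaned = strip_punctuation(sample)
print(dict(counts))
print(cleaned)
```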

unicodedata.bidirectional(chr)

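Returns the bidirectional class assigned to the character, which determines whether it is laid out left to right or right to left. If no such value is defined, an empty string is returned.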

>>> import unicodedata
>>> print(unicodedata.bidirectional('9'))
EN
>>>
>>> print(unicodedata.bidirectional(u'\u0660'))
AN
>>>
>>> print(unicodedata.bidirectional('中'))
L
>>>
>>> print(unicodedata.bidirectional('a'))
L
>>>
>>> print(unicodedata.category(u'\u0660'))
Nd
>>>

Here EN means European Number, AN means Arabic Number, and L means Left-to-Right. Nd is the general category Number, Decimal Digit.
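As a rough sketch of how these classes might be used, the hypothetical helper `has_rtl` below reports whether a string contains any strongly right-to-left character, which are the classes 'R' (for example Hebrew letters) and 'AL' (Arabic letters). It is only a quick check and does not implement the Unicode Bidirectional Algorithm.

```python
import unicodedata

# Strong right-to-left classes: 'R' (right-to-left) and 'AL' (Arabic letter)
RTL_CLASSES = {'R', 'AL'}

def has_rtl(text):
    """Hypothetical helper: True if any character is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

print(has_rtl('abc 123'))       # Latin letters and European digits only
print(has_rtl('abc \u05d0'))    # contains HEBREW LETTER ALEF
print(has_rtl('\u0627'))        # ARABIC LETTER ALEF
print(has_rtl('\u0660'))        # ARABIC-INDIC DIGIT ZERO is class AN, not a letter
```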

unicodedata.combining(chr)

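Returns the canonical combining class assigned to the character, as an integer; returns 0 if no combining class is defined. During normalization, combining marks are sorted by this value, with larger values placed after smaller ones.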

>>> import unicodedata
>>> print(unicodedata.combining('9'))
0
>>>
>>> print(unicodedata.combining('A'))
0
>>>
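The examples above only show characters with class 0. The sketch below uses two combining marks with non-zero classes and shows the reordering that normalization performs; the variable names are chosen for illustration only.

```python
import unicodedata

acute = '\u0301'     # COMBINING ACUTE ACCENT
cedilla = '\u0327'   # COMBINING CEDILLA

acute_class = unicodedata.combining(acute)
cedilla_class = unicodedata.combining(cedilla)
print(acute_class, cedilla_class)

# The acute accent is typed first, but it has the larger combining class,
# so canonical ordering moves it after the cedilla.
unordered = 'a' + acute + cedilla
ordered = unicodedata.normalize('NFD', unordered)
classes = [unicodedata.combining(ch) for ch in ordered]
print(classes)
```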

unicodedata.east_asian_width(chr)

Returns the East Asian width assigned to the character. The possible values are:

'F' (Fullwidth), 'H' (Halfwidth), 'W' (Wide), 'Na' (Narrow), 'A' (Ambiguous) or 'N' (Neutral).

>>> import unicodedata
>>> print(unicodedata.east_asian_width('9'))
Na
>>>
>>> print(unicodedata.east_asian_width('A'))
Na
>>>
>>> print(unicodedata.east_asian_width('蔡'))
W
>>>
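A typical use of this property is estimating how many terminal columns a string occupies. The following is a minimal sketch under the assumption that Wide and Fullwidth characters take two columns and everything else takes one; `display_width` is a hypothetical helper, and it ignores zero-width combining marks and the context-dependent Ambiguous class.

```python
import unicodedata

def display_width(text):
    """Hypothetical helper: estimate columns, 2 for 'W'/'F' characters, 1 otherwise."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        for ch in text
    )

print(display_width('abc'))       # three narrow characters
print(display_width('蔡A'))       # one wide character plus one narrow character
print(display_width('\uff21'))    # FULLWIDTH LATIN CAPITAL LETTER A
```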

unicodedata.mirrored(chr)

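Returns the mirrored property assigned to the character: 1 if the character is a mirrored character in bidirectional text, 0 otherwise.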

>>> import unicodedata
>>> print(unicodedata.mirrored('9'))
0
>>>
>>> print(unicodedata.mirrored('A'))
0
>>>
>>> print(unicodedata.mirrored('蔡'))
0
>>>
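All of the characters above return 0. For contrast, the short sketch below picks out characters that do have the mirrored property, such as brackets and the less-than sign; the sample string is arbitrary.

```python
import unicodedata

for ch in '([<9A':
    print(repr(ch), unicodedata.mirrored(ch))

# Collect the characters of an expression that would be mirrored in right-to-left text
mirrored_chars = [ch for ch in 'f(x) < [y]' if unicodedata.mirrored(ch)]
print(mirrored_chars)
```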

unicodedata.decomposition(chr)

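Returns the decomposition mapping assigned to the character, as a string of hexadecimal code points separated by spaces, optionally preceded by a compatibility tag such as <compat>. If the character cannot be decomposed, an empty string is returned.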

>>> import unicodedata
>>> print(unicodedata.decomposition('9'))

>>>
>>> print(unicodedata.decomposition('-'))

>>>
>>> print(unicodedata.decomposition('蔡'))

>>>
>>> print(unicodedata.decomposition('ガ'))
30AB 3099
>>>
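Because the result is a plain string, it usually has to be parsed before it can be used. The sketch below assumes a hypothetical helper `parse_decomposition` that separates the optional tag from the code points and converts the code points back into characters.

```python
import unicodedata

def parse_decomposition(ch):
    """Hypothetical helper: return (tag, characters); tag is None for canonical mappings."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)   # compatibility tag such as '<fraction>'
    return tag, ''.join(chr(int(code, 16)) for code in fields)

print(parse_decomposition('\u30ac'))   # KATAKANA LETTER GA
print(parse_decomposition('\u00bd'))   # VULGAR FRACTION ONE HALF
print(parse_decomposition('9'))        # no decomposition
```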

unicodedata.normalize(form, unistr)

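Returns the normal form of the Unicode string unistr. The supported forms are NFC, NFKC, NFD and NFKD. Some text elements can be represented either as a static precomposed form or as a dynamically composed sequence, so different sequences of Unicode characters can be equivalent. When two or more sequences are equivalent, the Unicode standard does not designate one of them as the correct one; each is simply equivalent to the others.

When a single unique representation is needed, a normalized form of Unicode text can be used to eliminate unwanted distinctions. The Unicode standard defines four normalization forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and NFKD decompose characters where possible, while NFC and NFKC compose them where possible.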

>>> import unicodedata
>>> print(unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore'))
b'aa'
>>>

>>> title = "Klüft skräms inför på fédéral électoral große"
>>> print(title.encode('ascii', 'ignore'))
b'Klft skrms infr p fdral lectoral groe'
>>> # Note that many characters have been dropped
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
b'Kluft skrams infor pa federal electoral groe'
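The equivalence described above can be checked directly. The sketch below compares a precomposed character with its decomposed sequence, shows a compatibility mapping, and includes a hypothetical helper `strip_accents` that removes combining marks after NFD instead of encoding to ASCII, so characters without a decomposition (such as ß) are kept rather than dropped.

```python
import unicodedata

composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize('NFC', composed)
             == unicodedata.normalize('NFC', decomposed))
nfd_length = len(unicodedata.normalize('NFD', composed))

# Compatibility forms also replace characters such as ligatures
ligature = unicodedata.normalize('NFKC', '\ufb01')   # LATIN SMALL LIGATURE FI

def strip_accents(text):
    """Hypothetical helper: decompose, then drop characters with a combining class."""
    parts = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in parts if unicodedata.combining(ch) == 0)

print(raw_equal, nfc_equal, nfd_length, ligature)
print(strip_accents('Klüft skräms große'))
```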

unicodedata.unidata_version

The version of the Unicode database used by this module.

unicodedata.ucd_3_2_0

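An object that provides the same methods as the module but uses version 3.2 of the Unicode database, for compatibility with older applications such as IDNA that depend on that version.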

>>> import unicodedata
>>> print(unicodedata.unidata_version)
9.0.0
>>>
>>> print(unicodedata.ucd_3_2_0)
<unicodedata.UCD object at 0x00000215E3EA3B70>
>>>
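The difference between the two databases shows up for characters added after Unicode 3.2. The sketch below uses the Indian rupee sign, which was added in Unicode 6.0, as an illustrative choice; the reported current version depends on the Python build.

```python
import unicodedata

old = unicodedata.ucd_3_2_0
rupee = '\u20b9'   # INDIAN RUPEE SIGN, added in Unicode 6.0

print(unicodedata.unidata_version)   # depends on the Python version in use
print(old.unidata_version)

# The same method on the two objects can give different answers
current_category = unicodedata.category(rupee)
old_category = old.category(rupee)   # unassigned as far as Unicode 3.2 is concerned
print(current_category, old_category)
```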

Now let's look closely at the Unicode data for a single character:

U+0062 is the Unicode hex value of the character Latin Small Letter B, which is categorized as “lowercase letter” in the Unicode 6.0 character table.

Unicode Character Information

Unicode Hex U+0062

Character Name LATIN SMALL LETTER B

General Category Lowercase Letter [Code: Ll]

Canonical Combining Class 0

Bidirectional Category L

Mirrored N

Uppercase Version U+0042

Titlecase Version U+0042

Unicode Character Encodings

Latin Small Letter B HTML Entity &#98; (decimal entity), &#x62; (hex entity)

Windows Key Code Alt 0098 or Alt +0062

Programming Source Code Encodings Python hex: u"\u0062", Hex for C++ and Java: "\u0062"

UTF-8 Hexadecimal Encoding 0x62
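Most of the fields in this listing can be reproduced with the functions covered in this article. The following is a minimal sketch; `char_info` and its dictionary keys are hypothetical names chosen for illustration, and the uppercase mapping comes from `str.upper()` rather than from `unicodedata`.

```python
import unicodedata

def char_info(ch):
    """Hypothetical helper: collect a few Unicode properties of one character."""
    return {
        'hex': 'U+%04X' % ord(ch),
        'name': unicodedata.name(ch, '<unnamed>'),
        'category': unicodedata.category(ch),
        'combining': unicodedata.combining(ch),
        'bidirectional': unicodedata.bidirectional(ch),
        'mirrored': 'Y' if unicodedata.mirrored(ch) else 'N',
        # upper() may return more than one character, so format each of them
        'uppercase': ' '.join('U+%04X' % ord(c) for c in ch.upper()),
        'utf8': ch.encode('utf-8').hex(),
    }

info = char_info('b')
for key, value in info.items():
    print(key, value)
```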