In this guide, you'll explore Python's unicodedata module to work with Unicode characters. Learn its key functions and examples for handling Unicode data.
The unicodedata module in Python provides access to the Unicode Character Database, which contains detailed information about every character defined in the Unicode standard. This module can be used to retrieve properties of Unicode characters, normalize Unicode strings, and perform various other operations related to Unicode data.
Table of Contents
- Introduction
- unicodedata Module Functions
  - unicodedata.lookup
  - unicodedata.name
  - unicodedata.decimal
  - unicodedata.digit
  - unicodedata.numeric
  - unicodedata.category
  - unicodedata.bidirectional
  - unicodedata.combining
  - unicodedata.mirrored
  - unicodedata.east_asian_width
  - unicodedata.decomposition
  - unicodedata.normalize
  - unicodedata.unidata_version
- More Examples
  - Using lookup
  - Using name
  - Using decimal, digit, and numeric
  - Using category
  - Using bidirectional
  - Using combining
  - Using mirrored
  - Using east_asian_width
  - Using decomposition
  - Using normalize
- Real-World Use Case
- Conclusion
Introduction
The unicodedata module is a part of Python's standard library that allows you to work with Unicode data. Unicode is a standard for representing text in different writing systems.
This module provides various functions to query properties of Unicode characters, such as their names, categories, numeric values, and more. It also includes functions to normalize Unicode strings, which is essential for consistent text processing.
unicodedata Module Functions
unicodedata.lookup
Looks up a character by name and returns the corresponding character.
import unicodedata
char = unicodedata.lookup('GREEK SMALL LETTER ALPHA')
print(char)
Output:
α
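When the name is not known to the database, lookup raises a KeyError. The following is a minimal sketch of a wrapper that returns a fallback instead; the helper name safe_lookup and its default parameter are illustrative assumptions, not part of the module.

```python
import unicodedata

def safe_lookup(name, default=None):
    """Return the character called `name`, or `default` if the name is unknown.

    `safe_lookup` is a hypothetical helper, not a unicodedata function.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() signals an unknown character name with KeyError
        return default

snowman = safe_lookup('SNOWMAN')
missing = safe_lookup('NOT A REAL CHARACTER NAME')
print(snowman, missing)
```

Returning a default keeps calling code free of try/except blocks when names come from untrusted input such as configuration files.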
unicodedata.name
Returns the name of a character. If no name is defined, raises a ValueError unless a default is passed as the second argument.
import unicodedata
name = unicodedata.name('α')
print(name)
Output:
GREEK SMALL LETTER ALPHA
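Some characters, such as control characters, have no name. As a small sketch, the optional second argument of name can supply a fallback so that no exception is raised; the describe helper and the '<unnamed>' placeholder are assumptions made for illustration.

```python
import unicodedata

def describe(char):
    """Return the Unicode name of `char`, or a placeholder if it has none."""
    # The second argument is returned instead of raising ValueError
    return unicodedata.name(char, '<unnamed>')

labels = [describe(c) for c in ['α', '\n', '€']]
print(labels)
```

The newline character has no name in the database, so it falls back to the placeholder.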
unicodedata.decimal
Returns the decimal value of a character as an integer. If no such value is defined, raises a ValueError unless a default is passed as the second argument.
import unicodedata
decimal_value = unicodedata.decimal('5')
print(decimal_value)
Output:
5
unicodedata.digit
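Returns the digit value of a character as an integer. If no such value is defined, raises a ValueError unless a default is passed as the second argument.
import unicodedata
digit_value = unicodedata.digit('⑤')  # CIRCLED DIGIT FIVE; Roman numerals such as 'Ⅴ' have only a numeric value
print(digit_value)
Output:
5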
unicodedata.numeric
Returns the numeric value of a character as a float. If no such value is defined, raises a ValueError unless a default is passed as the second argument.
import unicodedata
numeric_value = unicodedata.numeric('⅕')
print(numeric_value)
Output:
0.2
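The three numeric functions differ in how many characters they accept: decimal is the strictest and numeric the most permissive. The sketch below queries all three with a default of None so the difference is visible side by side; numeric_profile is a hypothetical helper name.

```python
import unicodedata

def numeric_profile(char):
    """Return (decimal, digit, numeric) for `char`, using None where undefined."""
    return (
        unicodedata.decimal(char, None),
        unicodedata.digit(char, None),
        unicodedata.numeric(char, None),
    )

# ASCII digit, superscript two, vulgar fraction, Roman numeral five, a letter
for ch in ['5', '²', '½', 'Ⅴ', 'x']:
    print(repr(ch), numeric_profile(ch))
```

In this sample, only '5' has a decimal value, '²' has a digit value but no decimal value, and the fraction and the Roman numeral have only a numeric value.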
unicodedata.category
Returns the general category assigned to the character.
import unicodedata
category = unicodedata.category('α')
print(category)
Output:
Ll
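General categories are two-letter codes such as Lu (uppercase letter), Ll (lowercase letter), Nd (decimal digit), Po (other punctuation), and Zs (space separator). As an illustrative sketch, the following tallies the categories that occur in a string; category_counts is an assumed helper name.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count how many characters of `text` fall into each general category."""
    return Counter(unicodedata.category(ch) for ch in text)

counts = category_counts('Hello, World 42!')
print(counts)
```

A tally like this can be a quick way to check what kinds of characters an input contains before deciding how to clean it.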
unicodedata.bidirectional
Returns the bidirectional class assigned to the character.
import unicodedata
bidi_class = unicodedata.bidirectional('α')
print(bidi_class)
Output:
L
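One possible use of the bidirectional class is detecting right-to-left text. The sketch below treats the classes R (right-to-left) and AL (right-to-left Arabic) as right-to-left; the contains_rtl helper is an assumption for illustration and is not a full implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata

# Bidirectional classes used by strongly right-to-left characters
RTL_CLASSES = {'R', 'AL'}

def contains_rtl(text):
    """Return True if any character in `text` is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

print(contains_rtl('hello'))   # Latin letters only
print(contains_rtl('שלום'))    # Hebrew letters
print(contains_rtl('مرحبا'))   # Arabic letters
```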
unicodedata.combining
Returns the canonical combining class assigned to the character.
import unicodedata
combining_class = unicodedata.combining('\u0301')  # Combining acute accent
print(combining_class)
Output:
230
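A combining class of 0 means the character is not a combining mark. A common application, sketched here under the assumption that dropping every combining mark is acceptable for your text, is stripping accents: decompose the string with NFD, then keep only characters whose combining class is 0. The name strip_marks is illustrative.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (accents) from `text`."""
    # NFD splits precomposed characters into a base character plus marks
    decomposed = unicodedata.normalize('NFD', text)
    # Keep only characters that are not combining marks
    return ''.join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

plain = strip_marks('café naïve')
print(plain)
```

This changes the meaning of words in many languages, so it tends to suit search keys or slugs rather than text shown to users.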
unicodedata.mirrored
Returns 1 if the character has the "mirrored" property, 0 otherwise.
import unicodedata
is_mirrored = unicodedata.mirrored('∑')
print(is_mirrored)
Output:
1
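Mirrored characters are those whose glyph should be flipped horizontally in right-to-left text, such as brackets and some mathematical operators. As a short sketch, the following picks the mirrored characters out of a string; mirrored_chars is a hypothetical helper.

```python
import unicodedata

def mirrored_chars(text):
    """Return the characters of `text` that have the mirrored property."""
    # mirrored() returns 1 or 0, which works directly as a truth value
    return [ch for ch in text if unicodedata.mirrored(ch)]

found = mirrored_chars('f(x) < ∑a')
print(found)
```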
unicodedata.east_asian_width
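Returns the East Asian Width property assigned to the character as a string, such as W (wide), F (fullwidth), H (halfwidth), Na (narrow), A (ambiguous), or N (neutral).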
import unicodedata
east_asian_width = unicodedata.east_asian_width('か')
print(east_asian_width)
Output:
W
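The width property is often used to estimate how many terminal columns a string occupies. The sketch below assumes that wide (W) and fullwidth (F) characters take two columns and everything else takes one; it deliberately ignores combining marks and ambiguous-width characters, so treat display_width as a rough, hypothetical helper.

```python
import unicodedata

def display_width(text):
    """Estimate the number of terminal columns `text` occupies."""
    width = 0
    for ch in text:
        # Wide and fullwidth characters are assumed to take two columns
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width

print(display_width('abc'))
print(display_width('かな'))
```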
unicodedata.decomposition
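Returns the decomposition mapping of the character as a string of hexadecimal code points, optionally preceded by a tag such as <fraction>. Returns an empty string if no mapping is defined.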
import unicodedata
decomposition = unicodedata.decomposition('½')
print(decomposition)
Output:
<fraction> 0031 2044 0032
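The returned string is compact but awkward to use directly. As a sketch, it can be split into its optional tag and the characters it maps to; parse_decomposition is an assumed helper name, not a module function.

```python
import unicodedata

def parse_decomposition(char):
    """Split the decomposition mapping of `char` into (tag, characters)."""
    fields = unicodedata.decomposition(char).split()
    tag = None
    # Compatibility mappings start with a tag such as '<fraction>'
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    # The remaining fields are hexadecimal code points
    chars = ''.join(chr(int(code, 16)) for code in fields)
    return tag, chars

print(parse_decomposition('½'))
print(parse_decomposition('é'))
```

A tag of None indicates a canonical decomposition, and an empty character string indicates that the character has no decomposition at all.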
unicodedata.normalize
Returns the requested normal form of a Unicode string. The first argument selects the form: 'NFC', 'NFD', 'NFKC', or 'NFKD'.
import unicodedata
normalized_str = unicodedata.normalize('NFC', 'é')
print(normalized_str)
Output:
é
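Because normalized strings often look identical when printed, it helps to inspect code points instead. The following sketch applies all four forms to a sample string chosen for illustration: 'e' followed by a combining acute accent, then the 'fi' ligature (U+FB01).

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT, then LATIN SMALL LIGATURE FI
text = 'e\u0301\ufb01'

forms = {
    form: unicodedata.normalize(form, text)
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD')
}

for form, value in forms.items():
    # Show code points, since the rendered strings look alike
    print(form, [f'U+{ord(ch):04X}' for ch in value])
```

The C forms compose the accent into a single character, while only the K (compatibility) forms replace the ligature with plain 'f' and 'i'.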
unicodedata.unidata_version
A string attribute (not a function) holding the version of the Unicode Character Database used by this module.
import unicodedata
version = unicodedata.unidata_version
print(version)
Output (depends on the Python version; for example, on Python 3.11):
14.0.0
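Since the value is a string, comparing versions is easier after converting it to a tuple of integers. This is a small sketch that assumes the version is made of dot-separated integers; ucd_version is a hypothetical helper.

```python
import unicodedata

def ucd_version():
    """Return the Unicode database version as a tuple of integers."""
    # Assumes a dot-separated numeric string such as '14.0.0'
    return tuple(int(part) for part in unicodedata.unidata_version.split('.'))

version_tuple = ucd_version()
print(version_tuple)
```

Tuples compare element by element, so a check such as `ucd_version() >= (13, 0, 0)` behaves as expected where plain string comparison would not.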
More Examples
Using lookup
import unicodedata
char = unicodedata.lookup('LATIN SMALL LETTER A')
print(char)
Output:
a
Using name
import unicodedata
name = unicodedata.name('a')
print(name)
Output:
LATIN SMALL LETTER A
Using decimal, digit, and numeric
import unicodedata
decimal_value = unicodedata.decimal('9')
digit_value = unicodedata.digit('④')  # CIRCLED DIGIT FOUR; the Roman numeral 'Ⅳ' has no digit value
numeric_value = unicodedata.numeric('½')
print(f"Decimal: {decimal_value}, Digit: {digit_value}, Numeric: {numeric_value}")
Output:
Decimal: 9, Digit: 4, Numeric: 0.5
Using category
import unicodedata
category = unicodedata.category('A')
print(category)
Output:
Lu
Using bidirectional
import unicodedata
bidi_class = unicodedata.bidirectional('A')
print(bidi_class)
Output:
L
Using combining
import unicodedata
combining_class = unicodedata.combining('\u0301')  # Combining acute accent
print(combining_class)
Output:
230
Using mirrored
import unicodedata
is_mirrored = unicodedata.mirrored('∑')
print(is_mirrored)
Output:
1
Using east_asian_width
import unicodedata
east_asian_width = unicodedata.east_asian_width('か')
print(east_asian_width)
Output:
W
Using decomposition
import unicodedata
decomposition = unicodedata.decomposition('½')
print(decomposition)
Output:
<fraction> 0031 2044 0032
Using normalize
import unicodedata
normalized_str = unicodedata.normalize('NFC', 'e\u0301')
print(normalized_str)
Output:
é
Real-World Use Case
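Normalizing User Input
Visually identical strings can consist of different code point sequences. Normalizing user input to a single form, such as NFC, before storing or comparing it makes equality checks and lookups consistent.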
import unicodedata
def normalize_input(user_input):
    return unicodedata.normalize('NFC', user_input)
user_input = "e\u0301" # 'e' followed by combining acute accent
normalized = normalize_input(user_input)
print(normalized)
Output:
é
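Normalization pays off most when two inputs are compared. The sketch below builds a comparison key with NFKC followed by casefold, so that composed and decomposed spellings, fullwidth variants, and case differences all compare equal. The helper names canonical_key and same_text are assumptions, and this is a simplification rather than a complete implementation of Unicode caseless matching.

```python
import unicodedata

def canonical_key(text):
    """Build a key for comparing user-supplied strings."""
    # NFKC folds compatibility variants; casefold() removes case distinctions
    return unicodedata.normalize('NFKC', text).casefold()

def same_text(a, b):
    """Return True if `a` and `b` are equal after normalization."""
    return canonical_key(a) == canonical_key(b)

print(same_text('Cafe\u0301', 'caf\u00e9'))  # decomposed vs. precomposed
print(same_text('\uff21\uff22\uff23', 'abc'))  # fullwidth 'ABC' vs. ASCII
print(same_text('cafe', 'caf\u00e9'))
```

Store the original input for display and use the key only for comparisons, since NFKC discards distinctions that may matter to the user.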
Conclusion
The unicodedata module in Python provides comprehensive access to the Unicode Character Database, allowing for detailed querying and manipulation of Unicode characters. This module is essential for ensuring consistency and correctness in text processing, especially when dealing with internationalization and multilingual text.