Python unicodedata Module

In this guide, you'll explore Python's unicodedata module to work with Unicode characters. Learn its key functions and examples for handling Unicode data.

The unicodedata module in Python provides access to the Unicode Character Database, which contains detailed information about every character defined in the Unicode standard. This module can be used to retrieve properties of Unicode characters, normalize Unicode strings, and perform various other operations related to Unicode data.

Table of Contents

  1. Introduction
  2. unicodedata Module Functions
    • unicodedata.lookup
    • unicodedata.name
    • unicodedata.decimal
    • unicodedata.digit
    • unicodedata.numeric
    • unicodedata.category
    • unicodedata.bidirectional
    • unicodedata.combining
    • unicodedata.mirrored
    • unicodedata.east_asian_width
    • unicodedata.decomposition
    • unicodedata.normalize
    • unicodedata.unidata_version
  3. More Examples
    • Using lookup
    • Using name
    • Using decimal, digit, and numeric
    • Using category
    • Using bidirectional
    • Using combining
    • Using mirrored
    • Using east_asian_width
    • Using decomposition
    • Using normalize
  4. Real-World Use Case
  5. Conclusion
  6. References

Introduction

The unicodedata module is part of Python's standard library and lets you work with Unicode data. Unicode is the standard that assigns a unique code point to each character across the world's writing systems.

This module provides various functions to query properties of Unicode characters, such as their names, categories, numeric values, and more. It also includes functions to normalize Unicode strings, which is essential for consistent text processing.

unicodedata Module Functions

unicodedata.lookup

Looks up a character by name and returns the corresponding character. If no character with that name exists, raises a KeyError.

import unicodedata

char = unicodedata.lookup('GREEK SMALL LETTER ALPHA')
print(char)

Output:

α
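
Because lookup raises KeyError for unknown names, code that accepts names from outside often wraps the call. The following is a minimal sketch of that pattern; the helper name safe_lookup and its fallback parameter are hypothetical and not part of the module.

```python
import unicodedata

def safe_lookup(name, fallback="?"):
    """Return the character called `name`, or `fallback` if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() raises KeyError for names that are not in the database.
        return fallback

print(safe_lookup("GREEK SMALL LETTER ALPHA"))  # α
print(safe_lookup("NO SUCH CHARACTER NAME"))    # ?
print("\N{GREEK SMALL LETTER ALPHA}")           # α, via the string-literal escape
```

When the name is known in advance, the \N{...} escape in a string literal gives the same character without a function call.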

unicodedata.name

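Returns the name assigned to a character as a string. If no name is defined, raises a ValueError, unless a default value is passed as the second argument, in which case that default is returned.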

import unicodedata

name = unicodedata.name('α')
print(name)

Output:

GREEK SMALL LETTER ALPHA
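
Some characters, such as control characters, have no name. As a small illustration, the sketch below passes a default as the second argument so that these characters do not raise ValueError. The helper name describe and the "<unnamed>" placeholder are assumptions made for this example.

```python
import unicodedata

def describe(ch):
    """Format a character as 'U+XXXX NAME', using a placeholder if it has no name."""
    # The second argument is returned instead of raising ValueError.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

print(describe("α"))   # U+03B1 GREEK SMALL LETTER ALPHA
print(describe("\n"))  # U+000A <unnamed>
```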

unicodedata.decimal

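Returns the decimal value assigned to a character as an integer. Only characters that serve as decimal digits in some script, such as '5', have one. If no such value is defined, raises a ValueError unless a default is passed as the second argument.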

import unicodedata

decimal_value = unicodedata.decimal('5')
print(decimal_value)

Output:

5

unicodedata.digit

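Returns the digit value assigned to a character as an integer. This covers the decimal digits plus digit-like characters such as superscripts and circled digits. Roman numerals such as 'Ⅴ' are not included; they only have a numeric value. If no digit value is defined, raises a ValueError unless a default is passed as the second argument.

import unicodedata

digit_value = unicodedata.digit('⑤')  # CIRCLED DIGIT FIVE; 'Ⅴ' would raise ValueError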
print(digit_value)

Output:

5

unicodedata.numeric

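Returns the numeric value assigned to a character as a float. This is the broadest of the three functions and also covers fractions and Roman numerals. If no such value is defined, raises a ValueError unless a default is passed as the second argument.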

import unicodedata

numeric_value = unicodedata.numeric('⅕')
print(numeric_value)

Output:

0.2
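
The three functions differ in how many characters they accept: decimal is the narrowest and numeric the broadest. The sketch below queries all three for the same character and passes None as the default so that undefined values do not raise. The helper name numeric_profile is hypothetical.

```python
import unicodedata

def numeric_profile(ch):
    """Return (decimal, digit, numeric) for `ch`, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

print(numeric_profile("5"))       # (5, 5, 5.0)
print(numeric_profile("\u00b2"))  # SUPERSCRIPT TWO: (None, 2, 2.0)
print(numeric_profile("\u2155"))  # VULGAR FRACTION ONE FIFTH: (None, None, 0.2)
print(numeric_profile("a"))       # (None, None, None)
```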

unicodedata.category

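Returns the general category assigned to the character as a two-letter string. The first letter gives the major class (for example L for letter, N for number, P for punctuation) and the second letter the subclass, so 'Ll' means a lowercase letter.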

import unicodedata

category = unicodedata.category('α')
print(category)

Output:

Ll
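
Since the first letter of the category identifies the major class, it can be used to filter text. Here is a minimal sketch that removes punctuation in any script; the function name strip_punctuation is an assumption made for this example.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose general category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("Hello, world!"))  # Hello world
print(strip_punctuation("«Γειά»"))         # Γειά
```

Unlike a check against string.punctuation, this also catches non-ASCII punctuation such as the guillemets in the second call.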

unicodedata.bidirectional

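Returns the bidirectional class assigned to the character as a string, for example 'L' for left-to-right or 'R' for right-to-left. If no such value is defined, returns an empty string.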

import unicodedata

bidi_class = unicodedata.bidirectional('α')
print(bidi_class)

Output:

L
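
One way to use the bidirectional class is a rough check for right-to-left content. The sketch below treats the classes 'R' and 'AL' as right-to-left letters; the function name has_rtl is hypothetical, and the check is a simplification rather than an implementation of the Unicode bidirectional algorithm.

```python
import unicodedata

def has_rtl(text):
    """Return True if any character is a right-to-left letter (class R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)

print(has_rtl("hello"))                     # False
print(has_rtl("\u05e9\u05dc\u05d5\u05dd"))  # Hebrew letters: True
print(has_rtl("abc \u0633"))                # contains an Arabic letter: True
```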

unicodedata.combining

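Returns the canonical combining class assigned to the character as an integer. Returns 0 if no combining class is defined, which is the case for ordinary base characters.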

import unicodedata

combining_class = unicodedata.combining('\u0301')  # Combining acute accent
print(combining_class)

Output:

230
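
A common use of the combining class is removing accents: decompose the text, then drop every character with a non-zero class. This is a minimal sketch under that approach; the function name strip_accents is an assumption, and the result is lossy, so it suits search keys better than stored text.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(strip_accents("caf\u00e9 na\u00efve"))  # cafe naive
```

Letters that have no canonical decomposition, such as 'ø', pass through unchanged.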

unicodedata.mirrored

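Returns 1 if the character is identified as "mirrored" in bidirectional text, meaning its glyph is flipped horizontally when it appears in right-to-left text, and 0 otherwise.

import unicodedata

is_mirrored = unicodedata.mirrored('(')  # Parentheses are mirrored in right-to-left text
print(is_mirrored)

Output:

1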

unicodedata.east_asian_width

Returns the East Asian width assigned to the character as a string, for example 'W' for wide, 'F' for fullwidth, or 'Na' for narrow.

import unicodedata

east_asian_width = unicodedata.east_asian_width('か')
print(east_asian_width)

Output:

W
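
The width class is often used to estimate how many terminal columns a string occupies. The sketch below counts wide and fullwidth characters as two columns and everything else as one. The function name display_width is hypothetical, and the result is an approximation: it treats ambiguous-width characters as narrow and does not handle control characters or emoji sequences.

```python
import unicodedata

def display_width(text):
    """Estimate the number of terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("abc"))           # 3
print(display_width("\u304b\u306a"))  # two hiragana characters: 4
print(display_width("e\u0301"))       # 'e' plus combining accent: 1
```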

unicodedata.decomposition

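Returns the character decomposition mapping as a string of hexadecimal code points. Compatibility mappings are prefixed with a tag such as <fraction>. If no mapping is defined, returns an empty string.

import unicodedata

decomposition = unicodedata.decomposition('½')
print(decomposition)

Output:

<fraction> 0031 2044 0032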
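
The returned string is easier to work with once it is split into its optional tag and its code points. The following sketch does that; the function name parse_decomposition and the (tag, characters) return shape are assumptions made for this example.

```python
import unicodedata

def parse_decomposition(ch):
    """Split the mapping into (tag, characters); tag is None for canonical mappings."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith("<"):
        tag = fields.pop(0)  # e.g. '<fraction>' marks a compatibility mapping
    return tag, [chr(int(code, 16)) for code in fields]

tag, parts = parse_decomposition("\u00bd")  # '½'
print(tag, [f"U+{ord(p):04X}" for p in parts])

tag, parts = parse_decomposition("\u00e9")  # 'é'
print(tag, [f"U+{ord(p):04X}" for p in parts])
```

For '½' this yields the tag <fraction> with U+0031, U+2044, and U+0032. For 'é' the tag is None and the parts are U+0065 and U+0301.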

unicodedata.normalize

Returns the requested normal form of a Unicode string. The first argument names the form and must be one of 'NFC', 'NFKC', 'NFD', or 'NFKD'.

import unicodedata

normalized_str = unicodedata.normalize('NFC', 'é')
print(normalized_str)

Output:

é
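
The four forms are easiest to tell apart by looking at code points rather than rendered text. As an illustration, this sketch normalizes one string containing a precomposed 'é' and the 'ﬁ' ligature in each form. The helper name code_points is hypothetical.

```python
import unicodedata

def code_points(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

text = "\u00e9\ufb01"  # precomposed 'é' followed by the 'ﬁ' ligature

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(f"{form:5}", code_points(unicodedata.normalize(form, text)))

# NFC   U+00E9 U+FB01
# NFD   U+0065 U+0301 U+FB01
# NFKC  U+00E9 U+0066 U+0069
# NFKD  U+0065 U+0301 U+0066 U+0069
```

The canonical forms (NFC, NFD) keep the ligature, while the compatibility forms (NFKC, NFKD) replace it with plain 'f' and 'i'. The composed forms (NFC, NFKC) keep 'é' as one code point, and the decomposed forms split it into 'e' plus a combining accent.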

unicodedata.unidata_version

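A string holding the version of the Unicode Character Database used by the module. It is a module attribute, not a function, so it is read without parentheses. The value depends on the Python release; Python 3.11, for example, reports 14.0.0.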

import unicodedata

version = unicodedata.unidata_version
print(version)

Output:

14.0.0

More Examples

Using lookup

import unicodedata

char = unicodedata.lookup('LATIN SMALL LETTER A')
print(char)

Output:

a

Using name

import unicodedata

name = unicodedata.name('a')
print(name)

Output:

LATIN SMALL LETTER A

Using decimal, digit, and numeric

import unicodedata

decimal_value = unicodedata.decimal('9')
digit_value = unicodedata.digit('④')  # CIRCLED DIGIT FOUR; Roman numerals have no digit value
numeric_value = unicodedata.numeric('½')

print(f"Decimal: {decimal_value}, Digit: {digit_value}, Numeric: {numeric_value}")

Output:

Decimal: 9, Digit: 4, Numeric: 0.5

Using category

import unicodedata

category = unicodedata.category('A')
print(category)

Output:

Lu

Using bidirectional

import unicodedata

bidi_class = unicodedata.bidirectional('A')
print(bidi_class)

Output:

L

Using combining

import unicodedata

combining_class = unicodedata.combining('\u0301')  # Combining acute accent
print(combining_class)

Output:

230

Using mirrored

import unicodedata

is_mirrored = unicodedata.mirrored('A')  # Letters are not mirrored
print(is_mirrored)

Output:

0

Using east_asian_width

import unicodedata

east_asian_width = unicodedata.east_asian_width('か')
print(east_asian_width)

Output:

W

Using decomposition

import unicodedata

decomposition = unicodedata.decomposition('½')
print(decomposition)

Output:

<fraction> 0031 2044 0032

Using normalize

import unicodedata

normalized_str = unicodedata.normalize('NFC', 'e\u0301')
print(normalized_str)

Output:

é

Real-World Use Case

Normalizing User Input

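The same visible text can arrive as different code point sequences, for example a precomposed 'é' (U+00E9) or an 'e' followed by a combining acute accent (U+0301). Normalizing user input to a single form, such as NFC, before storing or comparing it ensures that both variants are treated as the same text.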

import unicodedata

def normalize_input(user_input):
    return unicodedata.normalize('NFC', user_input)

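user_input = "e\u0301"  # 'e' followed by combining acute accent
normalized = normalize_input(user_input)
# The text looks the same, but two code points have become one
print(normalized, len(user_input), len(normalized))

Output:

é 2 1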
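
Normalization is often combined with case folding when user input has to be compared without regard to case. The sketch below follows the approach described in the Python Unicode HOWTO: normalize, case-fold, then normalize again. The function names comparison_key and same_text are hypothetical.

```python
import unicodedata

def comparison_key(text):
    """Build a key for caseless, normalization-insensitive comparison."""
    nfd = unicodedata.normalize("NFD", text)
    # Case folding can produce unnormalized text, so normalize again afterwards.
    return unicodedata.normalize("NFD", nfd.casefold())

def same_text(a, b):
    """Return True if `a` and `b` are equal ignoring case and composition."""
    return comparison_key(a) == comparison_key(b)

print(same_text("Caf\u00e9", "CAFE\u0301"))  # True
print(same_text("Stra\u00dfe", "STRASSE"))   # True
print(same_text("cafe", "caf\u00e9"))        # False
```

Accents still matter in this comparison; only case and the composed or decomposed representation are ignored.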

Conclusion

The unicodedata module in Python provides comprehensive access to the Unicode Character Database, allowing for detailed querying and manipulation of Unicode characters. This module is essential for ensuring consistency and correctness in text processing, especially when dealing with internationalization and multilingual text.

References
