Python html Module

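The html module in the Python standard library provides two functions, html.escape and html.unescape, for converting between HTML special characters and their character references. Its submodule html.parser provides HTMLParser, an event-driven base class for parsing HTML documents. Together they cover common needs in web scraping, web development, and any application that processes HTML content.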

Table of Contents

  1. Introduction
  2. Key Functions and Classes
    • html.escape
    • html.unescape
    • html.parser.HTMLParser
  3. Examples
    • Escaping HTML Characters
    • Unescaping HTML Characters
    • Basic HTML Parsing
  4. Real-World Use Case
  5. Conclusion
  6. References

Introduction

The top-level html module defines html.escape and html.unescape. Parsing lives in the html.parser submodule, whose HTMLParser class calls handler methods as it encounters start tags, end tags, and text. Everything shown here ships with the standard library, so nothing needs to be installed.

Key Functions and Classes

html.escape

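Converts the characters &, <, and > in a string to HTML-safe sequences. With the default quote=True, the double quote (") and single quote (') are converted as well, which makes the result safe to place inside a quoted attribute value.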

import html

escaped_string = html.escape('<div class="content">Hello, World!</div>')
print(escaped_string)

Output:

&lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;
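The effect of the quote parameter is easiest to see side by side. The following is a small sketch; the sample string and variable names are made up for illustration.

```python
import html

# Illustrative sample containing all five characters html.escape can touch.
text = 'Tom & "Jerry" <it\'s>'

# Default (quote=True): &, <, >, " and ' are all escaped.
escaped_all = html.escape(text)

# quote=False: only &, < and > are escaped; quotes are left alone.
# This is enough for text placed between tags, but not for attribute values.
escaped_body = html.escape(text, quote=False)

print(escaped_all)   # Tom &amp; &quot;Jerry&quot; &lt;it&#x27;s&gt;
print(escaped_body)  # Tom &amp; "Jerry" &lt;it's&gt;
```

When in doubt, keeping the default quote=True is the safer choice, since the output then works both in element content and in quoted attributes.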

html.unescape

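Converts named and numeric character references in a string (for example &gt;, &#62;, and &#x3e;) to the corresponding Unicode characters.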

import html

unescaped_string = html.unescape('&lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;')
print(unescaped_string)

Output:

<div class="content">Hello, World!</div>
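Beyond the tag example, html.unescape also handles decimal and hexadecimal references, and it makes a single pass over the input. A short sketch with made-up sample strings:

```python
import html

# Named, decimal, and hexadecimal references for the same character.
symbols = html.unescape('&copy; &#169; &#xA9;')
print(symbols)  # © © ©

# Unescaping happens once: a doubly escaped reference is only unwrapped one level.
once = html.unescape('&amp;lt;')
print(once)  # &lt;

# For ordinary text, unescape reverses escape.
original = 'if a < b && c > "d":'
round_trip = html.unescape(html.escape(original))
print(round_trip == original)  # True
```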

html.parser.HTMLParser

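A base class for parsing HTML documents. You subclass it and override handler methods such as handle_starttag, handle_endtag, and handle_data; the parser calls them as it works through the text passed to feed().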

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

parser = MyHTMLParser()
parser.feed('<div class="content">Hello, World!</div>')

Output:

Start tag: div
     attr: ('class', 'content')
Data     : Hello, World!
End tag  : div
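Handlers do not have to print. A common pattern is to collect state on the parser instance and read it afterwards. The sketch below gathers only the text content of a fragment; the class name TextExtractor and its get_text method are hypothetical helpers, not part of the standard library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper that collects the text between tags."""

    def __init__(self):
        # convert_charrefs defaults to True, so references such as &amp;
        # reach handle_data already converted to characters.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)

extractor = TextExtractor()
extractor.feed('<p>Tom &amp; Jerry <b>forever</b></p>')
extractor.close()  # tells the parser that no more input is coming
extracted_text = extractor.get_text()
print(extracted_text)  # Tom & Jerry forever
```

Joining the chunks at the end avoids depending on how the parser splits text across handle_data calls.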

Examples

Escaping HTML Characters

import html

html_string = '<div class="content">Hello, World!</div>'
escaped_string = html.escape(html_string)
print('Escaped:', escaped_string)

Output:

Escaped: &lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;
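A typical reason to escape is to place untrusted text into a page without letting it be interpreted as markup. The following is a minimal sketch; render_comment is a hypothetical helper, and a real application would normally rely on a template engine that escapes automatically.

```python
import html

def render_comment(author, body):
    """Hypothetical helper: wrap user-supplied text in fixed markup."""
    # Escape every user-supplied value before it is mixed into the HTML.
    return '<p class="comment"><b>{}</b>: {}</p>'.format(
        html.escape(author), html.escape(body)
    )

rendered = render_comment('Eve', '<script>alert("x")</script>')
print(rendered)
# <p class="comment"><b>Eve</b>: &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>
```

The script tag arrives in the browser as visible text rather than as executable markup.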

Unescaping HTML Characters

import html

escaped_string = '&lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;'
unescaped_string = html.unescape(escaped_string)
print('Unescaped:', unescaped_string)

Output:

Unescaped: <div class="content">Hello, World!</div>

Basic HTML Parsing

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

parser = MyHTMLParser()
parser.feed('<div class="content">Hello, World!</div>')

Output:

Start tag: div
     attr: ('class', 'content')
Data     : Hello, World!
End tag  : div

Real-World Use Case

Extracting Links from HTML

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    self.links.append(attr[1])

html_content = '''
<html>
<body>
    <a href="http://example.com">Example</a>
    <a href="http://example.org">Example Org</a>
</body>
</html>
'''

parser = LinkExtractor()
parser.feed(html_content)
print('Extracted links:', parser.links)

Output:

Extracted links: ['http://example.com', 'http://example.org']
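Real pages often contain relative links and anchors without an href. One possible extension, sketched here under assumptions, resolves each link against a base URL with urllib.parse.urljoin and skips anchors that have no usable href. The class name AbsoluteLinkExtractor, the base_url parameter, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AbsoluteLinkExtractor(HTMLParser):
    """Hypothetical variant of LinkExtractor that returns absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        href = dict(attrs).get('href')
        if href:  # skip anchors with a missing or empty href
            self.links.append(urljoin(self.base_url, href))

page = '''
<a href="page.html">Relative</a>
<a href="/about">Root-relative</a>
<a href="http://example.org">Absolute</a>
<a name="anchor">No href</a>
'''

link_parser = AbsoluteLinkExtractor('http://example.com/docs/')
link_parser.feed(page)
link_parser.close()
print(link_parser.links)
# ['http://example.com/docs/page.html', 'http://example.com/about', 'http://example.org']
```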

Conclusion

The html module covers the basics of working with HTML in Python: html.escape and html.unescape convert between special characters and character references, and html.parser.HTMLParser lets you react to tags and text as a document is parsed. These are enough for many web scraping, web development, and data processing tasks without any third-party dependency.

References
