The html module in Python provides tools for handling HTML data: functions for escaping and unescaping HTML special characters and, through the html.parser submodule, a base class for parsing HTML documents. It is useful for web scraping, web development, and any application that needs to process HTML content.
Table of Contents
- Introduction
- Key Functions and Classes
  - html.escape
  - html.unescape
  - html.parser.HTMLParser
- Examples
- Escaping HTML Characters
- Unescaping HTML Characters
- Basic HTML Parsing
- Real-World Use Case
- Conclusion
Introduction
The html module exposes two functions, html.escape and html.unescape, which convert between HTML special characters and their character references. Its submodule html.parser provides HTMLParser, a base class that you subclass to handle tags and text as the parser encounters them.
Key Functions and Classes
html.escape
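Converts the characters &, <, and > in a string to HTML-safe sequences. With the optional quote argument left at its default of True, the characters " and ' are converted as well.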
import html
escaped_string = html.escape('<div class="content">Hello, World!</div>')
print(escaped_string)
Output:
&lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;
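As a small sketch of the optional quote argument: with quote=False, html.escape leaves quotation marks alone, which is enough for text placed between tags but not for text placed inside an attribute value. The sample string here is made up for illustration.

```python
import html

text = 'Tom & "Jerry" <3'

# Default (quote=True): &, <, >, " and ' are all escaped.
escaped_all = html.escape(text)

# quote=False: only &, < and > are escaped; quotes are left as they are.
escaped_basic = html.escape(text, quote=False)

print(escaped_all)
print(escaped_basic)
```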
html.unescape
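Converts all named and numeric character references in a string (for example &gt;, &#62;, and &#x3e;) to the corresponding Unicode characters.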
import html
unescaped_string = html.unescape('&lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;')
print(unescaped_string)
Output:
<div class="content">Hello, World!</div>
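The following sketch illustrates two properties of html.unescape: named, decimal, and hexadecimal references for the same character all decode identically, and unescaping the result of html.escape gives back the original string. The variable names and the sample markup are chosen only for this example.

```python
import html

# Named, decimal and hexadecimal references decode to the same character.
named = html.unescape('&gt;')
decimal = html.unescape('&#62;')
hexadecimal = html.unescape('&#x3e;')

# Escaping and then unescaping returns the original string.
original = '<a href="x">Q&A</a>'
round_trip = html.unescape(html.escape(original))

print(named, decimal, hexadecimal)
print(round_trip == original)
```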
html.parser.HTMLParser
A base class for parsing HTML documents.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

parser = MyHTMLParser()
parser.feed('<div class="content">Hello, World!</div>')
Output:
Start tag: div
 attr: ('class', 'content')
Data : Hello, World!
End tag : div
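The handlers do not have to print. A subclass can record events instead, and feed can be called several times as data arrives, since the parser buffers an incomplete tag until the rest is supplied. The sketch below assumes a hypothetical EventCollector class that is not part of the standard library, and splits the input in the middle of a tag to show the buffering.

```python
from html.parser import HTMLParser

class EventCollector(HTMLParser):
    """Hypothetical helper: records parser events in a list instead of printing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

collector = EventCollector()
# The document is fed in two pieces; the first ends in the middle of a tag.
collector.feed('<div cla')
collector.feed('ss="content">Hello, World!</div>')
collector.close()  # flush anything still buffered

print(collector.events)
```

Collecting events in a list makes the parser easier to test than one that prints directly.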
Examples
Escaping HTML Characters
import html
html_string = '<div class="content">Hello, World!</div>'
escaped_string = html.escape(html_string)
print('Escaped:', escaped_string)
Output:
Escaped: &lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;
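A common reason to escape is to place untrusted text into markup so that it is displayed rather than interpreted. The sketch below assumes a hypothetical render_comment helper and made-up input. It escapes each value before interpolating it, relying on the default quote=True so the value is also safe inside a double-quoted attribute.

```python
import html

def render_comment(author, comment):
    """Hypothetical helper: build an HTML snippet from untrusted strings."""
    safe_author = html.escape(author)    # used inside an attribute value
    safe_comment = html.escape(comment)  # used as element text
    return f'<p title="{safe_author}">{safe_comment}</p>'

snippet = render_comment('Eve "the hacker"', '<script>alert(1)</script>')
print(snippet)
```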
Unescaping HTML Characters
import html
escaped_string = '&lt;div class=&quot;content&quot;&gt;Hello, World!&lt;/div&gt;'
unescaped_string = html.unescape(escaped_string)
print('Unescaped:', unescaped_string)
Output:
Unescaped: <div class="content">Hello, World!</div>
Basic HTML Parsing
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

parser = MyHTMLParser()
parser.feed('<div class="content">Hello, World!</div>')
Output:
Start tag: div
 attr: ('class', 'content')
Data : Hello, World!
End tag : div
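Overriding only handle_data gives a simple way to pull the visible text out of markup. The sketch below assumes a hypothetical TextExtractor class and a made-up input string. It relies on the parser's default convert_charrefs=True, under which character references in text are decoded before handle_data is called.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:  # skip whitespace-only runs between tags
            self.chunks.append(stripped)

extractor = TextExtractor()
extractor.feed('<div><h1>Title</h1><p>Fish &amp; chips</p></div>')
extractor.close()

text = ' '.join(extractor.chunks)
print(text)
```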
Real-World Use Case
Extracting Links from HTML
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    self.links.append(attr[1])

html_content = '''
<html>
<body>
<a href="http://example.com">Example</a>
<a href="http://example.org">Example Org</a>
</body>
</html>
'''

parser = LinkExtractor()
parser.feed(html_content)
print('Extracted links:', parser.links)
Output:
Extracted links: ['http://example.com', 'http://example.org']
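Pages often contain relative links, which are only useful once they are resolved against the address of the page they came from. As a sketch of one way to extend the extractor, the hypothetical AbsoluteLinkExtractor below takes a base URL (an assumption for this example) and resolves each href with urllib.parse.urljoin. It also skips anchors that have no href value.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AbsoluteLinkExtractor(HTMLParser):
    """Hypothetical variant of the link extractor that resolves relative URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes without a value, e.g. <a href>
            if name == 'href' and value:
                self.links.append(urljoin(self.base_url, value))

page = (
    '<a href="/about">About</a> '
    '<a href="guide.html">Guide</a> '
    '<a href="http://example.org/x">External</a> '
    '<a>No href</a>'
)

extractor = AbsoluteLinkExtractor('http://example.com/docs/index.html')
extractor.feed(page)
extractor.close()
print(extractor.links)
```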
Conclusion
The html module covers the basic HTML handling tasks in Python's standard library. html.escape and html.unescape convert between special characters and character references, and html.parser.HTMLParser provides an event-driven base class for parsing documents, which makes the module a practical starting point for web scraping, web development, and other data processing tasks involving HTML content.