Python difflib Module - A Complete Guide

In this guide, you'll explore the difflib module in Python, which helps compare and highlight differences between sequences. We’ll cover its key functions, classes, use cases, and examples to help you use it efficiently.

The difflib module in Python provides classes and functions for comparing sequences, such as strings or lists, and generating differences (diffs) between them. This module is useful for tasks like comparing text files, computing deltas, and producing human-readable differences.

Table of Contents

  1. Introduction
  2. SequenceMatcher Class
    • Methods
      • __init__
      • set_seq1
      • set_seq2
      • set_seqs
      • find_longest_match
      • get_matching_blocks
      • get_opcodes
      • ratio
      • quick_ratio
      • real_quick_ratio
  3. Differ Class
    • Methods
      • compare
  4. HtmlDiff Class
    • Methods
      • make_file
      • make_table
  5. Utility Functions
    • context_diff
    • unified_diff
    • ndiff
    • restore
    • IS_CHARACTER_JUNK
    • IS_LINE_JUNK
  6. Examples
    • Using SequenceMatcher
    • Using Differ
    • Using HtmlDiff
    • Using Utility Functions
  7. Real-World Use Case
  8. Conclusion

Introduction

The difflib module provides classes and functions to compare sequences, find differences, and produce human-readable diff output. This is particularly useful for comparing text files, generating patch-style diffs, or building the change-tracking part of a tool such as a version control system.

SequenceMatcher Class

The SequenceMatcher class compares pairs of sequences of any type, as long as the sequence elements are hashable, and generates information about how they differ.

Methods

__init__

Initializes a SequenceMatcher object.

import difflib

s = difflib.SequenceMatcher(isjunk=None, a='', b='')
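  • isjunk: A one-argument function that takes a sequence element and returns True if the element is junk and should be ignored. None (the default) means no element is treated as junk.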
  • a: The first sequence to compare.
  • b: The second sequence to compare.

set_seq1

Sets the first sequence to be compared.

s.set_seq1('new_sequence')

set_seq2

Sets the second sequence to be compared.

s.set_seq2('new_sequence')

set_seqs

Sets both sequences to be compared.

s.set_seqs('sequence1', 'sequence2')
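
SequenceMatcher caches detailed information about the second sequence, so when you compare one sequence against many others it is usually cheaper to call set_seq2 once and set_seq1 repeatedly. A minimal sketch of that pattern, with made-up sample words:

```python
import difflib

target = "apple"
candidates = ["orange", "apples", "grape"]

matcher = difflib.SequenceMatcher()
matcher.set_seq2(target)  # analyzed once, then reused for every comparison

scores = {}
for word in candidates:
    matcher.set_seq1(word)  # only the first sequence changes
    scores[word] = matcher.ratio()

# Pick the candidate with the highest similarity to the target.
best = max(scores, key=scores.get)
print(best)
```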

find_longest_match

Finds the longest contiguous matching block within the given ranges of the two sequences, a[alo:ahi] and b[blo:bhi].

match = s.find_longest_match(0, len(s.a), 0, len(s.b))
print(match)  # Match named tuple with fields (a, b, size)
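
To see what the result looks like, and how an isjunk function changes it, here is a small sketch using the sample strings from the standard library documentation. In the second matcher, blanks are treated as junk, so the match can no longer start on the leading space.

```python
import difflib

a, b = " abcd", "abcd abcd"

# No junk: the leading blank takes part in the match.
plain = difflib.SequenceMatcher(None, a, b)
print(plain.find_longest_match(0, len(a), 0, len(b)))
# Match(a=0, b=4, size=5)

# Blanks are junk: the longest match is now "abcd" at the start of b.
no_blanks = difflib.SequenceMatcher(lambda ch: ch == " ", a, b)
print(no_blanks.find_longest_match(0, len(a), 0, len(b)))
# Match(a=1, b=0, size=4)
```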

get_matching_blocks

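Returns a list of Match triples (a, b, size) describing non-overlapping matching subsequences. The last triple is a dummy entry, (len(a), len(b), 0), and is the only one with size 0.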

matches = s.get_matching_blocks()
print(matches)
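
A short sketch of how the triples can be read, using sample strings from the standard library documentation. Each triple (i, j, n) means a[i:i+n] == b[j:j+n].

```python
import difflib

a, b = "abxcd", "abcd"
blocks = difflib.SequenceMatcher(None, a, b).get_matching_blocks()

for i, j, n in blocks:
    # The final dummy block has n == 0 and prints an empty string.
    print(f"a[{i}:{i + n}] == b[{j}:{j + n}] == {a[i:i + n]!r}")
```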

get_opcodes

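Returns a list of 5-tuples (tag, i1, i2, j1, j2) describing how to turn the first sequence into the second. The tag is one of 'replace', 'delete', 'insert', or 'equal', and it applies to a[i1:i2] and b[j1:j2].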

opcodes = s.get_opcodes()
print(opcodes)
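
One way to check your understanding of the opcodes is to apply them by hand. The sketch below walks the opcode list and rebuilds the second string from the first; the sample strings are only illustrative.

```python
import difflib

a, b = "qabxcd", "abycdf"
matcher = difflib.SequenceMatcher(None, a, b)

rebuilt = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:7} a[{i1}:{i2}] -> b[{j1}:{j2}]")
    if tag == "equal":
        rebuilt.append(a[i1:i2])      # keep the unchanged part of a
    elif tag in ("replace", "insert"):
        rebuilt.append(b[j1:j2])      # take the new content from b
    # "delete": a[i1:i2] is simply dropped

result = "".join(rebuilt)
print(result == b)
```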

ratio

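Returns a measure of the sequences' similarity as a float in the range [0, 1]. The value is 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches, so 1.0 means the sequences are identical and 0.0 means they have nothing in common.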

similarity = s.ratio()
print(similarity)

quick_ratio

Returns an upper bound on ratio() relatively quickly.

quick_ratio = s.quick_ratio()
print(quick_ratio)

real_quick_ratio

Returns an upper bound on ratio() very quickly.

real_quick_ratio = s.real_quick_ratio()
print(real_quick_ratio)

Differ Class

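The Differ class compares sequences of lines of text and produces human-readable deltas. Each output line starts with a two-character code: '- ' for a line unique to the first sequence, '+ ' for a line unique to the second, '  ' for a line common to both, and '? ' for a hint line that is not present in either input.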

Methods

compare

Compares two sequences of lines, generating human-readable differences.

import difflib

d = difflib.Differ()
diff = d.compare('one\ntwo\nthree\n'.splitlines(), 'ore\ntwo\nthree\n'.splitlines())
print('\n'.join(diff))
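
The '? ' hint lines only show up when a removed line and an added line are similar enough for Differ to pair them up. A minimal sketch with made-up lines that differ by a single word, which should be close enough to trigger the hints:

```python
import difflib
import sys

before = ["the quick brown fox\n", "jumps over\n"]
after = ["the quick brown dog\n", "jumps over\n"]

# compare() returns a generator; list() lets us reuse the result.
delta = list(difflib.Differ().compare(before, after))
sys.stdout.writelines(delta)

# The two-character prefix tells you where each line came from.
changed = [line for line in delta if line[:2] in ("- ", "+ ")]
hints = [line for line in delta if line.startswith("? ")]
print(len(changed), "changed lines,", len(hints), "hint lines")
```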

HtmlDiff Class

The HtmlDiff class generates an HTML side-by-side comparison of two texts, with the changes highlighted.

Methods

make_file

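Returns a string containing a complete HTML document that shows the differences between two sequences of lines side by side. It does not write anything to disk; saving the result is up to the caller.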

import difflib

hd = difflib.HtmlDiff()
html = hd.make_file('one\ntwo\nthree\n'.splitlines(), 'ore\ntwo\nthree\n'.splitlines(), context=True, numlines=1)
print(html)

make_table

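Returns a string containing only the HTML table with the differences between two sequences of lines, which is useful for embedding the comparison in an existing page.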

html_table = hd.make_table('one\ntwo\nthree\n'.splitlines(), 'ore\ntwo\nthree\n'.splitlines(), context=True, numlines=1)
print(html_table)
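
Since make_file only returns a string, a typical next step is to save it. The sketch below labels the two columns with fromdesc and todesc and writes the report to a temporary directory; the file name and descriptions are assumptions chosen for this example.

```python
import difflib
import tempfile
from pathlib import Path

old = ["one", "two", "three"]
new = ["ore", "two", "three"]

# wrapcolumn wraps long lines so the side-by-side table stays readable.
html = difflib.HtmlDiff(wrapcolumn=60).make_file(
    old, new, fromdesc="old version", todesc="new version"
)

# Hypothetical output location for this example.
out_path = Path(tempfile.gettempdir()) / "difflib_report.html"
out_path.write_text(html, encoding="utf-8")
print("Report written to", out_path)
```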

Utility Functions

context_diff

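Returns a generator of delta lines in context diff format, which shows the changed lines together with a few lines of surrounding context (three by default, controlled by the n parameter).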

import difflib

diff = difflib.context_diff('one\ntwo\nthree\n'.splitlines(), 'ore\ntwo\nthree\n'.splitlines(), lineterm='')
print('\n'.join(diff))

unified_diff

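Returns a generator of delta lines in unified diff format, a compact format that shows changes inline with a few lines of surrounding context (three by default, controlled by the n parameter).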

import difflib

diff = difflib.unified_diff('one\ntwo\nthree\n'.splitlines(), 'ore\ntwo\nthree\n'.splitlines(), lineterm='')
print('\n'.join(diff))
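
Both context_diff and unified_diff accept fromfile and tofile arguments to fill in the header lines. A small sketch with hypothetical file names (old.txt and new.txt are labels only; no files are read):

```python
import difflib

old = ["one", "two", "three"]
new = ["ore", "two", "three"]

# lineterm="" because the input lines have no trailing newlines.
diff = list(
    difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt", lineterm="")
)
print("\n".join(diff))
# --- old.txt
# +++ new.txt
# @@ -1,3 +1,3 @@
# -one
# +ore
#  two
#  three
```

Note that unified diff lines use a single-character prefix ('-', '+', or a space), unlike the two-character codes produced by Differ and ndiff.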

ndiff

Compares two sequences of lines and returns a Differ-style delta as a generator of lines.

import difflib

diff = difflib.ndiff('one\ntwo\nthree\n'.splitlines(), 'ore\ntwo\nthree\n'.splitlines())
print('\n'.join(diff))

restore

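Given a delta produced by ndiff() or Differ.compare(), extracts the lines that came from the first sequence (which=1) or the second sequence (which=2).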

import difflib

delta = list(difflib.ndiff('one\ntwo\nthree\n'.splitlines(), 'ore\ntwo\nthree\n'.splitlines()))
restored = difflib.restore(delta, 1)
print('\n'.join(restored))
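
A quick sketch showing that a single delta carries enough information to recover either input:

```python
import difflib

old = ["one", "two", "three"]
new = ["ore", "two", "three"]

# restore() consumes the delta, so keep it in a list to use it twice.
delta = list(difflib.ndiff(old, new))

first = list(difflib.restore(delta, 1))   # lines from the first sequence
second = list(difflib.restore(delta, 2))  # lines from the second sequence

print(first)
print(second)
```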

IS_CHARACTER_JUNK

Returns True if the character is a space or a tab, False otherwise. Note that a newline is not considered junk.

import difflib

print(difflib.IS_CHARACTER_JUNK(' '))

IS_LINE_JUNK

Returns True for lines that are blank or that contain only a single '#' (optionally surrounded by whitespace), False otherwise.

import difflib

print(difflib.IS_LINE_JUNK('   '))
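
A few sample inputs make the rules for both junk predicates easier to see:

```python
import difflib

# Lines: blank lines and lines holding a lone '#' count as junk.
print(difflib.IS_LINE_JUNK("\n"))        # True
print(difflib.IS_LINE_JUNK("  #   \n"))  # True
print(difflib.IS_LINE_JUNK("hello\n"))   # False

# Characters: only space and tab count as junk.
print(difflib.IS_CHARACTER_JUNK(" "))    # True
print(difflib.IS_CHARACTER_JUNK("\t"))   # True
print(difflib.IS_CHARACTER_JUNK("\n"))   # False
print(difflib.IS_CHARACTER_JUNK("x"))    # False
```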

Examples

Using SequenceMatcher

import difflib

s = difflib.SequenceMatcher(None, "abcdef", "abcfgh")
print("Similarity ratio:", s.ratio())

Output:

Similarity ratio: 0.6666666666666666

Using Differ

import difflib

d = difflib.Differ()
diff = d.compare("one\ntwo\nthree\n".splitlines(), "ore\ntwo\nthree\n".splitlines())
print('\n'.join(diff))

Output:

- one
+ ore
  two
  three

Using HtmlDiff

import difflib

hd = difflib.HtmlDiff()
html = hd.make_file("one\ntwo\nthree\n".splitlines(), "ore\ntwo\nthree\n".splitlines())
print(html)

Using Utility Functions

context_diff

import difflib

diff = difflib.context_diff("one\ntwo\nthree\n".splitlines(), "ore\ntwo\nthree\n".splitlines(), lineterm='')
print('\n'.join(diff))

Output:

*** 
--- 
***************
*** 1,3 ****
! one
  two
  three
--- 1,3 ----
! ore
  two
  three

unified_diff

import difflib

diff = difflib.unified_diff("one\ntwo\nthree\n".splitlines(), "ore\ntwo\nthree\n".splitlines(), lineterm='')
print('\n'.join(diff))

Output:

--- 
+++ 
@@ -1,3 +1,3 @@
-one
+ore
 two
 three

ndiff

import difflib

diff = difflib.ndiff("one\ntwo\nthree\n".splitlines(), "ore\ntwo\nthree\n".splitlines())
print('\n'.join(diff))

Output:

- one
+ ore
  two
  three

restore

import difflib

delta = list(difflib.ndiff("one\ntwo\nthree\n".splitlines(), "ore\ntwo\nthree\n".splitlines()))
restored = difflib.restore(delta, 1)
print('\n'.join(restored))

Output:

one
two
three

Real-World Use Case

Generating Diffs for Version Control

A simple version control tool needs to show what changed between two versions of a text. The function below uses unified_diff to produce that summary as a single string.

import difflib 

def generate_diff(old_text, new_text):
    diff = difflib.unified_diff(old_text.splitlines(), new_text.splitlines(), lineterm='')
    return '\n'.join(diff)

old_version = "one\ntwo\nthree\n"
new_version = "ore\ntwo\nthree\n"

diff = generate_diff(old_version, new_version)
print(diff)

Output:

--- 
+++ 
@@ -1,3 +1,3 @@
-one
+ore
 two
 three
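
In practice the two versions usually live in files. The sketch below is one possible way to extend the idea: diff_files is a hypothetical helper, and the temporary v1.txt and v2.txt files exist only to make the example self-contained.

```python
import difflib
import tempfile
from pathlib import Path


def diff_files(old_path, new_path):
    """Hypothetical helper: unified diff of two text files as one string."""
    # keepends=True keeps the newlines, so the default lineterm works as is.
    old_lines = Path(old_path).read_text(encoding="utf-8").splitlines(keepends=True)
    new_lines = Path(new_path).read_text(encoding="utf-8").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=str(old_path), tofile=str(new_path)
        )
    )


with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "v1.txt"
    new_file = Path(tmp) / "v2.txt"
    old_file.write_text("one\ntwo\nthree\n", encoding="utf-8")
    new_file.write_text("ore\ntwo\nthree\n", encoding="utf-8")
    patch = diff_files(old_file, new_file)

print(patch)
```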

Conclusion

The difflib module in Python provides built-in classes and functions for comparing sequences and generating human-readable differences. Whether you need to compare text files, implement a simple version control system, or generate HTML diffs, the difflib module has you covered.
