[Python] Using difflib Module for String Comparison

In [Python] Comparing Two Strings : Understanding Equality, Inequality, and Relational Operations, we explored various ways to compare strings. This post goes a step further: it examines longer pieces of text and pinpoints exactly where they differ. It walks through Python's difflib module, which can both score how similar two texts are and show their differences line by line.

Utilizing SequenceMatcher to Compare Strings

The SequenceMatcher class from the difflib module compares two sequences, such as two strings, and reports how similar they are as a ratio between 0 and 1.

python
from difflib import SequenceMatcher

string1 = "The sun rises in the east\nA stitch in time saves nine\nAll that glitters is not gold"
string2 = "The sun sets in the west\nA stitch in time saves nine\nAll that shines is not gold\n"

s = SequenceMatcher(None, string1, string2)
similarity_ratio = s.ratio()

print(similarity_ratio)  # Output: 0.9146341463414634

This code snippet compares two strings and returns a similarity ratio: a float between 0 (nothing in common) and 1 (identical). A value close to 1, as here, tells you the texts are largely the same, but the ratio alone does not show where they differ.
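To make the ratio less of a black box, here is a small sketch that recomputes it by hand. The difflib documentation describes the ratio as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. The helper name manual_ratio and the short sample strings are made up for this illustration.

```python
from difflib import SequenceMatcher


def manual_ratio(a: str, b: str) -> float:
    """Recompute the similarity ratio as 2 * M / T (illustrative helper)."""
    matcher = SequenceMatcher(None, a, b)
    # Each matching block is (a, b, size); the last block is a zero-size sentinel.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty sequences are treated as identical.
    return 2.0 * matched / total if total else 1.0


matcher = SequenceMatcher(None, "abcd", "bcde")
print(matcher.ratio())               # 0.75, since "bcd" matches: 2 * 3 / 8
print(manual_ratio("abcd", "bcde"))  # 0.75
```

Seeing the formula spelled out also explains why a long shared passage keeps the ratio high even when a few words change.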

Leveraging Differ to Compare Text Line by Line

The Differ class in the difflib module allows us to compare larger text documents line by line.

python
from difflib import Differ

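# splitlines() avoids the empty trailing element that split('\n') would
# produce for string2, which ends with a newline
text1 = string1.splitlines()
text2 = string2.splitlines()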

d = Differ()
diff = d.compare(text1, text2)

print('\n'.join(diff))

Output

- The sun rises in the east
?         --            -

+ The sun sets in the west
?           +         +

  A stitch in time saves nine

- All that glitters is not gold
?          ^^ ^^ -

+ All that shines is not gold
?          ^^ ^

In the given code snippet, the Differ class from the difflib module is used to compare two sequences of strings, text1 and text2. The output of the code highlights the differences between these two sequences, using specific symbols to indicate the nature of each difference:

  • Lines beginning with - indicate elements that are present in text1 but not in text2.
  • Lines starting with + represent elements that are found in text2 but not in text1.
  • Lines beginning with two spaces, and no symbol, indicate elements that are identical in both sequences.
  • Lines beginning with ? appear in neither sequence. They are hints that mark where the line above changed: - under deleted characters, + under added characters, and ^ under replaced characters.

Let's break down the provided output:

  1. The sun rises/sets line: The - line indicates that "The sun rises in the east" is in text1, but not in text2. The + line shows that "The sun sets in the west" is in text2, but not in text1. The ? line pinpoints the exact characters that have changed.
  2. The identical line: "A stitch in time saves nine" is the same in both text1 and text2, and so it doesn't have any symbol at the beginning.
  3. The glitters/shines line: Similar to the first line, the -, +, and ? lines show that "All that glitters is not gold" in text1 has been replaced with "All that shines is not gold" in text2, and they highlight the specific changes.
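Because every line that Differ produces starts with a two-character code, the output is easy to process in code as well as to read. The following is a minimal sketch that sorts the result into removed, added, and unchanged lines, skipping the ? hint lines. The variable names removed, added, and unchanged are chosen for this example only.

```python
from difflib import Differ

text1 = [
    "The sun rises in the east",
    "A stitch in time saves nine",
    "All that glitters is not gold",
]
text2 = [
    "The sun sets in the west",
    "A stitch in time saves nine",
    "All that shines is not gold",
]

removed, added, unchanged = [], [], []
for line in Differ().compare(text1, text2):
    # The first two characters are the code, the rest is the original line.
    code, content = line[:2], line[2:]
    if code == "- ":
        removed.append(content)
    elif code == "+ ":
        added.append(content)
    elif code == "  ":
        unchanged.append(content)
    # "? " hint lines are ignored here; they exist in neither input.

print("Removed:", removed)
print("Added:", added)
print("Unchanged:", unchanged)
```

This kind of grouping can be handy when you only want a summary of what changed rather than the full annotated diff.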

FAQs

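  1. Can I use difflib to compare binary files? Not in the usual sense. The difflib module is designed for comparing sequences, typically lines of text, and its output formats assume text. SequenceMatcher accepts any sequence of hashable elements, including bytes objects, but it is not a dedicated binary diff tool.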
  2. Is difflib available in all versions of Python? The difflib module is available in the Python standard library from version 2.1 onwards.
  3. How can I use difflib for comparing HTML files? You can use the HtmlDiff class in the difflib module to generate HTML side-by-side comparison tables.
  4. Can I customize the output of the comparison? Yes. Differ accepts optional linejunk and charjunk filter functions, and functions such as unified_diff and context_diff produce other output formats.
  5. Is there a way to quickly spot changes between two large texts? Yes, you can call the get_opcodes() method of a SequenceMatcher object. It describes the differences as blocks tagged replace, delete, insert, or equal, which can be useful for large texts.
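As a rough illustration of FAQs 3 and 5, the sketch below runs SequenceMatcher over lists of lines, keeps only the changed blocks reported by get_opcodes(), and then builds an HTML comparison table with HtmlDiff.make_table(). The sample lines and the names old_lines, new_lines, and changes are assumptions made for this example.

```python
from difflib import HtmlDiff, SequenceMatcher

old_lines = [
    "The sun rises in the east",
    "A stitch in time saves nine",
    "All that glitters is not gold",
]
new_lines = [
    "The sun sets in the west",
    "A stitch in time saves nine",
    "All that shines is not gold",
]

# Comparing lists of lines makes each opcode describe a block of lines.
matcher = SequenceMatcher(None, old_lines, new_lines)
changes = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":  # skip blocks that are identical on both sides
        changes.append((tag, old_lines[i1:i2], new_lines[j1:j2]))

for tag, old, new in changes:
    print(tag, old, "->", new)

# make_table() returns an HTML <table> string with a side-by-side comparison.
html_table = HtmlDiff().make_table(old_lines, new_lines, fromdesc="old", todesc="new")
print(html_table[:60])
```

For large inputs, filtering out the equal blocks like this keeps the report focused on the parts that actually changed.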
© Copyright 2023 CLONE CODING