In a previous post, [Python] Comparing Two Strings: Understanding Equality, Inequality, and Relational Operations, we explored various methods to compare strings. Now we take a step further and examine longer pieces of text to pinpoint exactly where they differ. This blog post will guide you through using Python's difflib module to analyze and compare substantial amounts of text, which makes it a valuable tool for anyone dealing with complex text comparisons.
The SequenceMatcher class from the difflib module can be used to compare two strings and compute a similarity ratio between them.
from difflib import SequenceMatcher
string1 = "The sun rises in the east\nA stitch in time saves nine\nAll that glitters is not gold"
string2 = "The sun sets in the west\nA stitch in time saves nine\nAll that shines is not gold\n"
s = SequenceMatcher(None, string1, string2)
similarity_ratio = s.ratio()
print(similarity_ratio) # Output: 0.9146341463414634
This code snippet compares two strings and returns a similarity ratio between 0 and 1, where 1 means the strings are identical. It is a handy way to gauge how close two texts are, although the ratio alone does not show where they differ.
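To make the ratio less of a black box, here is a minimal sketch that recomputes it by hand. It assumes the documented definition of ratio() as 2.0 * M / T, where M is the number of matched characters and T is the combined length of both strings. The variable names (matcher, matched, total, manual_ratio) are illustrative and not part of the original example.

```python
from difflib import SequenceMatcher

string1 = "The sun rises in the east\nA stitch in time saves nine\nAll that glitters is not gold"
string2 = "The sun sets in the west\nA stitch in time saves nine\nAll that shines is not gold\n"

matcher = SequenceMatcher(None, string1, string2)

# M: total number of characters covered by the matching blocks.
# (The final block is a zero-length sentinel, so it adds nothing.)
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: combined length of both strings.
total = len(string1) + len(string2)

# Assumed definition of the ratio: 2.0 * M / T.
manual_ratio = 2.0 * matched / total

print(matched, total)
print(manual_ratio == matcher.ratio())
```

Because the ratio depends only on how many characters match, two texts with very different kinds of edits can still produce the same number.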
The Differ class in the difflib module allows us to compare larger text documents line by line.
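from difflib import Differ
# splitlines() avoids the trailing empty string that split('\n') would produce for string2
text1 = string1.splitlines()
text2 = string2.splitlines()
d = Differ()
diff = d.compare(text1, text2)
# Differ's "? " hint lines already end with a newline, so strip it before joining
print('\n'.join(line.rstrip('\n') for line in diff))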
Output
- The sun rises in the east
?         --            -
+ The sun sets in the west
?           +         +
  A stitch in time saves nine
- All that glitters is not gold
?          ^^ ^^ -
+ All that shines is not gold
?          ^^ ^
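Every line that Differ yields begins with a two-character code ("- ", "+ ", "? ", or two spaces), so the diff can also be processed in code rather than only read by eye. The following is a small sketch of that idea. The list names removed, added, and unchanged are hypothetical helpers introduced here for illustration.

```python
from difflib import Differ

text1 = [
    "The sun rises in the east",
    "A stitch in time saves nine",
    "All that glitters is not gold",
]
text2 = [
    "The sun sets in the west",
    "A stitch in time saves nine",
    "All that shines is not gold",
]

removed, added, unchanged = [], [], []

for line in Differ().compare(text1, text2):
    # The first two characters are the code, the rest is the line itself.
    code, content = line[:2], line[2:]
    if code == "- ":
        removed.append(content)      # only in text1
    elif code == "+ ":
        added.append(content)        # only in text2
    elif code == "  ":
        unchanged.append(content)    # present in both
    # "? " hint lines are skipped; they only annotate the line above them.

print("Removed:  ", removed)
print("Added:    ", added)
print("Unchanged:", unchanged)
```

Sorting lines into buckets like this is one way to summarize a long diff, for example by counting how many lines changed.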
In the Differ example above, the Differ class from the difflib module is used to compare two sequences of strings, text1 and text2. The output highlights the differences between these two sequences, using a symbol at the start of each line to indicate the nature of each difference:
Lines starting with - are present in text1 but not in text2.
Lines starting with + are present in text2 but not in text1.
Lines starting with ? show the specific positions of changes within the line above them, using -, +, and ^ to flag deleted, inserted, and replaced characters.
Let's break down the provided output:
The first - line indicates that "The sun rises in the east" is in text1, but not in text2. The + line shows that "The sun sets in the west" is in text2, but not in text1. The ? lines pinpoint the exact characters that have changed.
The line "A stitch in time saves nine" is identical in text1 and text2, and so it doesn't have any symbol at the beginning.
The final -, +, and ? lines show that "All that glitters is not gold" in text1 has been replaced with "All that shines is not gold" in text2, and they highlight the specific changes.
The difflib module is intended for comparing sequences, most commonly lines of text.
The difflib module has been available in the Python standard library since version 2.1.
The HtmlDiff class in the difflib module generates HTML side-by-side comparison tables.
Classes such as Differ provide options to customize how the comparison is performed and how the output is generated.
The SequenceMatcher.get_opcodes() method returns information about blocks of changes, which can be useful for large texts.
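As a rough illustration of the last point: get_opcodes() is a method of SequenceMatcher instances, and each opcode is a tuple (tag, i1, i2, j1, j2) describing how a slice of the first sequence maps onto a slice of the second. The sketch below runs it on lists of lines rather than characters, which is one reasonable choice for larger texts. The names old_lines, new_lines, and changes are made up for this example.

```python
from difflib import SequenceMatcher

old_lines = [
    "The sun rises in the east",
    "A stitch in time saves nine",
    "All that glitters is not gold",
]
new_lines = [
    "The sun sets in the west",
    "A stitch in time saves nine",
    "All that shines is not gold",
]

# Compare whole lines instead of individual characters.
matcher = SequenceMatcher(None, old_lines, new_lines)
opcodes = matcher.get_opcodes()

# Keep only the blocks that actually changed.
changes = []
for tag, i1, i2, j1, j2 in opcodes:
    if tag != "equal":
        changes.append((tag, old_lines[i1:i2], new_lines[j1:j2]))

for tag, old_block, new_block in changes:
    print(tag, old_block, "->", new_block)
```

The tag is one of "replace", "delete", "insert", or "equal", so skipping the "equal" blocks leaves only the regions worth reviewing.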