Text Passage Similarity Algorithm

GitHub Repository

My sister-in-law was recently discussing a project of hers involving the preservation and restoration of digital copies of an old romance novel series she enjoyed. I was impressed by the thoroughness and care she showed in digitally remastering the series: refinishing cover art, creating text formatting, and adding internal chapter links. She mentioned that different digital copies of a book also differed in content; automated transcription from scans sometimes got the wording or punctuation wrong compared to an original copy, and part of her work was manually rectifying these inconsistencies. I began to think about human-friendly measures of text similarity that could highlight where passages of text are similar and where they differ.

Although I'm sure there are precedented ways to do something like this, I first want to work through some ideas unaided to clarify my own requirements for the algorithm, and then research existing methods.

Attempt 1: Brute Force, Recursive

A good strategy when solving problems like these is to try the brute-force approach first. Here, that means: pick a random point in the first passage and brute-force search the second passage for a small segment surrounding that point. If no match is found, repeat with a new random point. If a match is found, expand the segment left and right in both passages until a mismatch appears. At that point each passage is divided into three sections: the text before the matching segment, the matching segment itself, and the text after it. The process can then be run recursively on the first and third sections.

Surprisingly, this solution worked quite well on a simple example and was quite performant! Here is the code that compares two passages after they have been read in from a file and all newlines converted to spaces.
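As a side note, that preprocessing step might look like the following sketch (`load_passage` is a hypothetical helper name, not part of the original code):

```python
def load_passage(path):
    # Read the whole file and flatten newlines to spaces so that
    # line-wrapping differences between copies don't register as mismatches.
    with open(path, encoding="utf-8") as f:
        return f.read().replace("\n", " ")
```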


import random
import re

def match_seg(A_real_loc, A, B_real_loc, B, segs, match_size=15):
    # Base case: A is too short to sample a probe from; it either
    # matches B wholesale or not.
    if len(A) <= match_size:
        if A == B:
            segs.append(((A_real_loc, A_real_loc+len(A)), (B_real_loc, B_real_loc+len(B))))
        return
    # Pick a random probe of match_size characters from A.
    A_random_loc = random.randint(0, len(A) - match_size - 1)
    AL = A_random_loc
    AR = A_random_loc + match_size
    A_seg = A[AL:AR]
    match_in_B = re.search(re.escape(A_seg), B)
    if match_in_B:
        BL, BR = match_in_B.start(), match_in_B.end()
        # Expand right while both passages keep matching. The bounds checks
        # stop us at either string's end (and avoid looping forever there).
        while AR < len(A) and BR < len(B) and A[AL:AR+1] == B[BL:BR+1]:
            AR = AR + 1
            BR = BR + 1
        # Expand left in the same way.
        while AL > 0 and BL > 0 and A[AL-1:AR] == B[BL-1:BR]:
            AL = AL - 1
            BL = BL - 1
        # ASSUMPTION: match_size is long enough to ensure a unique match
        segs.append(((A_real_loc+AL, A_real_loc+AR), (B_real_loc+BL, B_real_loc+BR)))
        # ASSUMPTION: Matches to the left of segment in A will be in left of matching segment in B
        match_seg(A_real_loc, A[:AL], B_real_loc, B[:BL], segs, match_size)
        # AR and BR are exclusive ends, so the right-hand remainders start at AR and BR.
        match_seg(A_real_loc+AR, A[AR:], B_real_loc+BR, B[BR:], segs, match_size)

Here is what the code is doing:

  1. Check whether the base case has been reached. When the segment is no longer than what we consider enough length for a match, it either matches the segment from the second passage or it doesn't. If it does, add the positions to the list of matching segments.
  2. Pick a random location in A and take a string long enough for matching (determined by the match_size parameter).
  3. If this matches anywhere in B (re.search), expand the left and right bounds in both passages until the segments no longer match, then record the final matching segment bounds in the list of matching segments.
  4. Repeat the algorithm for the portions to the left and right of the matching segments.
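To see step 3 in isolation, here is a self-contained toy example (the strings and the seed are made up; the expansion is written per-character, which is equivalent to the slice comparisons in the function above):

```python
import re

A = "the quick brown fox jumps"
B = "a quick brown fox leaps"
seed = "quick brow"  # stand-in for the randomly chosen probe from A

AL = A.index(seed)
AR = AL + len(seed)
m = re.search(re.escape(seed), B)
BL, BR = m.start(), m.end()

# Expand right while the next character matches in both passages.
while AR < len(A) and BR < len(B) and A[AR] == B[BR]:
    AR += 1
    BR += 1
# Expand left likewise.
while AL > 0 and BL > 0 and A[AL - 1] == B[BL - 1]:
    AL -= 1
    BL -= 1

print(repr(A[AL:AR]))  # the maximal common run around the seed
```

For these two strings the recorded segment is " quick brown fox ", and the recursion would continue on "the"/"a" to the left and "jumps"/"leaps" to the right.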

For completely different texts and worst-case random selection, this solution has O(n^2) complexity. It is also only part of the puzzle. I tested it on Heart of Darkness from the Gutenberg Library; passage A was extracted from a .epub file and included lots of paragraph, italic, and other formatting tags, while passage B came from a .txt file. Running this function showed that those tags were the only differences between the texts, which is to say that the algorithm does not yet handle actual differences in content.

Another less clarified component of this is match_size. I picked 15 characters because I thought it would be a natural length to ensure the uniqueness of a passage, but this choice could use a stronger theoretical basis.
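One way to put that on slightly firmer empirical footing: for a given text, measure what fraction of its length-k substrings occur exactly once. This is my own rough sketch, with an arbitrary sample string standing in for a real novel:

```python
from collections import Counter

def unique_fraction(text, k):
    # Count every length-k substring and return the share that occurs exactly once.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return sum(1 for c in counts.values() if c == 1) / len(counts)

sample = ("it was the best of times it was the worst of times "
          "it was the age of wisdom it was the age of foolishness")
for k in (5, 10, 15):
    print(k, round(unique_fraction(sample, k), 2))
```

On a real book-length text, the smallest k at which this fraction approaches 1 would be a principled choice for match_size, since it directly bounds how often the "unique match" assumption in the code fails.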

Expanding Difference Detection