PEAR Text_Diff doesn’t split words on punctuation

The inline renderer in PEAR Text_Diff splits words naively: it treats only spaces and newlines (\n) as word boundaries.

This causes problems with punctuation. Suppose you are diffing the following two sentences:

The quick cat jumped over the lazy fox.
The quick cat jumped over the lazy dog.

The final rendered output will look like this:

The quick cat jumped over the lazy <del>fox.</del><ins>dog.</ins>

Notice how the period is treated as part of the word? The renderer replaces “fox.” with “dog.” instead of just “fox” with “dog”, which makes for messy markup. This comparison is worse:

The quick cat jumped over the lazy fox, who was totally lazy and should be shot.
The quick cat jumped over the lazy fox.

Here’s how PEAR Text_Diff does the diff:

The quick cat jumped over the lazy <del>fox, who was totally lazy and should be shot.</del><ins>fox.</ins>

This final diff is difficult to read. You are not deleting and reinserting “fox”; you are only changing the punctuation to its right and dropping the clause that follows. But because the inline diff renderer treats only space and newline as word boundaries, it sees “fox,” and “fox.” as two different words and misses this basic punctuation change.

It took me 1.5 hours of reading the PHP code to understand the system, but the fix itself is painfully easy. Edit PEAR/Text/Diff/Renderer/inline.php. At lines 158 and 159 (per the online source code), you’ll see " \n" at the end. That string is the set of word-boundary characters, passed as a mask to the PHP strspn function. Simply add your own boundary characters (a period and a comma, for example) between the quotes, and the diff engine splits words correctly.
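
To see why the mask matters, here is a rough sketch of mask-based word splitting. It is not the code from inline.php: the function name splitOnBoundaries and the extended mask are made up for illustration, and it only assumes the renderer walks the string with strspn and its counterpart strcspn in roughly this way.

```php
<?php
// Hypothetical helper (not the library's code): split $text into tokens,
// where each token is a run of boundary characters followed by a run of
// non-boundary characters.
function splitOnBoundaries($text, $mask = " \n")
{
    $tokens = array();
    $length = strlen($text);
    $pos = 0;

    while ($pos < $length) {
        // Length of the run of boundary characters starting at $pos.
        $boundary = strspn($text, $mask, $pos);
        // Length of the run of non-boundary characters that follows it.
        $word = strcspn($text, $mask, $pos + $boundary);

        $tokens[] = substr($text, $pos, $boundary + $word);
        $pos += $boundary + $word;
    }

    return $tokens;
}

// Default mask: the period stays glued to the last word.
print_r(splitOnBoundaries("the lazy fox."));
// tokens: "the", " lazy", " fox."

// Extended mask (an assumed choice of punctuation): the period becomes
// its own token, so "fox" and "." can each match independently.
print_r(splitOnBoundaries("the lazy fox.", " \n.,"));
// tokens: "the", " lazy", " fox", "."
```

With the extended mask, “fox,” and “fox.” both produce a “ fox” token, so a word-level diff can keep “fox” and report only the punctuation change.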

I’ve reported this as PHP PEAR bug 16774.
