A simple diff algorithm in PHP

A diff algorithm in its most basic form takes two strings, and returns the changes needed to make the old string into the new one. They are useful in comparing different versions of a document or file, to see at a glance what the differences are between the two. Wikipedia, for example, uses diffs to compare the changes between two revisions of the same article.

Solving the problem is not as simple as it seems, and the problem bothered me for about a year before I figured it out. I managed to write my algorithm in PHP, in 18 lines of code. It is not the most efficient way to do a diff, but it is probably the easiest to understand.

It works by finding the longest sequence of words common to both strings, and recursively finding the longest sequences of the remainders of the string until the substrings have no words in common. At this point it adds the remaining new words as an insertion and the remaining old words as a deletion.

You can download the source here: PHP SimpleDiff

23 Comments

  1. Bob Wildfong says:

    This is excellent work, Paul. Thanks for offering it. We’ve been looking for a nice, clean implementation of diff in PHP that is unconstrained by line breaks (we don’t care about where the line breaks are in HTML). We operate a non-profit project that allows volunteers to update our web site, and I need to see where the changes are so I can approve and activate them. I’ve chopped my users’ website updates on whitespace and ‘>’ tag ends, and used your code to present a very clear diff of their changes.

  2. Hamish M says:

    Nice work Paul, this could come in handy sometime :)

  3. Rajesh Kumar says:

    This is a great algorithm Paul, but I am trying to figure out how this is different from using PHP’s built-in array_diff() and array_intersect() functions to achieve this.

    Your post reminds me of the Edit Distance dynamic problem I first came across in this video from MIT:
    http://people.csail.mit.edu/bdean/6.046/dp/dp_8.swf

  4. Paul Butler says:

    Rajesh, interesting video. I’ve wondered how to best solve that problem before. I have a hunch that diffing two strings and finding the edit distance reduces to the same problem. However, I used a different algorithm than they used in the video. Their algorithm guarantees an optimal solution, so it might be fun to implement as well.

    array_diff() and array_intersect() give the intersection and difference of two sets. When you are comparing two pieces of text, for example, it doesn’t do you much good to know what words have been inserted or removed without knowing where they were inserted and removed from.

  5. Rajesh Kumar says:

    Ahh! I get it now. Thanks. I’m going to go away and try to implement a recursive algorithm that does the same thing but uses str_word_count($str, 2) which gives me the “where” bit.

  6. Paul Butler says:

    Rajesh, neat, I’ve never seen str_word_count before, but it should do the trick nicely. I used array_keys() to do the same thing since I was working with an array, but if you just want to diff strings it makes more sense to keep them in string format. Let me know how the implementation goes.

  7. _ck_ says:

    I’ve been studying your code and it’s very clever but unfortunately I am not as clever apparently! I am trying to figure out the correct (easy) method to use the diff array it generates to “roll back” the newer text to the old text so just the difference can to be stored.

    I think I am getting caught up on how the array key is the word position of the OLD text and not the NEW text? Is that correct? Is there a better way to handle this problem?

  8. Paul Butler says:

    Ck, the code is admittedly a bit hard to read. I found I had trouble understanding it myself when I recently read over it to make the Python version (which you should study instead if you know Python, since it is the better of the two). By easiest to understand, I meant the algorithm itself, not my cryptic implementation of it.

    The big for loop is to find the largest substring between new and old. The string variables could be swapped for one another in this loop and the output should still be the same.

  9. _ck_ says:

    Well essentially what I’m doing is taking the resulting array your subroutine generates. Then I loop through it an unset anything that’s not an array, so that leaves just the array of “diffs” which are indexed by the word position. That’s what I want to store so if there’s a 1000 word document, and one word inserted and one word deleted, only two small arrays are stored.

    The problem is, when I have only the new document to work backwards from, the index positions in your array are based on the old document. It’s occurred to me I could recurse and recount the words every time I effect a change from the array before continuing to the next set and that might “line-up” the word positions but it doesn’t seem to be working and the complexity of the resulting code is giving me a migrane…

    So I guess what I am really asking, is there a way to modify your routine so that it stores the index positions based on the new document which would make the reversal (undo) much easier for me? Perhaps after the forward-looking array is generated it could be converted into a reverse-looking one somehow?

    (sorry I don’t know the correct terminology, completely self-taught and an amateur at best!)

  10. Paul Butler says:

    I think calling diff with the arguments reversed (diff($new_string, $old_string)) might do what you want.

    You might also want to look into delta encoding: (http://en.wikipedia.org/wiki/Delta_encoding).

  11. _ck_ says:

    With fresh eyes (and migraine finally gone) I figured it out by studying some samples step-by-step. The resulting code to do rollbacks is so simple it’s embarrassing:

    ($diffdiff is the $diff with all non-array and blank elements removed but all other keys kept intact. $new is the current text, where we want rollback to $old which is our previous edit.)

    $old=explode(” “,$new);
    $offset=-1;
    foreach ($diffdiff as $key=>$value) {
    array_splice ( $old , $key+$offset , count($diffdiff[$key]['i']) , $diffdiff[$key]['d'] );
    $offset+=count($diffdiff[$key]['d']) -1;
    }
    $old=implode(” “,$old);

    tada!
    (I hope the blog software doesn’t eat the code)

  12. Paul Butler says:

    Wow, that is simple.. Thanks for posting it.

  13. kriplozoik says:

    I was currently looking for something just like that – simple, independent on other libraries, not hundreds of lines of code.

    Nice work and (as said above) thanks for releasing it.

  14. gbruins says:

    Great function Paul! Thought this might help someone out…here is the “diff” function and the presentation code to display the lines that are different, as well as the actual differences in each line highlighted:

    (hoping that my code will remain intact here)

    $oldValue){
    $newKeys = array_keys($new, $oldValue);

    foreach($newKeys as $newIndex){
    $matrix[$oldIndex][$newIndex] = isset($matrix[$oldIndex - 1][$newIndex - 1]) ?
    $matrix[$oldIndex - 1][$newIndex - 1] + 1 : 1;
    if($matrix[$oldIndex][$newIndex] > $maxlen){
    $maxlen = $matrix[$oldIndex][$newIndex];
    $oldMax = $oldIndex + 1 – $maxlen;
    $newMax = $newIndex + 1 – $maxlen;
    }
    }
    }

    if($maxlen == 0) {
    $diffArray = array(array(‘deleted’=>$old, ‘inserted’=>$new));
    } else {
    $diffArray = array_merge(
    diff(array_slice($old, 0, $oldMax), array_slice($new, 0, $newMax)),
    array_slice($new, $newMax, $maxlen),
    diff(array_slice($old, $oldMax + $maxlen), array_slice($new, $newMax + $maxlen))
    );
    }
    return $diffArray;
    }

    $diffArray = diff($f2, $f1);
    ?>

    #diffTable {
    border-collapse: collapse;
    }

    #diffTable td {
    padding: 1px 5px 1px 10px;
    vertical-align: top;
    }

    td.lineNumber {
    border-right: 1px solid #ccc;
    }

    span.strDiff {
    background-color: #B6FF9F;
    padding: 0;
    margin: 0;
    }

    $diff) {
    if(!is_array($diff)) {
    ?>


     

    count($diff['inserted']) ? $diff['deleted'] : $diff['inserted'];
    foreach($biggest as $diffKey => $diffVal) {
    if(isset($diff['deleted'][$diffKey]) &&
    isset($diff['inserted'][$diffKey]) &&
    trim($diff['deleted'][$diffKey]) == trim($diff['inserted'][$diffKey])) {
    $newStr1 = $diff['deleted'][$diffKey];
    $newStr2 = $diff['inserted'][$diffKey];
    $backgroundStyle = “”;
    } else {
    //explode the strings so we can highlight the differences between strings
    $str1 = explode(” “, $diff['deleted'][$diffKey]);
    $str2 = explode(” “, $diff['inserted'][$diffKey]);
    $newStr1 = “”;
    $newStr2 = “”;

    if(count($str1) >= count($str2)) {
    foreach($str1 as $strKey => $strVal) {
    if(isset($str1[$strKey]) && isset($str2[$strKey])) {
    if($str1[$strKey] != $str2[$strKey]) {
    $newStr1 .= ‘ ‘.$str1[$strKey].”;
    $newStr2 .= ‘ ‘.$str2[$strKey].”;
    } else {
    $newStr1 .= ‘ ‘.$str1[$strKey];
    $newStr2 .= ‘ ‘.$str2[$strKey];
    }
    } elseif(isset($str1[$strKey]) && !isset($str2[$strKey])) {
    $newStr1 .= ‘ ‘.$str1[$strKey];
    $newStr2 .= ”;
    } else {
    $newStr1 .= ”;
    $newStr2 .= ‘ ‘.$str2[$strKey];
    }
    }
    } else {
    foreach($str2 as $strKey => $strVal) {
    if(isset($str2[$strKey]) && isset($str1[$strKey])) {
    if($str2[$strKey] != $str1[$strKey]) {
    $newStr1 .= ‘ ‘.$str1[$strKey].”;
    $newStr2 .= ‘ ‘.$str2[$strKey].”;
    } elseif(isset($str2[$strKey]) && !isset($str1[$strKey])) {
    $newStr2 .= ‘ ‘.$str2[$strKey];
    $newStr1 .= ”;
    } else {
    $newStr2 .= ”;
    $newStr1 .= ‘ ‘.$str1[$strKey];
    }
    }
    }
    }
    $backgroundStyle = “background-color: #FFCFCF;”;
    }
    ?>


    “>


    “>


     


    “>


    “>

  15. gbruins says:

    oops that didn’t work out so well. I guess Ill try emailing the code to Paul so he can post it if he wants to. Sorry!

  16. Paul Butler says:

    Yeah, unfortunately WordPress doesn’t work well with code in comments. Thanks for e-mailing the code, I’ll have a look.

  17. Henry says:

    How hard would it be to adapt this to display the strings side by side with differences highlighted, similar to the wikimedia difference engine?

  18. Chris says:

    Hi Paul,

    I was told to manage a website that actually already uses your algorithm.
    After redesigning the program logic and database to handle UTF8-data instead of iso latin1, I discover line breaks being displayed in a wrong way.

    Is your algorithm UTF8-safe?

    regards, Chris

  19. Niraj says:

    A beautiful use of recursion to solve the problem of computing diff!!

  20. Tehem says:

    Great code, really useful. I was searching for a text diff without having to install PEAR stuff, that’s it. Thanks a LOT ! I’m very grateful.

  21. Jens says:

    Paul, this is really a good portion of code in a nice way – thanks a lot. Could you please give us a link to the snipped gbruins mailed you? I am very interested in it. You could send it via mail too.
    Wish you a happy new year!

Leave a Reply