A simple diff algorithm in PHP

A diff algorithm in its most basic form takes two strings, and returns the changes needed to make the old string into the new one. They are useful in comparing different versions of a document or file, to see at a glance what the differences are between the two. Wikipedia, for example, uses diffs to compare the changes between two revisions of the same article.

Solving the problem is not as simple as it seems, and the problem bothered me for about a year before I figured it out. I managed to write my algorithm in PHP, in 18 lines of code. It is not the most efficient way to do a diff, but it is probably the easiest to understand.

It works by finding the longest sequence of words common to both strings, and recursively finding the longest sequences of the remainders of the string until the substrings have no words in common. At this point it adds the remaining new words as an insertion and the remaining old words as a deletion.

You can download the source here: PHP SimpleDiff

This entry was posted in PHP. Bookmark the permalink.

51 Responses to A simple diff algorithm in PHP

  1. Jeff Kee says:

    WOW very impressive!

  2. Bob Wildfong says:

    This is excellent work, Paul. Thanks for offering it. We’ve been looking for a nice, clean implementation of diff in PHP that is unconstrained by line breaks (we don’t care about where the line breaks are in HTML). We operate a non-profit project that allows volunteers to update our web site, and I need to see where the changes are so I can approve and activate them. I’ve chopped my users’ website updates on whitespace and ‘>’ tag ends, and used your code to present a very clear diff of their changes.

  3. Hamish M says:

    Nice work Paul, this could come in handy sometime :)

  4. Rajesh Kumar says:

    This is a great algorithm Paul, but I am trying to figure out how this is different from using PHP’s built-in array_diff() and array_intersect() functions to achieve this.

    Your post reminds me of the Edit Distance dynamic problem I first came across in this video from MIT:
    http://people.csail.mit.edu/bdean/6.046/dp/dp_8.swf

  5. Paul Butler says:

    Rajesh, interesting video. I’ve wondered how to best solve that problem before. I have a hunch that diffing two strings and finding the edit distance reduces to the same problem. However, I used a different algorithm than they used in the video. Their algorithm guarantees an optimal solution, so it might be fun to implement as well.

    array_diff() and array_intersect() give the intersection and difference of two sets. When you are comparing two pieces of text, for example, it doesn’t do you much good to know what words have been inserted or removed without knowing where they were inserted and removed from.

  6. Rajesh Kumar says:

    Ahh! I get it now. Thanks. I’m going to go away and try to implement a recursive algorithm that does the same thing but uses str_word_count($str, 2) which gives me the “where” bit.

  7. Paul Butler says:

    Rajesh, neat, I’ve never seen str_word_count before, but it should do the trick nicely. I used array_keys() to do the same thing since I was working with an array, but if you just want to diff strings it makes more sense to keep them in string format. Let me know how the implementation goes.

  8. _ck_ says:

    I’ve been studying your code and it’s very clever but unfortunately I am not as clever apparently! I am trying to figure out the correct (easy) method to use the diff array it generates to “roll back” the newer text to the old text so just the difference can to be stored.

    I think I am getting caught up on how the array key is the word position of the OLD text and not the NEW text? Is that correct? Is there a better way to handle this problem?

  9. Paul Butler says:

    Ck, the code is admittedly a bit hard to read. I found I had trouble understanding it myself when I recently read over it to make the Python version (which you should study instead if you know Python, since it is the better of the two). By easiest to understand, I meant the algorithm itself, not my cryptic implementation of it.

    The big for loop is to find the largest substring between new and old. The string variables could be swapped for one another in this loop and the output should still be the same.

  10. _ck_ says:

    Well essentially what I’m doing is taking the resulting array your subroutine generates. Then I loop through it an unset anything that’s not an array, so that leaves just the array of “diffs” which are indexed by the word position. That’s what I want to store so if there’s a 1000 word document, and one word inserted and one word deleted, only two small arrays are stored.

    The problem is, when I have only the new document to work backwards from, the index positions in your array are based on the old document. It’s occurred to me I could recurse and recount the words every time I effect a change from the array before continuing to the next set and that might “line-up” the word positions but it doesn’t seem to be working and the complexity of the resulting code is giving me a migrane…

    So I guess what I am really asking, is there a way to modify your routine so that it stores the index positions based on the new document which would make the reversal (undo) much easier for me? Perhaps after the forward-looking array is generated it could be converted into a reverse-looking one somehow?

    (sorry I don’t know the correct terminology, completely self-taught and an amateur at best!)

  11. Paul Butler says:

    I think calling diff with the arguments reversed (diff($new_string, $old_string)) might do what you want.

    You might also want to look into delta encoding: (http://en.wikipedia.org/wiki/Delta_encoding).

  12. _ck_ says:

    With fresh eyes (and migraine finally gone) I figured it out by studying some samples step-by-step. The resulting code to do rollbacks is so simple it’s embarrassing:

    ($diffdiff is the $diff with all non-array and blank elements removed but all other keys kept intact. $new is the current text, where we want rollback to $old which is our previous edit.)

    $old=explode(” “,$new);
    $offset=-1;
    foreach ($diffdiff as $key=>$value) {
    array_splice ( $old , $key+$offset , count($diffdiff[$key]['i']) , $diffdiff[$key]['d'] );
    $offset+=count($diffdiff[$key]['d']) -1;
    }
    $old=implode(” “,$old);

    tada!
    (I hope the blog software doesn’t eat the code)

    • kibuc says:

      Mate, If your code actually does the job then I owe you big time! I’ve been doing the same work for my academic project for a couple of days lately: getting diff array, stripping it of non-changes and storing in a db for a small version control system I’m building for my app. The only difference is, I’m keeping the original ‘old’ file so what i want to do is to generate the new one basing on the original file and the array of changes (or arrays – for further revisions). Unfortunately, the indexes in the array get all messed up and don’t correspond neither to the old or new file if there is multiple line deletion or insertion in succeeding lines (to be precise – if there number of lines inserted and deleted in one step does not match) and I took me ages to deal with it. Still, I’m only getting the first revision right (the one based on the original file). If I’m trying to get further revisions, meaning I’m not working on the original file anymore but on the one generated from it, It all goes messy again :( I’ll check if your code works for my project as soon as I get back home. I hope I’m free to use it for non-commercial use, since it’s posted on the forum?

  13. Paul Butler says:

    Wow, that is simple.. Thanks for posting it.

  14. kriplozoik says:

    I was currently looking for something just like that – simple, independent on other libraries, not hundreds of lines of code.

    Nice work and (as said above) thanks for releasing it.

  15. gbruins says:

    Great function Paul! Thought this might help someone out…here is the “diff” function and the presentation code to display the lines that are different, as well as the actual differences in each line highlighted:

    (hoping that my code will remain intact here)

    $oldValue){
    $newKeys = array_keys($new, $oldValue);

    foreach($newKeys as $newIndex){
    $matrix[$oldIndex][$newIndex] = isset($matrix[$oldIndex - 1][$newIndex - 1]) ?
    $matrix[$oldIndex - 1][$newIndex - 1] + 1 : 1;
    if($matrix[$oldIndex][$newIndex] > $maxlen){
    $maxlen = $matrix[$oldIndex][$newIndex];
    $oldMax = $oldIndex + 1 – $maxlen;
    $newMax = $newIndex + 1 – $maxlen;
    }
    }
    }

    if($maxlen == 0) {
    $diffArray = array(array(‘deleted’=>$old, ‘inserted’=>$new));
    } else {
    $diffArray = array_merge(
    diff(array_slice($old, 0, $oldMax), array_slice($new, 0, $newMax)),
    array_slice($new, $newMax, $maxlen),
    diff(array_slice($old, $oldMax + $maxlen), array_slice($new, $newMax + $maxlen))
    );
    }
    return $diffArray;
    }

    $diffArray = diff($f2, $f1);
    ?>

    #diffTable {
    border-collapse: collapse;
    }

    #diffTable td {
    padding: 1px 5px 1px 10px;
    vertical-align: top;
    }

    td.lineNumber {
    border-right: 1px solid #ccc;
    }

    span.strDiff {
    background-color: #B6FF9F;
    padding: 0;
    margin: 0;
    }

    $diff) {
    if(!is_array($diff)) {
    ?>


     

    count($diff['inserted']) ? $diff['deleted'] : $diff['inserted'];
    foreach($biggest as $diffKey => $diffVal) {
    if(isset($diff['deleted'][$diffKey]) &&
    isset($diff['inserted'][$diffKey]) &&
    trim($diff['deleted'][$diffKey]) == trim($diff['inserted'][$diffKey])) {
    $newStr1 = $diff['deleted'][$diffKey];
    $newStr2 = $diff['inserted'][$diffKey];
    $backgroundStyle = “”;
    } else {
    //explode the strings so we can highlight the differences between strings
    $str1 = explode(” “, $diff['deleted'][$diffKey]);
    $str2 = explode(” “, $diff['inserted'][$diffKey]);
    $newStr1 = “”;
    $newStr2 = “”;

    if(count($str1) >= count($str2)) {
    foreach($str1 as $strKey => $strVal) {
    if(isset($str1[$strKey]) && isset($str2[$strKey])) {
    if($str1[$strKey] != $str2[$strKey]) {
    $newStr1 .= ‘ ‘.$str1[$strKey].”;
    $newStr2 .= ‘ ‘.$str2[$strKey].”;
    } else {
    $newStr1 .= ‘ ‘.$str1[$strKey];
    $newStr2 .= ‘ ‘.$str2[$strKey];
    }
    } elseif(isset($str1[$strKey]) && !isset($str2[$strKey])) {
    $newStr1 .= ‘ ‘.$str1[$strKey];
    $newStr2 .= ”;
    } else {
    $newStr1 .= ”;
    $newStr2 .= ‘ ‘.$str2[$strKey];
    }
    }
    } else {
    foreach($str2 as $strKey => $strVal) {
    if(isset($str2[$strKey]) && isset($str1[$strKey])) {
    if($str2[$strKey] != $str1[$strKey]) {
    $newStr1 .= ‘ ‘.$str1[$strKey].”;
    $newStr2 .= ‘ ‘.$str2[$strKey].”;
    } elseif(isset($str2[$strKey]) && !isset($str1[$strKey])) {
    $newStr2 .= ‘ ‘.$str2[$strKey];
    $newStr1 .= ”;
    } else {
    $newStr2 .= ”;
    $newStr1 .= ‘ ‘.$str1[$strKey];
    }
    }
    }
    }
    $backgroundStyle = “background-color: #FFCFCF;”;
    }
    ?>


    “>


    “>


     


    “>


    “>

  16. gbruins says:

    oops that didn’t work out so well. I guess Ill try emailing the code to Paul so he can post it if he wants to. Sorry!

  17. Paul Butler says:

    Yeah, unfortunately WordPress doesn’t work well with code in comments. Thanks for e-mailing the code, I’ll have a look.

  18. Henry says:

    How hard would it be to adapt this to display the strings side by side with differences highlighted, similar to the wikimedia difference engine?

  19. Willem says:

    Great script, thanks!

  20. Chris says:

    Hi Paul,

    I was told to manage a website that actually already uses your algorithm.
    After redesigning the program logic and database to handle UTF8-data instead of iso latin1, I discover line breaks being displayed in a wrong way.

    Is your algorithm UTF8-safe?

    regards, Chris

  21. Niraj says:

    A beautiful use of recursion to solve the problem of computing diff!!

  22. Tehem says:

    Great code, really useful. I was searching for a text diff without having to install PEAR stuff, that’s it. Thanks a LOT ! I’m very grateful.

  23. Jens says:

    Paul, this is really a good portion of code in a nice way – thanks a lot. Could you please give us a link to the snipped gbruins mailed you? I am very interested in it. You could send it via mail too.
    Wish you a happy new year!

  24. Peter West says:

    I wrote a diff algorithm a few days ago, but because I’m a perfectionist it had exponentially bad performance (it was essentially a brute force algorithm). So I scrapped mine and spent a few hours working out yours. I never tire of seeing a really good recursive function! After some testing I’m satisfied the performance is really excellent.

    Anyway your diff algorithm was not quite suitable for my purposes so I’ve made some changes and additions here:
    http://piratepad.net/ep/pad/view/IY48xXzsys/syZUX3SzwI

    Notably I’m using preg_split() with a zero length lookahead match, allowing the string to be split by any combination of characters (I’ve found that ” “, “\n” and “.” are a good combination) or by each character using “”.

    This also helped with the other feature I’ve added: sequentialChanges() which converts a diff into a list of sequential changes and renderChanges() which applies those changes (not having missing spaces everywhere made this a lot easier).

    Here’s an example:
    $diff = new Diff
    $old = “foo bar zim”
    $new = “zim bar gir”
    $result = $diff->stringDiff($old, $new, ” “)
    $changes = $diff->sequentialChanges($result)
    $diff->renderChanges($changes, $old)

    The final line takes $old and uses the list of changes to recreate $new. I’m planning to use this as version control for my blog. I’ll also put this on github when I have time.

  25. Math says:

    Whaaaa, great ! Thanks for sharing !

  26. heinetz says:

    after i tried out class pear:text_diff i found htis small script and for me it does the same. i like such handy code!

    … but I don’t really know, if it does, what i need. I tried to compare file data5.txt with data6.txt but the result’s no valid html.

    Do I missunderstand, how to use the function ?

    thanks for tipps and

    best,
    heinetz

  27. Jim says:

    That’s a cracking script. And having spent some (considerable) time trying to do the same code myself : respect !

  28. Tony says:

    Someone has stolen your algorithm and claimed it as their own. Alright, it’s under MIT and they’re not selling it, but it’s still *yours*, right?

    http://github.com/YorickPeterse/Tiny-Diff/blob/master/tiny_diff.php#L108

  29. schoschie says:

    Brilliant! After I realized that my head is going to explode trying to work out my own diff algorithm (it’s seems to be trivial at first, but the more you try to actually solve it, the more intricate it gets…), I googled around a little and quickly found this algorithm, with is so short and works perfectly! I’ve plugged into a little CMS I’m working on (not-paid-for in-house system for an art gallery) and it’s just perfect.

    The only changes I had to make is initialize 2 variables from which I got warnings (using error_reporting(-1) during development.

    See an example diff: http://img.ly/2VbJ

    Thank you!

  30. David says:

    Very nice! It is a simple, easy to use piece that seems to be needed more and more in applications as we monitor content.

  31. Pingback: PHP: Funktion zum Anzeigen von Textunterschieden (PHP diff Funktion) | DAVID GEBHARDT

  32. Oskar says:

    Great article! However there seems to be one (or more) bug(s) in the algoritm. First of all I’m using UTF-8 documents, I’m not sure if that matters though. The bug is where the and tags are placed. I added this code to one document:

    Nån som vet någon likadan/bättre tjänst som/än dropbox?Sun, 06 Mar 2011 15:30:54 +0000

    but the resulting html is:

    Nån som vet någon likadan/bättre tjänst som/än dropbox?Sun, 06 Mar 2011 15:30:54 +0000

    Which isn’t correct html, if it matters I duplicated one of my “news-posts” from my website.

  33. Oskar says:

    Great article! However there seems to be one (or more) bug(s) in the algoritm. First of all I’m using UTF-8 documents, I’m not sure if that matters though. The bug is where the and tags are placed. I added this code to one document:
    <!–
    Nån som vet någon likadan/bättre tjänst som/än dropbox?Sun, 06 Mar 2011 15:30:54 +0000

    –>

    but the resulting html is:
    <!–

    Nån som vet någon likadan/bättre tjänst som/än dropbox?Sun, 06 Mar 2011 15:30:54 +0000

    –>

    Which isn’t correct html, if it matters I duplicated one of my “news-posts” from my website. EDIT: Last post was borked

  34. Hi, great job, thanks

  35. Bob says:

    This is excellent! Only thing that could make it better is if it didn’t erronously flag newlines as changes. i.e.:

    This is\n
    \n
    my new line

    The script will say that the “is my” was changed if something else is changed (i.e. you add an ! after line).

    Not sure where to begin to look at fixing that.

  36. bob says:

    Actually I figured out the problem is that an example like “end\n\nfirst” is treated as one “word”. I first need to “wrap” the decline characters in spaces.

    So “end \n \n first”. Then after the check just remove the spaces.

  37. Andrea says:

    Man thanks your function is pretty good.

    But why you don’t add those fix ($ret=” and few others) to kill the notices?

    Thanks al ot

  38. Mario says:

    Works great and saved me a lot of time. Thanks!

  39. Pingback: String diff on punctuation: is preg_split the answer? | Coding Answers

  40. Jeff says:

    I’ve been trying to modify this to include differences in punctuation:

    Currently a diff between “I think, therefore” and “I think therefore” will produce:

    I think,think therefore

    I would like it to produce:

    I think, therefore

    Any thoughts on how to achieve this? I messed around with code referenced by Peter West above but couldn’t get this to work.

  41. yarco says:

    I cant use it…dont why, continue saying $maxlen doest define. So i make one by myself,
    here it is:
    http://bbish.net/05toolsproducts/66/php-simple-diff
    I dont even know it is ok or not, but seems good.

  42. yarco says:

    oh, my version is line diff

  43. matt says:

    Hi Paul.

    First, I would like to thank you for publishing this. I have not been able to make this work and your solution is very helpful to me.

    However, I am having performance issues with this on very large strings, and I cannot figure out how to address them. In the code comments, you said that there are ways to address these issues, and I was wondering, would you would mind sharing them?

  44. Christian says:

    Nice code, thanks for this part, especially the htmlDiff is nice!

    @Jeff: You need to split the text using another algorithm than explode(), f.e. preg_split(), but it’s not that simple to implode those words again, because you need the “word-split” character again. Maybe you could use a word iterator, but this would produce a huge amount of additional code.

    Cheers from Germany,
    Chris

  45. Aviram says:

    Great job Paul! you really made my life easier.
    One question, I’ve been trying to change your code in a way that if someone changes the location of some text in the new version (cuts and pasts) it will not count as a change. but I’m having trouble to understand who this can be done using your current implementation of diff.

    could you please help me?

    Thanks!

  46. Mangal Kumar says:

    Great job Paul!

  47. Mike says:

    Great job, Paul!

    I found another version that choked when it got up to around 100 lines, but your algorithm chewed it up and spit it out very nicely!

    Thanks!

  48. Wojciech says:

    Great script, thanks a lot!

  49. JSG says:

    Great work! BEst working of those I could find!!

    One question, maybe someone more experienced can tell me how to do it? I need to exclude html-tags from comparision, so that is everything between , because there are little differences in the ID that should not be padded with as they break the code….

    Tnaks a lot!

  50. Pingback: print_r() function returns result as simple "Array" for diff algorithm

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>