A diff algorithm in its most basic form takes two strings, and returns the changes needed to make the old string into the new one. They are useful in comparing different versions of a document or file, to see at a glance what the differences are between the two. Wikipedia, for example, uses diffs to compare the changes between two revisions of the same article.
Solving the problem is not as simple as it seems, and the problem bothered me for about a year before I figured it out. I managed to write my algorithm in PHP, in 18 lines of code. It is not the most efficient way to do a diff, but it is probably the easiest to understand.
It works by finding the longest sequence of words common to both strings, and recursively finding the longest sequences of the remainders of the string until the substrings have no words in common. At this point it adds the remaining new words as an insertion and the remaining old words as a deletion.
You can download the source here: PHP SimpleDiff


WOW very impressive!
This is excellent work, Paul. Thanks for offering it. We’ve been looking for a nice, clean implementation of diff in PHP that is unconstrained by line breaks (we don’t care about where the line breaks are in HTML). We operate a non-profit project that allows volunteers to update our web site, and I need to see where the changes are so I can approve and activate them. I’ve chopped my users’ website updates on whitespace and ‘>’ tag ends, and used your code to present a very clear diff of their changes.
Nice work Paul, this could come in handy sometime :)
This is a great algorithm Paul, but I am trying to figure out how this is different from using PHP’s built-in array_diff() and array_intersect() functions to achieve this.
Your post reminds me of the Edit Distance dynamic problem I first came across in this video from MIT:
http://people.csail.mit.edu/bdean/6.046/dp/dp_8.swf
Rajesh, interesting video. I’ve wondered how to best solve that problem before. I have a hunch that diffing two strings and finding the edit distance reduces to the same problem. However, I used a different algorithm than they used in the video. Their algorithm guarantees an optimal solution, so it might be fun to implement as well.
array_diff() and array_intersect() give the intersection and difference of two sets. When you are comparing two pieces of text, for example, it doesn’t do you much good to know what words have been inserted or removed without knowing where they were inserted and removed from.
Ahh! I get it now. Thanks. I’m going to go away and try to implement a recursive algorithm that does the same thing but uses str_word_count($str, 2) which gives me the “where” bit.
Rajesh, neat, I’ve never seen str_word_count before, but it should do the trick nicely. I used array_keys() to do the same thing since I was working with an array, but if you just want to diff strings it makes more sense to keep them in string format. Let me know how the implementation goes.
I’ve been studying your code and it’s very clever but unfortunately I am not as clever apparently! I am trying to figure out the correct (easy) method to use the diff array it generates to “roll back” the newer text to the old text so just the difference can to be stored.
I think I am getting caught up on how the array key is the word position of the OLD text and not the NEW text? Is that correct? Is there a better way to handle this problem?
Ck, the code is admittedly a bit hard to read. I found I had trouble understanding it myself when I recently read over it to make the Python version (which you should study instead if you know Python, since it is the better of the two). By easiest to understand, I meant the algorithm itself, not my cryptic implementation of it.
The big for loop is to find the largest substring between new and old. The string variables could be swapped for one another in this loop and the output should still be the same.
Well essentially what I’m doing is taking the resulting array your subroutine generates. Then I loop through it an unset anything that’s not an array, so that leaves just the array of “diffs” which are indexed by the word position. That’s what I want to store so if there’s a 1000 word document, and one word inserted and one word deleted, only two small arrays are stored.
The problem is, when I have only the new document to work backwards from, the index positions in your array are based on the old document. It’s occurred to me I could recurse and recount the words every time I effect a change from the array before continuing to the next set and that might “line-up” the word positions but it doesn’t seem to be working and the complexity of the resulting code is giving me a migrane…
So I guess what I am really asking, is there a way to modify your routine so that it stores the index positions based on the new document which would make the reversal (undo) much easier for me? Perhaps after the forward-looking array is generated it could be converted into a reverse-looking one somehow?
(sorry I don’t know the correct terminology, completely self-taught and an amateur at best!)
I think calling diff with the arguments reversed (diff($new_string, $old_string)) might do what you want.
You might also want to look into delta encoding: (http://en.wikipedia.org/wiki/Delta_encoding).
With fresh eyes (and migraine finally gone) I figured it out by studying some samples step-by-step. The resulting code to do rollbacks is so simple it’s embarrassing:
($diffdiff is the $diff with all non-array and blank elements removed but all other keys kept intact. $new is the current text, where we want rollback to $old which is our previous edit.)
$old=explode(” “,$new);
$offset=-1;
foreach ($diffdiff as $key=>$value) {
array_splice ( $old , $key+$offset , count($diffdiff[$key]['i']) , $diffdiff[$key]['d'] );
$offset+=count($diffdiff[$key]['d']) -1;
}
$old=implode(” “,$old);
tada!
(I hope the blog software doesn’t eat the code)
Mate, If your code actually does the job then I owe you big time! I’ve been doing the same work for my academic project for a couple of days lately: getting diff array, stripping it of non-changes and storing in a db for a small version control system I’m building for my app. The only difference is, I’m keeping the original ‘old’ file so what i want to do is to generate the new one basing on the original file and the array of changes (or arrays – for further revisions). Unfortunately, the indexes in the array get all messed up and don’t correspond neither to the old or new file if there is multiple line deletion or insertion in succeeding lines (to be precise – if there number of lines inserted and deleted in one step does not match) and I took me ages to deal with it. Still, I’m only getting the first revision right (the one based on the original file). If I’m trying to get further revisions, meaning I’m not working on the original file anymore but on the one generated from it, It all goes messy again :( I’ll check if your code works for my project as soon as I get back home. I hope I’m free to use it for non-commercial use, since it’s posted on the forum?
Wow, that is simple.. Thanks for posting it.
I was currently looking for something just like that – simple, independent on other libraries, not hundreds of lines of code.
Nice work and (as said above) thanks for releasing it.
Great function Paul! Thought this might help someone out…here is the “diff” function and the presentation code to display the lines that are different, as well as the actual differences in each line highlighted:
(hoping that my code will remain intact here)
$oldValue){
$newKeys = array_keys($new, $oldValue);
foreach($newKeys as $newIndex){
$matrix[$oldIndex][$newIndex] = isset($matrix[$oldIndex - 1][$newIndex - 1]) ?
$matrix[$oldIndex - 1][$newIndex - 1] + 1 : 1;
if($matrix[$oldIndex][$newIndex] > $maxlen){
$maxlen = $matrix[$oldIndex][$newIndex];
$oldMax = $oldIndex + 1 – $maxlen;
$newMax = $newIndex + 1 – $maxlen;
}
}
}
if($maxlen == 0) {
$diffArray = array(array(‘deleted’=>$old, ‘inserted’=>$new));
} else {
$diffArray = array_merge(
diff(array_slice($old, 0, $oldMax), array_slice($new, 0, $newMax)),
array_slice($new, $newMax, $maxlen),
diff(array_slice($old, $oldMax + $maxlen), array_slice($new, $newMax + $maxlen))
);
}
return $diffArray;
}
$diffArray = diff($f2, $f1);
?>
#diffTable {
border-collapse: collapse;
}
#diffTable td {
padding: 1px 5px 1px 10px;
vertical-align: top;
}
td.lineNumber {
border-right: 1px solid #ccc;
}
span.strDiff {
background-color: #B6FF9F;
padding: 0;
margin: 0;
}
$diff) {
if(!is_array($diff)) {
?>
count($diff['inserted']) ? $diff['deleted'] : $diff['inserted'];
foreach($biggest as $diffKey => $diffVal) {
if(isset($diff['deleted'][$diffKey]) &&
isset($diff['inserted'][$diffKey]) &&
trim($diff['deleted'][$diffKey]) == trim($diff['inserted'][$diffKey])) {
$newStr1 = $diff['deleted'][$diffKey];
$newStr2 = $diff['inserted'][$diffKey];
$backgroundStyle = “”;
} else {
//explode the strings so we can highlight the differences between strings
$str1 = explode(” “, $diff['deleted'][$diffKey]);
$str2 = explode(” “, $diff['inserted'][$diffKey]);
$newStr1 = “”;
$newStr2 = “”;
if(count($str1) >= count($str2)) {
foreach($str1 as $strKey => $strVal) {
if(isset($str1[$strKey]) && isset($str2[$strKey])) {
if($str1[$strKey] != $str2[$strKey]) {
$newStr1 .= ‘ ‘.$str1[$strKey].”;
$newStr2 .= ‘ ‘.$str2[$strKey].”;
} else {
$newStr1 .= ‘ ‘.$str1[$strKey];
$newStr2 .= ‘ ‘.$str2[$strKey];
}
} elseif(isset($str1[$strKey]) && !isset($str2[$strKey])) {
$newStr1 .= ‘ ‘.$str1[$strKey];
$newStr2 .= ”;
} else {
$newStr1 .= ”;
$newStr2 .= ‘ ‘.$str2[$strKey];
}
}
} else {
foreach($str2 as $strKey => $strVal) {
if(isset($str2[$strKey]) && isset($str1[$strKey])) {
if($str2[$strKey] != $str1[$strKey]) {
$newStr1 .= ‘ ‘.$str1[$strKey].”;
$newStr2 .= ‘ ‘.$str2[$strKey].”;
} elseif(isset($str2[$strKey]) && !isset($str1[$strKey])) {
$newStr2 .= ‘ ‘.$str2[$strKey];
$newStr1 .= ”;
} else {
$newStr2 .= ”;
$newStr1 .= ‘ ‘.$str1[$strKey];
}
}
}
}
$backgroundStyle = “background-color: #FFCFCF;”;
}
?>
“>
“>
“>
“>
oops that didn’t work out so well. I guess Ill try emailing the code to Paul so he can post it if he wants to. Sorry!
Yeah, unfortunately WordPress doesn’t work well with code in comments. Thanks for e-mailing the code, I’ll have a look.
How hard would it be to adapt this to display the strings side by side with differences highlighted, similar to the wikimedia difference engine?
Great script, thanks!
Hi Paul,
I was told to manage a website that actually already uses your algorithm.
After redesigning the program logic and database to handle UTF8-data instead of iso latin1, I discover line breaks being displayed in a wrong way.
Is your algorithm UTF8-safe?
regards, Chris
A beautiful use of recursion to solve the problem of computing diff!!
Great code, really useful. I was searching for a text diff without having to install PEAR stuff, that’s it. Thanks a LOT ! I’m very grateful.
Paul, this is really a good portion of code in a nice way – thanks a lot. Could you please give us a link to the snipped gbruins mailed you? I am very interested in it. You could send it via mail too.
Wish you a happy new year!
I wrote a diff algorithm a few days ago, but because I’m a perfectionist it had exponentially bad performance (it was essentially a brute force algorithm). So I scrapped mine and spent a few hours working out yours. I never tire of seeing a really good recursive function! After some testing I’m satisfied the performance is really excellent.
Anyway your diff algorithm was not quite suitable for my purposes so I’ve made some changes and additions here:
http://piratepad.net/ep/pad/view/IY48xXzsys/syZUX3SzwI
Notably I’m using preg_split() with a zero length lookahead match, allowing the string to be split by any combination of characters (I’ve found that ” “, “\n” and “.” are a good combination) or by each character using “”.
This also helped with the other feature I’ve added: sequentialChanges() which converts a diff into a list of sequential changes and renderChanges() which applies those changes (not having missing spaces everywhere made this a lot easier).
Here’s an example:
$diff = new Diff
$old = “foo bar zim”
$new = “zim bar gir”
$result = $diff->stringDiff($old, $new, ” “)
$changes = $diff->sequentialChanges($result)
$diff->renderChanges($changes, $old)
The final line takes $old and uses the list of changes to recreate $new. I’m planning to use this as version control for my blog. I’ll also put this on github when I have time.
Whaaaa, great ! Thanks for sharing !