Diff |
This class implements the difference algorithm published in "An O(ND) Difference Algorithm and its Variations" by Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, p. 251. Many C, Java and Lisp implementations are publicly available, but they all seem to derive from the same source (diffutils), which is under the GNU General Public License and therefore cannot be reused as source code in a closed-source commercial application. There are also very old C implementations that use other (worse) algorithms. Microsoft published the source code of a diff tool (windiff) that uses some tree data. In addition, a direct port from C to C# is not easy, because typical C solutions rely heavily on pointer arithmetic and I need a managed solution. For these reasons I implemented the originally published algorithm from scratch and make it available without the GNU license limitations. I do not need a high-performance diff tool because it is used only occasionally; I will do some performance tweaking when needed. The algorithm itself compares two arrays of numbers, so when comparing two text documents each line is converted into a (hash) number. See DiffText(). Some changes to the original algorithm: The original algorithm was described using a recursive approach and compares zero-indexed arrays. Extracting sub-arrays and rejoining them is very performance- and memory-intensive, so the same (read-only) data arrays are passed around together with their lower and upper bounds. This makes the LCS and SMS functions more complicated. I added some code to the LCS function to get a fast response on sub-arrays that are identical, completely deleted or completely inserted. The result of a comparison is stored in two arrays that flag the modified (deleted or inserted) lines in the two data arrays. These flags are then analyzed to produce an array of Item objects.
Further possible optimizations: (first rule: don't do it; second: don't do it yet) The arrays DataA and DataB are passed as parameters but are never changed after creation, so they could be members of the class to avoid the parameter overhead. SMS contains a lot of boundary arithmetic in the for-D and for-k loops that could be done by incrementing and decrementing local variables. The DownVector and UpVector arrays are always created and destroyed each time SMS is called. They could be reused by turning them into members of the class. See TO-DO: hints.
diff.cs: A port of the algorithm to C#. Created by Matthias Hertel, see http://www.mathertel.de This work is licensed under a Creative Commons Attribution 2.0 Germany License, see http://creativecommons.org/licenses/by/2.0/de/
Changes:
2002.09.20 There was a "hang" in some situations. Now I understand a little bit more of the SMS algorithm. There were overlapping boxes that were analyzed partially differently. One return point is enough. An assertion was added in CreateDiffs, active in debug mode, that counts the number of equal (unmodified) lines in both arrays. The counts must be identical.
2003.02.07 Out-of-bounds error in the Up/Down vector arrays in some situations. The two vectors are now accessed using different offsets that are adjusted using the start k-line. A test case was added.
2006.03.05 Some documentation and a direct Diff entry point.
2006.03.08 Refactored the API to static methods on the Diff class to make usage simpler.
2006.03.10 Now using the standard Debug class for the self-test. Compile with: csc /target:exe /out:diffTest.exe /d:DEBUG /d:TRACE /d:SELFTEST Diff.cs |
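The two data-preparation steps described above (turning each text line into a number, and turning the two "modified" flag arrays into a list of change records) can be sketched roughly as follows. This is a minimal illustration, not the code in Diff.cs: the names `DiffSketch`, `ToCodes`, `CreateHunks` and `Hunk` with its fields are assumptions made for this example, and a dictionary lookup stands in for whatever hash the class actually uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical change record; the field names are assumptions,
// not the members of the Item type mentioned above.
public record Hunk(int StartA, int StartB, int DeletedA, int InsertedB);

public static class DiffSketch
{
    // Map each line to a number so that equal lines get equal numbers.
    // Assumption: a shared dictionary is used instead of a raw hash code,
    // which rules out collisions between different lines.
    public static int[] ToCodes(string[] lines, Dictionary<string, int> codes)
    {
        var result = new int[lines.Length];
        for (int i = 0; i < lines.Length; i++)
        {
            if (!codes.TryGetValue(lines[i], out int code))
            {
                code = codes.Count;      // next unused number
                codes[lines[i]] = code;
            }
            result[i] = code;
        }
        return result;
    }

    // Walk both flag arrays in parallel and group each run of
    // modified lines (deleted in A, inserted in B) into one Hunk.
    public static List<Hunk> CreateHunks(bool[] modifiedA, bool[] modifiedB)
    {
        var hunks = new List<Hunk>();
        int a = 0, b = 0;
        while (a < modifiedA.Length || b < modifiedB.Length)
        {
            bool sameA = a < modifiedA.Length && !modifiedA[a];
            bool sameB = b < modifiedB.Length && !modifiedB[b];
            if (sameA && sameB)
            {
                a++; b++;                // line is unchanged on both sides
                continue;
            }
            int startA = a, startB = b;
            // Consume deleted lines in A (or everything left once B is exhausted).
            while (a < modifiedA.Length && (b >= modifiedB.Length || modifiedA[a])) a++;
            // Consume inserted lines in B (or everything left once A is exhausted).
            while (b < modifiedB.Length && (a >= modifiedA.Length || modifiedB[b])) b++;
            hunks.Add(new Hunk(startA, startB, a - startA, b - startB));
        }
        return hunks;
    }
}
```

For the inputs A = a, b, c and B = a, x, c, d, the flags would be (false, true, false) and (false, true, false, true), and the walk yields two records: one line replaced at index 1, and one line inserted at the end of B. Passing the same dictionary to both `ToCodes` calls is what makes the numbers comparable across the two documents.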