C# Class my.utils.Diff

This class implements the difference algorithm published in "An O(ND) Difference Algorithm and its Variations" by Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, p. 251.

There are many C, Java, and Lisp implementations publicly available, but they all seem to come from the same source (diffutils), which is under the (unfree) GNU General Public License and cannot be reused as source code for a commercial application. There are very old C implementations that use other (worse) algorithms. Microsoft also published the source code of a diff tool (windiff) that uses some tree data. In addition, a direct port from a C source to C# is not easy because the typical C solutions rely heavily on pointer arithmetic, and I need a managed solution. These are the reasons why I implemented the originally published algorithm from scratch and make it available without the GNU license limitations.

I do not need a high-performance diff tool because it is used only occasionally. I will do some performance tweaking when needed.

The algorithm itself compares 2 arrays of numbers, so when comparing 2 text documents each line is converted into a (hash) number. See DiffText().

Some changes to the original algorithm: The original algorithm was described using a recursive approach and comparing zero-indexed arrays. Extracting sub-arrays and rejoining them is very performance and memory intensive, so the same (read-only) data arrays are passed around together with their lower and upper bounds. This makes the LCS and SMS functions more complicated. I added some code to the LCS function to get a fast response on sub-arrays that are identical, completely deleted, or completely inserted. The result of a comparison is stored in 2 arrays that flag the modified (deleted or inserted) lines in the 2 data arrays. These flags are then analyzed to produce an array of Item objects.
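The bounds-passing idea and the fast path for trivial sub-arrays can be illustrated with a minimal sketch. This is not the code from diff.cs: the class name BoundsSketch, the method MarkTrivialRanges, and its parameter layout are assumptions made for this example. It works on the half-open ranges [lowerA, upperA) and [lowerB, upperB), skips equal elements at both ends, and sets the modified flags directly when one side has nothing left.

```csharp
using System;

// Illustrative sketch only; names and signature are assumptions, not the diff.cs API.
public static class BoundsSketch
{
    // Handles the trivial cases for the ranges a[lowerA..upperA) and b[lowerB..upperB).
    // Returns true when the ranges were fully resolved; otherwise the narrowed
    // bounds are written back so a general LCS step could continue on them.
    public static bool MarkTrivialRanges(
        int[] a, ref int lowerA, ref int upperA, bool[] modifiedA,
        int[] b, ref int lowerB, ref int upperB, bool[] modifiedB)
    {
        // Skip equal elements at the start of both ranges.
        while (lowerA < upperA && lowerB < upperB && a[lowerA] == b[lowerB])
        {
            lowerA++;
            lowerB++;
        }

        // Skip equal elements at the end of both ranges.
        while (lowerA < upperA && lowerB < upperB && a[upperA - 1] == b[upperB - 1])
        {
            upperA--;
            upperB--;
        }

        if (lowerA == upperA)
        {
            // Nothing left in A: everything remaining in B was inserted.
            for (int i = lowerB; i < upperB; i++) modifiedB[i] = true;
            return true;
        }

        if (lowerB == upperB)
        {
            // Nothing left in B: everything remaining in A was deleted.
            for (int i = lowerA; i < upperA; i++) modifiedA[i] = true;
            return true;
        }

        // Both ranges still contain elements; the general algorithm is needed.
        return false;
    }
}
```

Because only indices move, no sub-array is ever copied; the same two data arrays and the same two flag arrays are shared by every call.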
Further possible optimizations (first rule: don't do it; second: don't do it yet): The arrays DataA and DataB are passed as parameters but are never changed after creation, so they could be members of the class to avoid the parameter overhead. SMS contains a lot of boundary arithmetic in the for-D and for-k loops that could be done by incrementing and decrementing local variables. The DownVector and UpVector arrays are always created and destroyed each time SMS gets called; they could be reused by making them members of the class. See the TODO: hints.

diff.cs: A port of the algorithm to C#
Copyright (c) by Matthias Hertel, http://www.mathertel.de
This work is licensed under a BSD style license. See http://www.mathertel.de/License.aspx

Changes:
2002.09.20 There was a "hang" in some situations. Now I understand a little bit more of the SMS algorithm. There were overlapping boxes that were analyzed partially differently. One return point is enough. An assertion was added in CreateDiffs for debug mode that counts the number of equal (unmodified) lines in both arrays. The counts must be identical.
2003.02.07 Out-of-bounds error in the Up/Down vector arrays in some situations. The two vectors are now accessed using different offsets that are adjusted using the start k-line. A test case was added.
2006.03.05 Some documentation and a direct Diff entry point.
2006.03.08 Refactored the API to static methods on the Diff class to make usage simpler.
2006.03.10 Using the standard Debug class for self-test now. Compile with: csc /target:exe /out:diffTest.exe /d:DEBUG /d:TRACE /d:SELFTEST Diff.cs
2007.01.06 License agreement changed to a BSD style license.
2007.06.03 Added the Optimize method.
2007.09.23 UpVector and DownVector optimization by Jan Stoklasa.
2008.05.31 Adjusted the testing code that failed because of the Optimize method (not a bug in the diff algorithm).
2008.10.08 Fixed a test case and added a new test case.

Public Methods

Method Description
DiffInt ( int[] ArrayA, int[] ArrayB ) : System.Item[]

Find the difference in 2 arrays of integers.

DiffText ( string TextA, string TextB ) : System.Item[]

Find the difference in 2 texts, comparing by textlines.

DiffText ( string TextA, string TextB, bool trimSpace, bool ignoreSpace, bool ignoreCase ) : System.Item[]

Find the difference in 2 text documents, comparing by text lines. The algorithm itself compares 2 arrays of numbers, so when comparing 2 text documents each line is converted into a (hash) number. This hash value is computed by storing all text lines in a common hashtable, so that duplicates can be found there, and generating a new number each time a new text line is inserted.

Main ( string[] args ) : int

Start a self-test / box-test for some diff cases and report the results to the debug output.

TestHelper ( System.Item f ) : string

Private Methods

Method Description
CreateDiffs ( DiffData DataA, DiffData DataB ) : System.Item[]

Scan the tables of which lines are inserted and deleted, producing an edit script in forward order.

DiffCodes ( string aText, Hashtable h, bool trimSpace, bool ignoreSpace, bool ignoreCase ) : int[]

This function converts every text line into a number, assigning the same number to identical lines and a distinct number to each distinct line, so that all further processing works on simple numbers only.

LCS ( DiffData DataA, int LowerA, int UpperA, DiffData DataB, int LowerB, int UpperB, int[] DownVector, int[] UpVector ) : void

This is the divide-and-conquer implementation of the longest common subsequence (LCS) algorithm. The published algorithm recursively passes parts of the A and B sequences. To avoid copying these arrays, the lower and upper bounds are passed while the sequences stay constant.

Optimize ( DiffData Data ) : void

If a sequence of modified lines starts with a line that has the same content as the line immediately following the sequence, the difference sequence is shifted so that the following line, and not the starting line, is marked as modified. This leads to more readable diff sequences when comparing text files.

SMS ( DiffData DataA, int LowerA, int UpperA, DiffData DataB, int LowerB, int UpperB, int[] DownVector, int[] UpVector ) : SMSRD

This is the algorithm to find the Shortest Middle Snake (SMS).
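The two post-processing steps described for Optimize and CreateDiffs can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from diff.cs: the class name EditScriptSketch, the methods ShiftBlocks and Scan, and the tuple used in place of the Item type are all hypothetical. Both methods work on plain flag arrays in which true marks a deleted (A) or inserted (B) line.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; type and member names are assumptions, not the diff.cs API.
public static class EditScriptSketch
{
    // Shifts a modified block down by one line when the line just after the block
    // equals the first line of the block, repeating while that still holds.
    public static void ShiftBlocks(int[] data, bool[] modified)
    {
        int start = 0;
        while (start < data.Length)
        {
            while (start < data.Length && !modified[start]) start++;
            int end = start;
            while (end < data.Length && modified[end]) end++;

            if (end < data.Length && data[start] == data[end])
            {
                modified[start] = false; // first line of the block becomes unchanged
                modified[end] = true;    // following line joins the block instead
            }
            else
            {
                start = end;
            }
        }
    }

    // Scans both flag arrays in forward order and groups runs of modified lines
    // into (StartA, StartB, DeletedA, InsertedB) entries.
    public static List<(int StartA, int StartB, int DeletedA, int InsertedB)> Scan(
        bool[] modifiedA, bool[] modifiedB)
    {
        var edits = new List<(int StartA, int StartB, int DeletedA, int InsertedB)>();
        int lineA = 0, lineB = 0;
        while (lineA < modifiedA.Length || lineB < modifiedB.Length)
        {
            if (lineA < modifiedA.Length && !modifiedA[lineA] &&
                lineB < modifiedB.Length && !modifiedB[lineB])
            {
                // Equal lines: advance in both sequences.
                lineA++;
                lineB++;
                continue;
            }

            int startA = lineA, startB = lineB;
            while (lineA < modifiedA.Length && (lineB >= modifiedB.Length || modifiedA[lineA])) lineA++;
            while (lineB < modifiedB.Length && (lineA >= modifiedA.Length || modifiedB[lineB])) lineB++;

            if (startA < lineA || startB < lineB)
                edits.Add((startA, startB, lineA - startA, lineB - startB));
        }
        return edits;
    }
}
```

Scan assumes that both flag arrays contain the same number of unmodified lines, which is the property the debug-mode assertion mentioned in the change log checks.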

Method Details

DiffInt() public static method

Find the difference in 2 arrays of integers.
public static DiffInt ( int[] ArrayA, int[] ArrayB ) : System.Item[]
ArrayA int[] A-version of the numbers (usually the old one)
ArrayB int[] B-version of the numbers (usually the new one)
return System.Item[]

DiffText() public method

Find the difference in 2 texts, comparing by textlines.
public DiffText ( string TextA, string TextB ) : System.Item[]
TextA string A-version of the text (usually the old one)
TextB string B-version of the text (usually the new one)
return System.Item[]

DiffText() public static method

Find the difference in 2 text documents, comparing by text lines. The algorithm itself compares 2 arrays of numbers, so when comparing 2 text documents each line is converted into a (hash) number. This hash value is computed by storing all text lines in a common hashtable, so that duplicates can be found there, and generating a new number each time a new text line is inserted.
public static DiffText ( string TextA, string TextB, bool trimSpace, bool ignoreSpace, bool ignoreCase ) : System.Item[]
TextA string A-version of the text (usually the old one)
TextB string B-version of the text (usually the new one)
trimSpace bool When set to true, all leading and trailing whitespace characters are stripped before the comparison is done.
ignoreSpace bool When set to true, every run of whitespace characters is converted to a single space character before the comparison is done.
ignoreCase bool When set to true, all characters are converted to their lowercase equivalent before the comparison is done.
return System.Item[]
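
The line-to-number conversion described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code from diff.cs: the class LineCodeSketch and the method ToCodes are hypothetical names, a generic Dictionary stands in for the Hashtable mentioned in the text, and the exact normalization calls (Trim, a whitespace regular expression, ToLowerInvariant) are guesses at how the three flags might be applied.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative sketch only; names are assumptions, not the diff.cs API.
public static class LineCodeSketch
{
    // Converts each line of the text into a number. Identical (normalized) lines
    // receive the same number; every new line gets the next unused number.
    public static int[] ToCodes(string text, Dictionary<string, int> known,
        bool trimSpace, bool ignoreSpace, bool ignoreCase)
    {
        string[] lines = text.Replace("\r", "").Split('\n');
        int[] codes = new int[lines.Length];

        for (int i = 0; i < lines.Length; i++)
        {
            string s = lines[i];
            if (trimSpace) s = s.Trim();
            if (ignoreSpace) s = Regex.Replace(s, "\\s+", " ");
            if (ignoreCase) s = s.ToLowerInvariant();

            if (!known.TryGetValue(s, out int code))
            {
                code = known.Count + 1; // next unused number
                known[s] = code;
            }
            codes[i] = code;
        }
        return codes;
    }
}
```

Passing the same dictionary for both texts is what makes the numbers comparable: a line that occurs in both documents maps to the same number in both arrays.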

Main() public static method

Start a self-test / box-test for some diff cases and report the results to the debug output.
public static Main ( string[] args ) : int
args string[] not used
return int

TestHelper() public static method

public static TestHelper ( System.Item f ) : string
f System.Item
return string