LHDiff

A Language Independent Technique for Tracking Source Code Lines

Tracking source code lines between two versions of a file is a fundamental step in solving a number of important software maintenance problems, such as locating bug-introducing changes, tracking code fragments or defects across versions, merging file versions, and analyzing software evolution. LHDiff is a language-independent technique for tracking source code lines across versions. It leverages the SimHash technique to speed up the line mapping process. Our evaluation of LHDiff against three state-of-the-art techniques, using test suites containing different degrees of change as well as a mutation-based strategy, shows the high potential of our lightweight technique.
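
To give a feel for how SimHash can speed up line mapping, here is a minimal sketch in Python. It is not the LHDiff implementation: the tokenization (whitespace split), the 64-bit fingerprint size, the MD5-based token hash and the helper names are all assumptions made for illustration. The idea is that similar lines tend to get fingerprints that differ in few bits, so a cheap Hamming-distance comparison can shortlist candidate lines before any expensive string comparison is done.

```python
import hashlib

BITS = 64  # fingerprint width (an assumption for this sketch)

def simhash(line: str) -> int:
    """Return a 64-bit SimHash fingerprint of the whitespace-separated tokens."""
    weights = [0] * BITS
    for token in line.split():
        # MD5 is used only as a stable mixing function, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def nearest_candidates(old_line, new_lines, k=15):
    """Indices of the k new lines whose fingerprints are closest to old_line."""
    target = simhash(old_line)
    ranked = sorted(
        range(len(new_lines)),
        key=lambda j: hamming_distance(target, simhash(new_lines[j])),
    )
    return ranked[:k]
```

A call such as nearest_candidates("int count = 0;", new_lines, k=15) returns a shortlist of line indices; only those lines would then need a detailed similarity check.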

Motivating Example

Consider the following two versions of a source file:
code
Popular file differencing programs (such as the Unix diff utility) cannot track lines that are changed or reordered; instead, they report that those lines are deleted from the old file and that new lines are added in the new version of the file. For the above example, diff cannot detect that line 4 of the old file has moved to line 8 in the new file. Instead, it reports line 4 as deleted and line 8 as added. LHDiff can correctly establish the mapping between line 4 and line 8. Although this is a toy example, it shows the importance of LHDiff: if line 4 contains a bug, you can use LHDiff to track the buggy line into the next version and fix it there.
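
The delete-plus-add behaviour is easy to reproduce. The sketch below is a separate illustration with made-up file contents, and it uses Python's difflib as a stand-in for the Unix diff utility (the two use different algorithms, but both report a moved line as a deletion and an insertion). The moved_lines helper is hypothetical and far simpler than LHDiff: it only pairs up lines whose text is identical.

```python
import difflib

# Made-up file contents: "b = 2" moves from index 1 to index 4 (0-based).
old = ["a = 1", "b = 2", "c = 3", "d = 4", "e = 5"]
new = ["a = 1", "c = 3", "d = 4", "e = 5", "b = 2"]

def diff_ops(old_lines, new_lines):
    """Return the edit operations a sequence-based diff reports."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return list(matcher.get_opcodes())

def moved_lines(old_lines, new_lines):
    """Pair deleted lines with inserted lines that have identical text."""
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in diff_ops(old_lines, new_lines):
        if tag in ("delete", "replace"):
            deleted.extend(range(i1, i2))
        if tag in ("insert", "replace"):
            inserted.extend(range(j1, j2))
    pairs = {}
    for i in deleted:
        for j in inserted:
            if old_lines[i] == new_lines[j] and j not in pairs.values():
                pairs[i] = j
                break
    return pairs

for op in diff_ops(old, new):
    print(op)
print(moved_lines(old, new))
```

The diff output contains a "delete" for the old position and an "insert" for the new one, with nothing linking the two; the second step recovers the link only because the text is unchanged, which is exactly the limitation a line tracking technique has to go beyond.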
You can find more details on our technique in the following link:

Our paper on LHDiff has been accepted as a full technical paper at ICSM 2013. In it, we not only explain the technique in detail but also compare it with other state-of-the-art line tracking techniques using different evaluation methods. The data and code used in our experiment are available for download. If you want to use them to replicate the study or in a different study, please feel free to contact us.

No More Wait: LHDiff tool is now available to download

A command-line version of the tool is now available for download. Please download the jar file and run it from the command line using the following instruction:
java -jar lhdiff.jar

The tool requires the Java Runtime Environment, which can be downloaded from here.

The following command-line options are available in LHDiff:

Option | Description | Default Setting
-i | Ignore case differences | disabled
-k | The size of the mapping candidate set | 15
-p | Context weight (0<CXW<1) and threshold value (0<TH<1) for the combined similarity score used in Step 4. Content weight is automatically set to 1-CXW | 0.4 and 0.45
-cnm | Line content similarity metric | Levenshtein
-cxm | Line context similarity metric | Cosine
-cxs | Context size | 4
-ls | Detect line splits | disabled
-ob | Display both line number and content | display only line number
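
The -p, -cnm and -cxm options can be read together as follows. This is a sketch under stated assumptions, not the tool's actual code: we assume the combined score is a weighted sum of the two similarities (the table only says that the content weight is 1-CXW), that Levenshtein distance is normalized by the longer line, that context is compared as whitespace-separated token counts, and that a pair is accepted when the score reaches the threshold. All function names are hypothetical.

```python
import math
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def content_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1] (normalization is assumed)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def context_similarity(ctx_a, ctx_b) -> float:
    """Cosine similarity between token counts of the surrounding lines."""
    va = Counter(" ".join(ctx_a).split())
    vb = Counter(" ".join(ctx_b).split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return 0.0 if norm == 0 else dot / norm

def combined_similarity(content_sim, context_sim, context_weight=0.4):
    """Weighted sum; the content weight is 1 - context_weight, as in the table."""
    return context_weight * context_sim + (1.0 - context_weight) * content_sim

def is_match(content_sim, context_sim, context_weight=0.4, threshold=0.45):
    """Accept the pair when the combined score reaches the threshold (assumed >=)."""
    return combined_similarity(content_sim, context_sim, context_weight) >= threshold
```

With the defaults, a line whose content is unchanged (content similarity 1.0) scores 0.6 even if its context is completely different, so it is still accepted; raising the context weight makes the surrounding lines count for more.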

Usage Example

usage: java -jar lhdiff.jar [-i] [-k candidateSetSize] [-p contextWeight Threshold]
[-cnm contentMetric] [-cxm contextMetric] [-cxs contextSize] [-ls lineSplit]
[-ob outputBoth] oldfile newfile
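
If you want to drive the tool from a script, the usage line above can be assembled programmatically. The following is a minimal sketch that assumes lhdiff.jar sits in the working directory and java is on the PATH. The function and keyword names are hypothetical, only a few of the listed options are covered, and the output is returned as raw text because its format is not described here.

```python
import subprocess

def build_lhdiff_command(old_file, new_file, jar_path="lhdiff.jar",
                         ignore_case=False, candidate_set_size=None,
                         context_size=None):
    """Build the argument list for one LHDiff run (hypothetical helper)."""
    cmd = ["java", "-jar", jar_path]
    if ignore_case:
        cmd.append("-i")
    if candidate_set_size is not None:
        cmd += ["-k", str(candidate_set_size)]
    if context_size is not None:
        cmd += ["-cxs", str(context_size)]
    # The old and new files come last, as in the usage line.
    cmd += [old_file, new_file]
    return cmd

def run_lhdiff(old_file, new_file, **options):
    """Run LHDiff and return whatever it prints to standard output."""
    result = subprocess.run(
        build_lhdiff_command(old_file, new_file, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, build_lhdiff_command("Old.java", "New.java", ignore_case=True, candidate_set_size=20) yields the same arguments you would type by hand. Each call starts a new Java process, so this approach suits occasional runs rather than very large batches.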


You can also type help on the command line to learn more about the options available in LHDiff.

usage: java -jar lhdiff.jar help

2 comments

  1. I have been using LDiff in my research, but it is relatively slow in practice. It seems that LHDiff could be a great replacement for me.

    The problem is that my code is in C, and because I have to call LDiff/LHDiff many times (millions of times), I cannot afford to create a process to run LDiff/LHDiff each time. So I basically re-implemented LDiff in C in my tool. Now I would like to replace LDiff with LHDiff. Do you have any suggestions to reduce my work?

    Thanks!

    • Hi Meng,
      I am happy that you find the tool interesting. I have the LHDiff code implemented in Java. If you want to use LHDiff in your work, I can share it with you. Running the tool from the command line is very slow compared with running the Java code directly. You could probably port the code to C without much trouble.
      But I don't have an implementation in C currently. Thanks.
