Fine-grained and Accurate Source Code Differencing
Proposal
Main paper can be found here.
- an efficient AST diff algorithm
- not interested in the shorted edit scripts, but the edit script that closely mimics the developer's intent
- also includes move actions (eg: when doing source code refactor)
Summary
- All the code is open-sourced here.
- AST differencing - computing a sequence of edit actions to transform one AST into another (similar to edit distance on strings). This sequence is called edit script
- following operations are considered in an edit script
- $$update(t, v)$$ = replace old value of node $$t$$ with value $$v$$
- $$add(t, t_p, i, l, v)$$ = adds a new node $$t$$ with label $$l$$ and value $$v$$. It will be added as the $$i$$th child of parent node $$t_p$$, if this is not specified, then it will be added as the new root node
- $$delete(t)$$ = delete the node $$t$$
- $$move(t, t_p, i)$$ = move node $$t$$ as the $$i$$th child of parent node $$t_p$$. This moves the whole subtree
- If move operation is included, finding the shortest sequence of edits will be a NP-hard problem
- typical solutions involve 2 step process
- finding the mapping between nodes of the 2 ASTs
- use this mapping to compute the sequence of edits
- this paper only focuses on the first step. For second step the authors refer us to this work
- This is further split into 2 sub-steps
- greedy top-down algo to find isomorphism in the sub trees - anchor mappings
- bottom-up algo where 2 nodes match if their descendants include a large number of anchor mappings - container mapping. When 2 nodes match, apply an optimal algo to search for additional mappings (recovery mapping) among their descendants
- height of a node is 1 if it is a leaf node, else maximum height of all its children plus 1.
- top-down phase
- uses data structure called height-indexed priority list. It is a sequence of
nodes with decreasing height. Following are the ops supported
- push - pushes a node onto the list
- peekMax - returns the greatest height of the list
- pop - returns and removes all nodes with height peekMax
- open - pushes all children of a given node into the list
- dice function - ratio of common descendants between 2 nodes
- $$dice(t_1, t_2, M) = \frac{2 |t_1 \epsilon s(t_1)|}{|s(t_1)| + |s(t_2)|}$$
- $$M$$ = mapping
- $$s(t_)$$ = set of descendants of $$t_$$
- uses data structure called height-indexed priority list. It is a sequence of
nodes with decreasing height. Following are the ops supported
- bottom-up phase
- takes the output of the top-down phase as input
- maps the remaining unmapped nodes from top-down phase
- the perf bottleneck of this approach is the parsing time to generate the ASTs
- thus a "simple" difference tool like
diff
can be a lot faster than this! - however, such tools are limited to only line-level granularity
- thus a "simple" difference tool like