Proposal

Main paper can be found here.

  • an efficient AST diff algorithm
  • not interested in the shorted edit scripts, but the edit script that closely mimics the developer's intent
  • also includes move actions (eg: when doing source code refactor)

Summary

  • All the code is open-sourced here.
  • AST differencing - computing a sequence of edit actions to transform one AST into another (similar to edit distance on strings). This sequence is called edit script
  • following operations are considered in an edit script
    • $$update(t, v)$$ = replace old value of node $$t$$ with value $$v$$
    • $$add(t, t_p, i, l, v)$$ = adds a new node $$t$$ with label $$l$$ and value $$v$$. It will be added as the $$i$$th child of parent node $$t_p$$, if this is not specified, then it will be added as the new root node
    • $$delete(t)$$ = delete the node $$t$$
    • $$move(t, t_p, i)$$ = move node $$t$$ as the $$i$$th child of parent node $$t_p$$. This moves the whole subtree
  • If move operation is included, finding the shortest sequence of edits will be a NP-hard problem
  • typical solutions involve 2 step process
    • finding the mapping between nodes of the 2 ASTs
    • use this mapping to compute the sequence of edits
  • this paper only focuses on the first step. For second step the authors refer us to this work
  • This is further split into 2 sub-steps
    • greedy top-down algo to find isomorphism in the sub trees - anchor mappings
    • bottom-up algo where 2 nodes match if their descendants include a large number of anchor mappings - container mapping. When 2 nodes match, apply an optimal algo to search for additional mappings (recovery mapping) among their descendants
  • height of a node is 1 if it is a leaf node, else maximum height of all its children plus 1.
  • top-down phase
    • uses data structure called height-indexed priority list. It is a sequence of nodes with decreasing height. Following are the ops supported
      • push - pushes a node onto the list
      • peekMax - returns the greatest height of the list
      • pop - returns and removes all nodes with height peekMax
      • open - pushes all children of a given node into the list
    • dice function - ratio of common descendants between 2 nodes
      • $$dice(t_1, t_2, M) = \frac{2 |t_1 \epsilon s(t_1)|}{|s(t_1)| + |s(t_2)|}$$
      • $$M$$ = mapping
      • $$s(t_)$$ = set of descendants of $$t_$$
  • bottom-up phase
    • takes the output of the top-down phase as input
    • maps the remaining unmapped nodes from top-down phase
  • the perf bottleneck of this approach is the parsing time to generate the ASTs
    • thus a "simple" difference tool like diff can be a lot faster than this!
    • however, such tools are limited to only line-level granularity