How do programs change over time?

Jason Dagit and I recently wrote a paper that got accepted at the “DChanges 2013” workshop in Florence, Italy this September.  Alas, I didn’t get to attend since I had to be at another conference at the same time.  Jason had to endure a trip to Italy solo – which, I’m sure was a huge burden.  🙂  That said, this is a bit of work that I am quite excited to share with the world.

The paper is available on the ArXiV entitled “Identifying Change Patterns in Software History” – the final version isn’t much different (the reviewers liked it and didn’t request many changes), and fortunately the workshop has an open access policy on the publications.  So, the final version will be available in the wild in the not-too-distant future and won’t likely be too hard to turn up.

The problem we looked at is simply stated: what does the change history of a piece of software as represented in a source control system tell us about what people did to it over time?  Anyone who has worked on a project for any substantial amount of time knows that working on code isn’t dominated by adding new features – it is mostly an exercise of cleaning up and repairing flaws, reorganizing code to make it easier to add to in the future, and adding things that make the code more robust.  During the process of making these changes, I have often found that it feels like I do similar things over and over – add a null pointer check here, rearrange loop counters there, add parameters to functions elsewhere.  Odds are, if you asked a programmer “what did you have to do to address issue X in your code”, they would describe a pattern instead of an explicit set of specific changes, such as “we had to add a status parameter and tweak loop termination tests”.

This led us to ask : can we identify the patterns that the programmers had in mind that they would tell you in response to our question of “what did you do”, simply by looking at the evolutions stored in a version control system?  While easily stated, the question we looked at in our paper was how one actually goes about doing this.

We started with some work I had developed as part of a Dept. of Energy project called “COMPOSE-HPC” where we built infrastructure to manipulate programs in their abstract syntax form via a generic text representation of their syntax trees.  The representation we chose was the Annotated Term form used by the Stratego/XT and Spoofax Workbench projects.  A benefit of the ATerm form for programs is that it allows us to separate the language parser from the analyzer – parsing takes place in whatever compiler front end is available, and all we require is a traversal of the resulting parse tree or resulting abstract syntax tree that can emit terms that conform to the ATerm format.

Instead of reproducing the paper here, I recommend that interested people download it and read it over.  I believe the slides will be posted online at some point so you can see the presentation that Jason gave.  The short story is that given the generic ATerm representation for the program, we combined a structural differencing algorithm (specifically, the one discussed in this paper) with the well known anti-unification algorithm to identify change points and distill the abstract pattern represented by the change.  The details in the paper that aren’t reproduced here are how we defined a metric of similarity for clustering changes such that anti-unification could be applied over groups of similar changes to yield meaningful patterns.  To show the idea at work, we used the existing Haskell language-java parser to parse code and wrote a small amount of code to emit an aterm representation that could be analyzed.  We applied it to two real open source repositories – the one for the ANTLR parser project and the Clojure compiler.  It was satisfying to apply it to real repositories instead of contrived toy repositories – we felt that the fact the idea didn’t fall over when faced with the complexity and size of real projects indicated that we had something of real interest to share with the world here.

The code for this project will be up on github in the not too distant future – I will update this post when it becomes available.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s