|Identifying Programmers From Executable Binaries|
|Written by Mike James|
|Wednesday, 06 January 2016|
It's no surprise that programmers have different styles. What comes as a shock is that these are still evident when the code is compiled and you produce an executable binary.
When you look at someone's code you can often see that they have a particular style: how they name variables, the comments they use, the indenting schemes and other important details. A lot of this personal preference is removed when the code is compiled, but there still seems to be enough left to identify the programmer. This has all sorts of implications.
The study took the code of 600 programmers from the annual programming competition, Google Code Jam. The skill of the programmer was measured by how far they progressed in the competition. The code samples, all written in C++, were trying to solve the same programming task and hence the main differences between the code could possibly be attributed to coding style among other things.
Given the binary code, the problem of identifying the programmer was treated as a machine learning problem - which of course it is. The first task was to extract features, and this was done by disassembling the code and then decompiling it back to C++. The exact details of the reverse engineering involved are interesting and are given in the paper. As well as the assembly and the reconstructed C++ code, an abstract syntax tree and a control flow graph were used to provide features. Rather than a neural network, a random forest classifier was used to learn each programmer's characteristics from the hand-constructed features.
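To give a feel for what "extracting features" from a disassembly can look like, here is a minimal Python sketch. It is not the feature set used in the paper: the listing format, the function names `instruction_ngrams` and `to_vector`, and the choice of instruction bigrams are all assumptions made purely for illustration.

```python
from collections import Counter


def instruction_ngrams(disassembly, n=2):
    """Count n-grams of instruction mnemonics in a disassembly listing.

    Assumes (hypothetically) one instruction per line, labels ending
    in ':' and comments introduced by ';'.
    """
    mnemonics = []
    for line in disassembly.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.endswith(":"):     # skip blank lines and labels
            continue
        mnemonics.append(line.split()[0].lower())
    # Slide a window of length n over the mnemonic sequence.
    grams = zip(*(mnemonics[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def to_vector(counts, vocabulary):
    """Turn n-gram counts into relative frequencies over a fixed vocabulary."""
    total = sum(counts.values()) or 1
    return [counts[term] / total for term in vocabulary]


# A made-up listing, standing in for the output of a real disassembler.
SAMPLE = """
main:
    push rbp
    mov rbp, rsp
    mov eax, 0      ; return value
    pop rbp
    ret
"""

counts = instruction_ngrams(SAMPLE)
vocabulary = sorted(counts)
vector = to_vector(counts, vocabulary)
print(dict(zip(vocabulary, vector)))
```

In a real experiment the vocabulary would be built from all the training binaries, and vectors like this one, labeled with the programmer's identity, would be handed to a classifier such as a random forest.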
The results are impressive. With 20 programmers, 96% of the classifications were correct. The classifier was trained on 8 executables for each programmer, which represents a lot of examples for this sort of study.
When the approach was tried on a larger data set, 600 programmers, the accuracy fell to 52%. It was also demonstrated that for unoptimized compilation of 100 programmers the accuracy was 78%, but when an optimizing compiler was used the accuracy fell to 64%. You might expect an optimizing compiler to remove even more of the personal traits of the programmer from the binary.
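To put these figures in context, it helps to compare them with blind guessing. The short Python sketch below assumes that every programmer is equally likely to be the author, so a random guess is right 1 time in N. That assumption and the helper name `chance_accuracy` are mine, not the paper's. The accuracies are the ones quoted above.

```python
def chance_accuracy(num_programmers):
    """Accuracy of guessing at random among equally likely programmers."""
    return 1.0 / num_programmers


# (number of programmers, reported accuracy) as quoted in the article.
reported = [
    (20, 0.96),
    (100, 0.78),   # unoptimized compilation
    (100, 0.64),   # optimizing compiler
    (600, 0.52),
]

for n, accuracy in reported:
    baseline = chance_accuracy(n)
    print(f"{n:>3} programmers: reported {accuracy:.0%}, "
          f"chance {baseline:.2%}, ratio {accuracy / baseline:.0f}x")
```

Seen this way, the fall to 52% for 600 programmers is less of a disappointment than it first appears, because the chance level shrinks much faster than the accuracy does.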
Some other interesting results are quoted in the paper. The most generally interesting is that more advanced programmers are easier to recognize compared to beginners. This suggests that beginners tend to code in the same way while more expert programmers are more individual and have distinct coding styles.
So does all this matter?
Apart from the interesting results that individual style develops with experience and survives many transformations to machine code, there is also the question of forensics. If you plan to write any malware then make sure that you don't leave any compiled code around where it could be used to identify you. Equally, identification of the programmer might be helpful in disputes about who did what in a successful company. Could code style be used to address questions like "are there any blocks of code left in Facebook that Mark Zuckerberg wrote?" and to prove who Satoshi Nakamoto, the anonymous inventor of Bitcoin, was? Probably not in practice.
Then there is the question of building tools that take raw code and scramble it in such a way that its author can't be identified. You could even try to build a tool that made code written by A look like code in the style of B.
There are also some interesting questions about the methodology used. For example, what would the classification error be if the raw machine code were fed to a neural network? After all, it should be capable of noticing the regularities that the reverse engineering used to create features. There is also the question of how well this generalizes to other languages. C++ is well known, and indeed widely criticised, for being so flexible that you can code in almost any style, from low-level C to sophisticated object-oriented. Perhaps this is as much about C++ as the programmers.
Clearly more work would be interesting.
When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries, by Aylin Caliskan-Islam, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt, and Arvind Narayanan
|Last Updated ( Wednesday, 06 January 2016 )|