A free Excel plug-in that takes a new approach to automatically finding data errors in spreadsheets has been released. Rather than looking for errors in programs and formulas, CheckCell sets out to debug the underlying data.
The tool, called CheckCell comes from Daniel Barowy and Dimitar Gochev, computer science doctoral students at the University of Massachusetts Amherst. It was launched at this year's international computer programming language design conference, OOPSLA.
In his conference session Daniel Barowy told delegates:
"Consider the case of a paper written by Harvard economists Carmen Reinhart and Kenneth Rogoff a couple of years ago. The paper was influential, lending credibility to government austerity measures in Europe and the United States. But in 2013, UMass Amherst economist Thomas Herndon and colleagues found, in combing through the data by hand, that methodological errors undermined Reinhart and Rogoff's argument. In particular, Reinhart and Rogoff exaggerated the impact of key data values in a spreadsheet."
The CheckCell group, supervised by Prof Emery Berger. wondered whether software might be developed to find these kinds of errors automatically and successfully devised an approach that found a number of the same errors uncovered by Herndon.
The paper presented at OOPSLA refers to the approach as data debugging, which combines program analysis and statistical analysis to automatically find potential data errors by locating data that has a disproportionate impact on the computation.
Putting it more informally Barowy explains:
"Basically, CheckCell identifies data points that have a big impact on the final result, even if the impact is super subtle and difficult to detect. CheckCell immediately flags data points that are very suspicious, the ones that deserve a second look. It's like having a helper who says, 'pay attention to these cells, they really matter.'"
The technique uses what Berger calls "a threshold of unusualness." CheckCell marks high-impact data points in red and asks the spreadsheet designer to check them. If they are indeed correct, they turn green and will not be flagged in subsequent analyses.
The example cited involves getting the digits in a data point transposed, so when a teacher has an “A” student who would be expected to get a 94 on a test and the spreadsheet says that student got a 49, CheckCell would flag it. A similar example is used in the paper:
Isn't this just range checking of the sort we have always told users to build into spreadsheets?
I think it is fair to say it is a valuable extension of the idea that isn't as coarse grained as manual checks would be and by working from the unexpected outcome to the data brings a welcome degree of automation to the problem.
Berger says that in the future the team plan to extend CheckCell's use to large-scale data sets where errors that have a large impact are difficult to find. Having already shown that CheckCell would have uncovered some of the flaws in the Reinhart/Rogoff data, this seems a worthwhile endeavor.
You can install CheckCell 1.0.2 for Excel 10 and later via GitHub and its source code is also available. gCheckCell, a version for Google Spreadsheet is also on GitHub.