MINE - Finding Patterns in Big Data
Written by Kay Ewbank
Tuesday, 20 December 2011
A new suite of statistical tools called MINE has been made available to help researchers find hidden patterns in vast data sets.
MINE was developed by researchers from the Broad Institute and Harvard University to work on large data sets more effectively. In an article in this week's Science journal (which is behind a paywall), the researchers say the tool analyzes data in a way that no other software program can. That is a pretty bold claim, but as they've made the tool available for you to download and try, they must be confident.
The analysis can identify multiple patterns hidden in information. Sample sets of data include health data from around the world, statistics amassed from a season of major league baseball, and data on the changing bacterial landscape of the gut.
The problem the researchers are attempting to solve is that very large data sets are difficult to analyze. Software exists to search such data sets rapidly so long as you know what you’re looking for, but if a researcher wants to discover which hidden patterns are there, the existing software isn't ideal.
"There are massive data sets that we want to explore, and within them, there may be many relationships that we want to understand,"
said Broad Institute associate member Pardis Sabeti, senior author of the paper and an assistant professor at the Center for Systems Biology at Harvard University.
"The human eye is the best way to find these relationships, but these data sets are so vast that we can't do that. This toolkit gives us a way of mining the data to look for relationships."
The advantage of MINE (Maximal Information-based Nonparametric Exploration) is that it can detect a wide range of patterns and characterize them according to a number of different parameters, providing scores and comparisons for the different kinds of possible relationships.
According to David Reshef, one of the lead authors of the paper, "Detecting Novel Associations in Large Data Sets":
"Standard methods will see one pattern as signal and others as noise."
He went on to explain that a given data set can contain a variety of different types of relationships, and that MINE looks for any type of clear structure within the data, attempting to find all of them and to treat all the potential data patterns equally, concluding:
"This ability to search for patterns in an equitable way offers tremendous exploratory potential in terms of searching for patterns without having to know ahead of time what to search for."
The researchers tested their analytical toolkit on several large data sets, including one consisting of data on the trillions of microorganisms that live in the gut. The research team used MINE to make more than 22 million comparisons and narrowed in on a few hundred patterns of interest that had not been observed before.
The tool works by generating hypotheses for the researchers to examine. If you have a dataset with multiple dimensions to explore, one technique is to calculate some measure of dependence for each pair of variables, rank the pairs by their scores, and examine the top-scoring pairs.
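The score-and-rank technique can be sketched in a few lines of Python. This is only an illustration, not the MINE software: the function names, the toy data and the use of absolute Pearson correlation as the dependence measure are all assumptions made for the example, and any other two-variable statistic could be passed in instead.

```python
import math
from itertools import combinations


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; a stand-in score, not the MINE statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def rank_pairs(columns, score=abs_pearson):
    """Score every pair of named columns and return (score, a, b) best-first."""
    scored = [
        (score(columns[a], columns[b]), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical toy data set: three variables, five observations each.
data = {
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.1, 3.9, 6.2, 7.8, 10.1],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}

ranking = rank_pairs(data)
for value, a, b in ranking:
    print(f"{a} vs {b}: {value:.3f}")
```

With real data the interesting step is the last one: the top of the ranking is a list of candidate relationships to examine, not a list of conclusions.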
The statistic used to measure dependence should have generality and equitability. Generality means that the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships. Equitability means that the statistic should give similar scores to equally noisy relationships of different types.
For instance, if a linear relationship and a sinusoidal relationship are affected by similar amounts of noise, they should get similar scores.
The statistic the researchers use is the maximal information coefficient (MIC), a measure of two-variable dependence developed with the guidelines of generality and equitability in mind. They say MIC comes very close to achieving both goals simultaneously, and that it significantly outperforms competing methods in this regard.
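To give a feel for what a grid-based, information-theoretic score looks like, here is a rough Python sketch. It is not the researchers' algorithm: it only tries grids whose bins hold equal numbers of points, whereas the published method also searches over where the grid lines are placed. The grid-size limit of n to the power 0.6 and the division by the log of the smaller grid dimension follow the published definition of MIC as best understood here. All function names are hypothetical.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins of roughly equal size, by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information (in nats) of two equally long lists of bin labels."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def approx_mic(xs, ys, exponent=0.6):
    """Best normalized mutual information over small equal-frequency grids.

    Simplification: real MIC also optimizes the placement of grid lines.
    """
    budget = len(xs) ** exponent  # maximum number of grid cells
    best = 0.0
    kx = 2
    while kx * 2 <= budget:
        bx = equal_frequency_bins(xs, kx)
        ky = 2
        while kx * ky <= budget:
            by = equal_frequency_bins(ys, ky)
            score = mutual_information(bx, by) / math.log(min(kx, ky))
            best = max(best, score)
            ky += 1
        kx += 1
    return best


# Two noiseless relationships of very different shape (illustrative data).
n = 200
xs = [(i + 0.5) / n for i in range(n)]
linear = [2 * x + 1 for x in xs]
sine = [math.sin(8 * math.pi * x) for x in xs]

linear_score = approx_mic(xs, linear)
sine_score = approx_mic(xs, sine)
print(f"linear: {linear_score:.3f}  sine: {sine_score:.3f}")
```

Both relationships here are noiseless, so an equitable statistic should rate them about equally, and under this sketch both should score close to 1. Adding the same amount of noise to each and watching whether the scores fall together is a simple way to probe equitability for yourself.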
You can try the software for yourself by downloading it (and sample data sets) here: http://www.exploredata.net/Downloads.