1 Billion Web Pages = 1 Million Dollars? 
Written by Mike James  
Sunday, 16 December 2012  
It seems to be easier that we thought to win "Who Wants to be a Millionaire". All you need is a little data mining and access to the web and answering the questions is almost trivial. Who wants to be a Millionaire is a multiple choice style quiz. A question is asked and a set of possible answers are presented to the contestant. All the contestant has to do is pick the correct answer. There are some interesting strategic complications, such as being able to opt for help from a friend, or reducing the number of alternatives but this is more or less the outline of the game. The rewards are staged and getting a question wrong means that the contestant gets the winnings at the last "milestone"  which could be zero if no milestones have been passed. The first question is can we automate the answering of the multiple choice question using all the huge amounts of data on the web? The answer, according to Shyong (Tony) K. Lam, David M Pennock, Dan Cosley and Steve Lawrence from four different US Universities, is that it turns out to be very easy. It isn't even really a matter of AI  simple data mining will suffice, coupled with a few statistical methods. You can get an improvement with some simple language processing but this doesn't seem to be a big one. The basic method was to take each question and pair it with each possible answer and build a search query. When this query is submitted to Google all you have to do to pick the right answer is to count the number of pages returned. In other words you are using the number of pages returned as a measure of how of the the question occurs along with the answer on the web. The answer with the highest correlation with the question is assumed to be correct. This amazingly simple approach gets over 55% of the questions right. If you add an additional processing step that retrieves the first ten web pages and measures the distance in the text between the question and the answer and use this to weight the results,you can do a little better. But to get a reasonable improvement you have to do some syntactic analysis and work out the proximity of the noun phrases in the question to the possible answers. Using this method, the accuracy goes up to 69%. Using a combination of methods gets you to 70% accuracy, which is impressive given the simple nature of the method. The team also tried out the method with different search engines and you might want to know that the results were:
I think this means that you probably want to take Google with you onto that particular quiz show. The researchers give examples of the types of questions the method tended to get wrong and they are interesting because they give some idea of what it is that limits simple noninterpretive data mining might be subject to:
You might care to ponder how to improve things so at these problems are no longer problems  it isn't easy. The researchers also tackled the problem of working out what the optimal strategy was in terms of risks and rewards as the game progressed. in other words when should you walk away with what you have and give up the quest for the one million dollars. Overall the analysis is interesting and it gives you some idea just how effective simple methods are, if you have enough data. More Information1 Billion Pages = 1 Million Dollars? Mining the Web to Play "Who Wants to be a Millionaire?" Related ArticlesMicrosoft's New Research Center into Social Data How the Music Flows from Place to Place Watson wins Jeopardy!  trick or triumph
To be informed about new articles on I Programmer, install the I Programmer Toolbar, subscribe to the RSS feed, follow us on, Twitter, Facebook, Google+ or Linkedin, or sign up for our weekly newsletter.
Comments
or email your comment to: comments@iprogrammer.info


Last Updated ( Sunday, 16 December 2012 ) 