1 Billion Web Pages = 1 Million Dollars?
Written by Mike James   
Sunday, 16 December 2012

It seems to be easier than we thought to win "Who Wants to Be a Millionaire?". All you need is a little data mining and access to the web, and answering the questions is almost trivial.

Who Wants to Be a Millionaire? is a multiple-choice quiz. A question is asked and a set of possible answers is presented to the contestant. All the contestant has to do is pick the correct answer. There are some interesting strategic complications, such as being able to ask a friend for help or to have the number of alternatives reduced, but this is more or less the outline of the game. The rewards are staged, and getting a question wrong means that the contestant leaves with the winnings at the last "milestone" - which could be zero if no milestone has been passed.


The first question is: can we automate the answering of multiple-choice questions using the huge amount of data on the web?

The answer, according to Shyong (Tony) K. Lam, David M. Pennock, Dan Cosley and Steve Lawrence, from four different US universities, is that it turns out to be very easy. It isn't even really a matter of AI - simple data mining, coupled with a few statistical methods, will suffice. You can get an improvement with some simple language processing, but it doesn't seem to be a big one.

The basic method was to take each question, pair it with each possible answer, and build a search query from the pair. When the query is submitted to Google, all you have to do to pick the right answer is count the number of pages returned. In other words, you are using the number of pages returned as a measure of how often the question occurs along with the answer on the web. The answer that co-occurs most often with the question is assumed to be correct.
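The core of this page-count heuristic can be sketched in a few lines of Python. Here `hit_count` is a hypothetical stand-in for a search-engine hit count - the real system queried Google - and the canned counts are purely illustrative:

```python
# Sketch of the naive co-occurrence method: pair the question with each
# candidate answer, query for each pair, pick the answer with the most hits.

def pick_answer(question, answers, hit_count):
    """Return the answer whose question+answer query matches the most pages.

    hit_count is any callable mapping a query string to a page count;
    in a real system it would wrap a search-engine API.
    """
    counts = {a: hit_count(f"{question} {a}") for a in answers}
    return max(counts, key=counts.get)

# Toy demonstration with canned hit counts standing in for a search engine.
fake_counts = {
    "Who wrote Hamlet? Shakespeare": 90000,
    "Who wrote Hamlet? Dickens": 1200,
    "Who wrote Hamlet? Tolstoy": 800,
    "Who wrote Hamlet? Austen": 950,
}
best = pick_answer("Who wrote Hamlet?",
                   ["Shakespeare", "Dickens", "Tolstoy", "Austen"],
                   lambda q: fake_counts.get(q, 0))
print(best)  # Shakespeare
```

Note there is no language understanding here at all - the method simply rewards whichever answer keeps the best company with the question's words on the web.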

This amazingly simple approach gets over 55% of the questions right. If you add a processing step that retrieves the first ten web pages, measures the distance in the text between the question and the answer, and uses this to weight the results, you can do a little better. But to get a reasonable improvement you have to do some syntactic analysis and work out the proximity of the noun phrases in the question to the possible answers. Using this method, the accuracy goes up to 69%. A combination of methods gets you to 70% accuracy, which is impressive given how simple they are.

The team also tried out the method with different search engines and you might want to know that the results were:

Engine       Accuracy (combined methods)
Google       70%
AltaVista    68%
AllTheWeb    66%
MSN (Bing)   58%


I think this means that you probably want to take Google with you onto that particular quiz show.

The researchers give examples of the types of question the method tended to get wrong, and they are interesting because they give some idea of the limits that simple, non-interpretive data mining is subject to:

  1. Common Sense. How many legs does a fish have? 0, 2, or 4? This information may exist on the web, but is probably not spelled out.
  2. Multiple Plausible Answers. What does the letter "I" stand for in the computer company name "IBM"? Information, International, Industrial, or Infrastructure? "Information" probably appears just as often as "International" in the context of IBM.
  3. Polysemy. Which of these parts of a house shares its name with a viewing area on a computer screen? Wall, Root, Window, or Basement? The words "root" and "computer" often co-occur (e.g., the Unix superuser). This question also suggests that biases in the content of the web - originally by and for technical, computer-literate users - may hamper using the web as a general knowledge base in some instances.
  4. Non-Textual Knowledge. Which of these cities is located in Russia? Kiev, Minsk, Odessa, or Omsk? The program doesn't know how to read maps.
  5. Alternative Representations. Who is Flash Gordon's archenemy? Doctor Octopus, Sinestro, Ming the Merciless, or Lex Luthor? The word "archenemy" usually appears as two words ("arch enemy") on Flash Gordon (and other) pages.

You might care to ponder how to improve things so that these problems are no longer problems - it isn't easy.

The researchers also tackled the problem of working out the optimal strategy in terms of risks and rewards as the game progresses - in other words, when you should walk away with what you have and give up the quest for the one million dollars.
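The flavour of that risk calculation can be shown with a toy one-step example - this is an illustration, not the researchers' full analysis, and assumes (purely for the sake of the sketch) a fixed per-question accuracy and that you would stop after the next question:

```python
# Toy walk-or-answer calculation for a single step of the game.
# Attempting the next question wins next_prize with probability p,
# and otherwise drops you back to the last milestone.

def answer_expected_value(p, next_prize, milestone):
    """Expected winnings of attempting the next question."""
    return p * next_prize + (1 - p) * milestone

current = 32000  # current winnings (a milestone, in this illustration)
ev = answer_expected_value(p=0.70, next_prize=64000, milestone=32000)
print(ev > current)  # True: at 70% accuracy, attempting beats walking away
```

The real strategic problem is harder, since a correct answer opens up further questions (and further risk), so the true value of continuing is itself a recursive expected-value calculation over the remaining prize ladder.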

Overall the analysis is interesting, and it gives you some idea of just how effective simple methods can be if you have enough data.











