|Watson Wins Jeopardy! - Trick or Triumph|
|Written by Mike James|
|Friday, 18 February 2011|
IBM's Watson finished the recent contest between man and machine with $77,147, compared to $24,000 for Ken Jennings and $21,600 for Brad Rutter, another top Jeopardy! champion. This is amazing and there is plenty of talk of the "day of the machine". But wait! Watson doesn't think or understand anything. It's not even a question-answering machine - but an answer-questioning machine which is perhaps a whole lot simpler. So is it a triumph of machine over man or of publicity over fact?
IBM has pulled off a triumph of publicity, if not AI (artificial intelligence), in creating a machine that can beat seasoned players at the game of Jeopardy!. As well as the publicity from the show, IBM made campus appearances and generally generated enthusiasm among students. The result is that IBM now looks like a cool company to work for and has moved from grey and outdated to being up there with Google, Twitter and Facebook. It has also raised the public perception of AI and what computers can do beyond word processing and browsing the web. If you are a semantic engineer or an expert in machine learning then you can expect to be in more demand in the future.
For this at least IBM deserves thanks but is Watson AI or is it a trick?
How AI works
AI is a strange subject because by its nature it is doomed to be perceived as having failed. Consider just how it works. A human does something like play chess or answer questions on Jeopardy! and we immediately credit the behaviour as an example of intelligence. Intelligence is what humans and occasionally some animals are assumed to have without having to prove anything much apart from engaging in the activity.
Now compare this to a chess-playing program. At first it looks impressive, especially when it wins, but when you examine how it does the job you discover that it's an easy-to-follow algorithm. You can find out exactly how a chess program works: it searches the possible next moves and evaluates each one in terms of what might happen n moves on. Even the most sophisticated variations on the algorithm seem simple and crude. Even though such a program can beat a grandmaster, as IBM's previous AI stunt Deep Blue did, it just doesn't seem to be made of the same stuff as human intelligence.
If you ask what would be required of an AI program to impress you enough to be worth calling intelligent then what you end up demanding is a large slice of "mystery". Every time AI succeeds in reducing a human behaviour to an algorithm it immediately changes from intelligence to a machine procedure. With this in mind it is time to look at Watson.
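To see how little "mystery" is involved, the search-and-evaluate idea can be sketched in a few lines. This is only a minimal sketch of plain minimax over a toy game tree, not how Deep Blue or any real chess engine is written; the names `minimax`, `children` and `evaluate`, and the tree itself, are assumptions made for illustration.

```python
def children(state):
    """Legal next positions. In this toy tree, inner nodes are lists."""
    return state if isinstance(state, list) else []


def evaluate(state):
    """Static score of a position. In this toy tree, leaves are numbers."""
    return state if not isinstance(state, list) else 0


def minimax(state, depth, maximizing):
    """Best score reachable when looking `depth` moves ahead."""
    moves = children(state)
    if depth == 0 or not moves:
        return evaluate(state)
    scores = [minimax(m, depth - 1, not maximizing) for m in moves]
    # We pick our best move; the opponent is assumed to pick our worst.
    return max(scores) if maximizing else min(scores)


# Hypothetical two-move game: we choose a branch, the opponent chooses a leaf.
tree = [[3, 5], [2, 9]]
best = minimax(tree, depth=2, maximizing=True)
```

A real program differs mainly in scale: a far larger tree, pruning to avoid searching all of it, and a much more elaborate evaluation function.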
Watson - a statistical approach to AI
The reason why Watson is impressive is that to complete the task it has to bring together a range of separate AI techniques. It has to understand natural language well enough to process the question and formulate an answer. But first notice that Jeopardy! is the reverse of a standard quiz show. It provides the answers and the contestants have to formulate the questions. There are also elements of question selection and betting to be mastered but the main AI task is to formulate a question given the answer.
You might think that this task needs complete understanding at a very human level, but you would probably be wrong. The key idea in most of the really successful applications of AI in the last few years has been statistics. The statistical approach to AI may have produced many successes - Google Translate for example - but many regard it as being unsatisfying in the sense described earlier.
Suppose we have the very simple AI task of writing a program to guess the number you have just thought of - yes it's silly but illustrates the point. A valid AI approach would be to try to use subtle hints from your psychology and recent experiences to formulate a model of your cognitive functions and so work out your most likely numeric selection. The statistical approach would simply get you to play the game millions of times and work out statistically what number was most likely. The statistical approach to language understanding has only recently become possible because of the huge amount of language data that the web provides.
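The statistical version of the number-guessing game needs almost no code. The following is a minimal sketch under the assumption that earlier choices have been recorded in a list; the function name `most_likely_number` and the sample history are made up for illustration.

```python
from collections import Counter


def most_likely_number(past_choices):
    """Guess the number that was chosen most often in earlier games."""
    counts = Counter(past_choices)
    number, _count = counts.most_common(1)[0]
    return number


# Made-up record of numbers chosen in previous games.
history = [7, 3, 7, 4, 7, 3, 9]
guess = most_likely_number(history)
```

Nothing here models why people pick the numbers they do. The program only counts, which is exactly what makes the approach feel unsatisfying as an account of intelligence.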
What Watson does is to take the input question and use some syntax analysis on it, but not with the intention of understanding the question - just to split it into functional fragments. These are then used to discover what the question is by a statistical process of searching the knowledge base for something that has entries that correspond to the data in the question. When an entity is found that matches the features of the question it is considered as a possible answer. A set of heuristic confidence levels is computed and the final answer is constructed from the entity that has the best confidence level. The confidence level is also used to determine if Watson should buzz in or not.
For example suppose the question was:
Category: "Rap" Sheet
Clue: This archaic term for a mischievous or annoying child can also mean a rogue or scamp.
The clue is then processed to create fragments such as "mean" and "rogue or scamp". These are looked up in the knowledge base, and if an entry has "rogue or scamp", or "mean(s)" and "rogue or scamp", then that entry is a possible subject of the answer.
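The fragment lookup and the confidence-based decision to buzz can be illustrated with a toy version. This is a sketch of the general idea only, not Watson's actual DeepQA code: the three-entry knowledge base, the scoring rule (fraction of fragments found in an entry) and the `buzz_threshold` value are all assumptions invented for the example.

```python
# Made-up knowledge base: entry name -> descriptive text.
KNOWLEDGE_BASE = {
    "rapscallion": "archaic term for a mischievous child; can also mean a rogue or scamp",
    "toddler": "a young child who is just beginning to walk",
    "pirate": "a rogue who attacks and robs ships at sea",
}


def score_candidates(fragments, knowledge_base):
    """Confidence per entry = fraction of clue fragments found in its text."""
    scores = {}
    for entry, text in knowledge_base.items():
        hits = sum(1 for fragment in fragments if fragment in text)
        scores[entry] = hits / len(fragments)
    return scores


def best_answer(fragments, knowledge_base, buzz_threshold=0.5):
    """Return the top entry, its confidence, and whether to buzz in."""
    scores = score_candidates(fragments, knowledge_base)
    entry = max(scores, key=scores.get)
    confidence = scores[entry]
    return entry, confidence, confidence >= buzz_threshold


fragments = ["archaic term", "mischievous", "child", "rogue or scamp"]
entry, confidence, buzz = best_answer(fragments, KNOWLEDGE_BASE)
```

Here "rapscallion" matches every fragment, "toddler" matches only "child", and "pirate" matches none, so the program buzzes in without any notion of what a rogue or a child is.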
Lexical Answer Type
Notice that this sort of matching would be done on multiple fragments and syntax would be used to guide the sort of results that rank highest. It is all a question of working out what the entry is that the information is about - the LAT or Lexical Answer Type. In this case the LAT is "This archaic term" i.e. we are looking for a word. Knowing the LAT allows Watson to pick out the item in the matched record that is the answer. In the case of this example the entry has to be a word - Rapscallion in this case. Watson can then perform a simple transformation to get the question form of the answer "What is Rapscallion?".
Of course this misses out lots of the detail in what Watson is doing but it gives you a flavour of the overall approach. The categories that the questions fall into can be used to narrow down the LAT. The question also needs to be treated in different ways depending on its overall type. For example,
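The role of the LAT can be sketched as choosing which field of a matched record to return, followed by the trivial rewrite into question form. The record layout, the `LAT_TO_FIELD` mapping and the crude `find_lat` rule below are hypothetical and only stand in for whatever Watson really does.

```python
# Hypothetical matched record from the knowledge base.
record = {"word": "Rapscallion", "era": "archaic", "meaning": "rogue or scamp"}

# Hypothetical mapping from answer type to the record field that holds it.
LAT_TO_FIELD = {"term": "word", "period": "era"}


def find_lat(clue, known_types):
    """Crude LAT guess: a known type word in the clue's opening phrase."""
    opening = clue.lower().split(" for ")[0]  # e.g. "this archaic term"
    for answer_type in known_types:
        if answer_type in opening.split():
            return answer_type
    return None


def as_question(answer):
    """Jeopardy! responses must be phrased as questions."""
    return f"What is {answer}?"


clue = ("This archaic term for a mischievous or annoying child "
        "can also mean a rogue or scamp.")
lat = find_lat(clue, LAT_TO_FIELD)
question = as_question(record[LAT_TO_FIELD[lat]])
```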
Category: Diplomatic Relations
Clue: Of the four countries in the world that the United States does not have diplomatic relations with, the one that's farthest north.
In this case the Category isn't of any real use in working out the LAT - it isn't a diplomatic relation. The question itself reveals that the LAT is a set of countries but this is a tough thing to work out. The answer algorithm also has to be customised to find the one of the four items that satisfies the first part of the clue "the four countries in the world that the United States does not have diplomatic relations with" that also satisfies the second "farthest north". Even so you can see that with enough time and tweaking you can eventually construct an algorithm that works most of the time given a big enough sample of questions.
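Once the clue has been recognised as having two parts, answering it is a filter followed by a maximum. The sketch below assumes the knowledge base can already supply a diplomatic-relations flag and a latitude for each country; the `Country` type and the placeholder entries are invented, and no real country data is implied.

```python
from dataclasses import dataclass


@dataclass
class Country:
    name: str
    has_us_relations: bool
    latitude: float  # degrees north


def farthest_north_without_relations(countries):
    """First part of the clue filters the set; second part picks one item."""
    candidates = [c for c in countries if not c.has_us_relations]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.latitude).name


# Placeholder data, purely for illustration.
countries = [
    Country("Country A", False, 21.5),
    Country("Country B", False, 40.0),
    Country("Country C", True, 60.0),
    Country("Country D", False, 32.0),
]
answer = farthest_north_without_relations(countries)
```

The hard part is not this code but deciding, from the wording of the clue alone, that this is the shape of computation required.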
Overall Watson is a huge complex conglomerate of algorithms.
Don't misunderstand, this isn't a criticism - to be able to create such a system is a remarkable feat but it does tend to produce AI systems that are "brittle" and likely to fail in ways that make a human gasp.
For example, the Final Jeopardy! clue in the second round of the contest was:
"Its largest airport is named for a World War II hero; its second largest, for a World War II battle" in the category of "U.S. Cities,"
Watson answered "What is Toronto?"
The audience gasped as the answer was stupidly wrong. How could this happen?
Without more data we can't be sure, but Watson must not have used the category "U.S. Cities" to constrain the solution item to be a U.S. city. Statistically this would have been the correct thing to do, but the question didn't contain fragments that pinned it down to a U.S. city either. So the wrong candidate item was picked. Watson doesn't understand anything.
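The difference between ignoring the category and using it as a constraint is easy to show in miniature. This is not a claim about how Watson actually weighed categories; the candidate list, the type labels and the evidence scores are invented, and only the two city names come from the real clue (the expected response was Chicago).

```python
def pick(candidates, category_type=None):
    """candidates: list of (name, type, evidence_score) tuples.

    If a category type is given and any candidate matches it,
    only matching candidates are considered.
    """
    pool = candidates
    if category_type is not None:
        filtered = [c for c in candidates if c[1] == category_type]
        if filtered:
            pool = filtered
    return max(pool, key=lambda c: c[2])[0]


# Invented scores: the clue's fragments slightly favour the wrong city.
candidates = [
    ("Toronto", "canadian city", 0.30),
    ("Chicago", "us city", 0.28),
]

unconstrained = pick(candidates)
constrained = pick(candidates, category_type="us city")
```

With the category ignored the slightly higher evidence score wins, which is the kind of failure that looks absurd to a human who reads "U.S. Cities" first.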
A massively parallel machine
Even so, by the end of the third round Watson had achieved a convincing win and there must be a great future for him and his technology. To make this approach work fast enough the algorithms have to be massively parallelized. Fortunately, because so many of the subtasks don't interact, this is fairly easy. However, to get the time to answer down from 2 hours in the first version took a lot of hardware - a Linux-based cluster using up to 90 IBM Power 750 servers with 16 TB of RAM, occupying 10 racks and using 80 kW of power.
Watson may have beaten the humans but it took a lot of computing power - a total of 2880 Power 7 cores.
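Because scoring one candidate doesn't depend on scoring any other, the work divides up naturally. The following is a small sketch of that structure using Python's standard thread pool; the candidate texts and the scoring rule are made up, and a thread pool on one machine is only a stand-in for the cluster-scale distribution described above.

```python
from concurrent.futures import ThreadPoolExecutor


def score(candidate, fragments):
    """Count how many clue fragments appear in one candidate's text."""
    name, text = candidate
    return name, sum(1 for fragment in fragments if fragment in text)


def score_in_parallel(candidates, fragments, workers=4):
    """Score every candidate independently; no task needs another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda c: score(c, fragments), candidates))


# Made-up candidates and fragments.
candidates = [
    ("rapscallion", "archaic term that can also mean a rogue or scamp"),
    ("pirate", "a rogue who robs ships at sea"),
]
fragments = ["archaic term", "rogue or scamp"]
results = score_in_parallel(candidates, fragments)
```

In CPython, threads will not speed up pure computation like this, so a real system would spread the tasks over processes or machines; the point is only that the tasks can be handed out without any coordination between them.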
IBM has plans to use the DeepQA algorithms in commercial products but there are a number of issues that haven't been answered. The approach used by Watson is brittle and while it might not matter that it made a laughing stock of itself by seeming to "think" that Toronto was a US city, similar mistakes in other areas of application would be no laughing matter.
There is also the small issue that Jeopardy! has a format that suits the approach that Watson takes, or rather suits statistical AI. Being provided with the answer and having to find the question isn't the same as being asked a question and finding the answer. In Jeopardy! the clue is long and has lots of information that allows the algorithms to pin down an entry in a knowledge base corresponding to candidate items that can then be used to form a simple question with few words - like "What is Toronto?"
It often doesn't even matter that the form of the question isn't quite right for the answer given as the clue. Watson isn't a question-answering machine - it is an answer-questioning machine and it could well be that this is the easier of the two directions. In short, unless IBM can find applications that mimic Jeopardy!, Watson might well be a very expensive dead end.
Only time will tell if Watson is some sort of breakthrough to a new commercialization of AI, but for now we need to give it the credit for doing a very difficult task well and for revitalising AI in the mind of the public.
Well done Watson...
|Last Updated ( Friday, 04 March 2022 )|