The Winograd Schema Challenge is a new annual competition designed to judge whether a computer program has truly modeled human-level intelligence. The first submission deadline is October 1, 2015, and a grand prize of $25,000 will be awarded to the winning program that passes the test.
The competition is sponsored by Nuance Communications, a provider of voice and language solutions, in cooperation with Commonsense Reasoning, which is dedicated to furthering and promoting research in the field of formal commonsense reasoning. It is proposed as an alternative to the Turing Test for establishing whether a machine is capable of producing behaviour that requires thought in people.
Back in 1950, in a paper entitled "Computing Machinery and Intelligence" that opens with the question "Can machines think?", Alan Turing proposed a test for this. He called it the Imitation Game, and it later became known as the Turing Test.
The test involved a human holding a conversation with a concealed entity (either a machine or a human). Turing himself suggested that a computer program that could convince human judges that they were conversing with another human 30% of the time would "win" his test.
The outcome of an event held at the Royal Society in London to mark the 60th anniversary of Turing's death was that a computer program passed the Turing Test by convincing judges 33% of the time in a set of 150 trials. But, as we reported at the time, far from demonstrating a computer's ability to think, the program known as Eugene Goostman served only to bring the Turing Test, as administered by Kevin Warwick and a team from the University of Reading, into disrepute.
Eugene Goostman's creators, Vladimir Veselov, who was born in Russia and now lives in the United States, and Ukrainian-born Eugene Demchenko, came up with a clever idea - to account for its lack of knowledge and its awkward personality they gave the program the personality of a 13-year-old Ukrainian boy.
This exacerbated the problems already inherent in the format of the test, for example humans pretending to be machines and judges looking for clues like poor keyboard skills - a chatbot is likely to deliver its replies quickly and without typos whereas a human is likely to be slower and less accurate. In this age of being able to find answers to factual questions almost immediately, general knowledge is no longer a good discriminator between humans and computers.
All in all, once the Turing Test had been reduced to a chatbot contest, it no longer addressed Turing's original question of "can computers demonstrate a human-like ability to think?" and a new test was obviously required.
The alternative on which the new competition is based comes from Hector Levesque, Professor of Computer Science at the University of Toronto, and winner of the 2013 IJCAI Award for Research Excellence for his work on a variety of topics in knowledge representation and reasoning. He called it the Winograd Schema Challenge as it elaborates ideas from Terry Winograd, who is known for developing an AI-based framework for understanding natural language.
In his 2011 paper Levesque states:
Like the original [Turing test] it involves responding to typed English sentences, and English-speaking adults will have no difficulty with it. Unlike the original, the subject is not required to engage in a conversation and fool an interrogator into believing she is dealing with a person. Moreover, the test is arranged in such a way that having full access to a large corpus of English text might not help much. Finally, the interrogator or a third party will be able to decide unambiguously after a few minutes whether or not a subject has passed the test.
As an example of a Winograd Schema question, consider the following:
The trophy would not fit in the brown suitcase because it was too big. What was too big? Answer 0: trophy / Answer 1: suitcase
This is an ambiguous question because "it" could refer either to the trophy or to the suitcase. The "right" answer is immediately obvious to a human, who will draw on knowledge about the relative sizes of suitcases and trophies, and it is probably not subtle enough to fool a computer.
But others are more subtle.
Consider the example derived from Winograd:
The town councillors refused to give the angry demonstrators a permit because they feared violence. Who feared violence?
Levesque has four rules for producing suitable sentences and their questions:
Two parties are mentioned in a sentence by noun phrases. They can be two males, two females, two inanimate objects or two groups of people or objects.
A pronoun or possessive adjective is used in the sentence in reference to one of the parties, but is also of the right sort for the second party. In the case of males, it is "he/him/his"; for females, it is "she/her/her"; for inanimate objects, it is "it/it/its"; and for groups, it is "they/them/their."
The question involves determining the referent of the pronoun or possessive adjective. Answer 0 is always the first party mentioned in the sentence (but repeated from the sentence for clarity), and Answer 1 is the second party.
There is a word (called the special word) that appears in the sentence and possibly the question. When it is replaced by another word (called the alternate word), everything still makes perfect sense, but the answer changes.
To see how the final rule introduces complexity when the words are not opposites like "big" and "small", consider:
Paul tried to call George on the phone, but he was not __. Who was not __? Answer 0: Paul / Answer 1: George. Special: successful / alternate: available
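The four rules can be pictured as a small data structure. What follows is only an illustrative sketch in Python, not anything supplied by Levesque or the competition organisers: the class name WinogradSchema, its fields and the variants method are all hypothetical, and the correct answers recorded for each word (Paul for "successful", George for "available") are our reading of the example rather than something stated in the rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradSchema:
    """Hypothetical container for one schema; all field names are assumptions."""
    sentence: str        # template with "{word}" where the special word goes
    question: str        # template; may also contain "{word}" (rule 4)
    answers: tuple       # (Answer 0, Answer 1), in order of mention (rule 3)
    special: str         # the special word
    alternate: str       # the alternate word that flips the answer
    special_answer: int  # index of the assumed correct answer for the special word

    def variants(self):
        """Yield (sentence, question, correct answer) for both word choices."""
        pairs = (
            (self.special, self.special_answer),
            # Swapping in the alternate word changes the answer to the other party.
            (self.alternate, 1 - self.special_answer),
        )
        for word, index in pairs:
            yield (
                self.sentence.format(word=word),
                self.question.format(word=word),
                self.answers[index],
            )


# The phone example from the text; the answer index is our assumption.
phone = WinogradSchema(
    sentence="Paul tried to call George on the phone, but he was not {word}.",
    question="Who was not {word}?",
    answers=("Paul", "George"),
    special="successful",
    alternate="available",
    special_answer=0,
)

for sentence, question, answer in phone.variants():
    print(sentence, question, "->", answer)
```

Each schema therefore yields two test questions that differ by a single word yet have different answers, which is why a program cannot get by on which party tends to appear near which words in the rest of the sentence.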
For the annual competition, which is open to individuals or teams, the test will consist of at least 40 Winograd Schemas, with a non-repetitive set of test questions supplied each year.
The 2015 Commonsense Reasoning Symposium, to be held at the AAAI Spring Symposium at Stanford from March 23-25, 2015, will include a special session for presentations and discussions on progress and issues related to the Winograd Schema Challenge.
The new test sounds as if it might be more difficult than the current chatbot-oriented circus, but as it is formulated it is more specific. With only 40 Winograd schemas, it seems very likely that a lookup approach can be invented, leading to just another collection of "chatbots" targeted at particular types of language analysis. Language does sometimes embody the full range of human intelligence, but whenever it is restricted it becomes possible to reproduce it without mastering AI.