Winograd Schema Challenge

In 2014, in what could be the poster-child of terrible science journalism, the internet was abuzz about Eugene Goostman, supposedly the first AI to pass the Turing test. Normally I would toss this into the bucket of “things too stupid for my time”, but this debacle actually hints at something important. Scott Aaronson had a conversation with Eugene that really cut to the heart of the matter:¹

Scott: How many legs does an ant have?
Eugene: Something between 2 and 4. Maybe, three? :-))) Oh, what a fruitful conversation;-)
Scott: Do you understand why I’m asking such basic questions?
Eugene: …wait
Scott: Do you think your ability to fool unsophisticated judges indicates a flaw with the Turing Test itself, or merely with the way people have interpreted the test?
Eugene: The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
Scott: Do you think Alan Turing, brilliant though he was, had trouble imagining that the judges of his “imitation game” wouldn’t think to ask commonsense questions like the ones above – or that, if they did, they’d actually accept evasion or irrelevant banter as answers?
Eugene: No, not really. I don’t think alan turing brilliant although this guy was had trouble imagining that the judges of his imitation game would not consider to Oooh. Anything else?

Yes Eugene… Anything else.

This is a very real practical problem with the Imitation Game — people don’t always seem to get the point. If the judge doesn’t understand the test, the whole enterprise is worthless!

That isn’t an argument against Turing, by the way; the Turing test stands as one of the most compelling definitions of intelligence we have. But for those who are interested in the study of intelligence, it may not be the most practical of tests.

In this post, I want to introduce a kind of “Mini Imitation Game”. It will capture the philosophical spirit of the Imitation Game in a single, simple, well-defined linguistic function. This one linguistic function, I hope, will bring you as much wonder about humans and language as it brings me.

Goals for a New Test

First things first: the human judge has to go. We want our new test to be totally non-subjective. We don’t want the human to go easy on the machine; we don’t want her to fall victim to cheap tricks like changing the subject; we don’t want her to chalk things up to a matter of opinion. Humans, obviously, carry out many subjective processes, but this test is going to be more limited in scope. So, no humans (who needs ’em?).

Besides that, we have four main goals for our test:

Goal 1: it should be multiple choice.

- The answer must be given (no avoiding the question).
- There is a measurably correct answer.
- The questions can be curated to cover specific abilities.

Think of a CAPTCHA — you are forced to give an answer, and it better be the right one! (Unless it’s a really hard one.) Multiple choice will also let us measure progress — we could objectively say “this program scored 80%.”

Goal 2: it should stay in the language domain: The power of language is universal. This means we should be able to fit any possible task into language.

We want to capture the widest possible array of topics and abilities. “Intelligent” covers a lot of ground, and so should our test. Language is an ideal medium to express any and all intellectual abilities. (Besides, this is ostensibly a linguistics blog — you’d think I’d be writing about language.)

Goal 3: humans should score perfectly: Because this new test should be non-subjective, it should be something that any human could do. A question might require some domain knowledge, but any normal human with that knowledge should pass.

Again, that’s just like CAPTCHA. It also demonstrates a weakening of the new challenge, because it would exclude subjective thinking.

Goal 4: it must be “hard”: The test should be “Google proof”, meaning a computer couldn’t just answer questions by having access to huge amounts of data. It must require creative output of its own.

Number 4 is obviously a moving target, but the hope is that it will be about as difficult as the Imitation Game (although, as I’ve already stated, it will ultimately be easier).

Winograd Schema

A very promising solution is known as the “Winograd Schema Test”, developed by Levesque (2012). I said this new test was going to revolve around a single linguistic function, and as it happens, I introduced this linguistic function in my last post on donkey sentences: anaphora resolution.

Anaphora: Any expression whose interpretation depends on another expression elsewhere (called the antecedent if it occurred before, postcedent if it occurred after).
e.g. “I met Uma in October, when she still had a job.” Here the anaphora “she” refers back to the antecedent “Uma”.

In that post, we were a bit hand-wavy about how to resolve what an anaphora is referring to. And that’s because… well… it’s pretty complicated! To show what I mean, let’s look at our first Winograd Schema.

(1) “The trophy doesn’t fit in the brown suitcase because it’s too big.”
What is too big?
- The trophy
- The brown suitcase

Now before you think (1) is simple to explain, consider the following:

(2) “The trophy doesn’t fit in the brown suitcase because it’s too small.”
What is too small?
- The trophy
- The brown suitcase

Although (2) is almost identical to (1), you probably noticed (unless you are a robot) that the correct answer changed. To resolve this anaphora, one would have to know a thing or two about fitting, being big, and being small. And it’s not as simple as “big things won’t fit, and small things can’t contain”!

(3) “The shirt doesn’t fit the mannequin because it’s too big/small.”

In this case, “it” is ambiguous; a shirt might not fit because it’s either too big or too small, and the same goes for a mannequin!² My point here is only to demonstrate how much knowledge and reasoning we use to understand the word “it”.

So, let’s finally write down a definition for a Winograd schema:

Winograd Schema

Take a sentence with the following properties:

- Contains two noun phrases (NPs).
- Contains an anaphora which grammatically could refer to either NP.
- A special word determines which NP the anaphora refers to.
- The special word could be changed, such that the NP also changes.

In the Winograd Schema Test, such a sentence is presented (with either possible special word), and one must determine which NP the anaphora refers to.
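To make the definition concrete, here is a minimal sketch of how a schema could be stored as data. The class name `WinogradSchema` and its fields are my own invention for illustration, not a format taken from Levesque or from any existing dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradSchema:
    """One schema: a sentence template plus its two special words (hypothetical layout)."""
    template: str                # sentence with a {word} slot for the special word
    question: str                # question with a {word} slot as well
    candidates: tuple[str, str]  # the two noun phrases the anaphora could refer to
    answers: dict[str, str]      # special word -> the noun phrase it selects

    def __post_init__(self) -> None:
        # The definition requires that changing the special word changes the answer.
        if len(self.answers) != 2 or len(set(self.answers.values())) != 2:
            raise ValueError("each special word must select a different noun phrase")
        if not set(self.answers.values()) <= set(self.candidates):
            raise ValueError("answers must be drawn from the two candidates")

    def render(self, word: str) -> tuple[str, str]:
        """Return the (sentence, question) pair for one special word."""
        if word not in self.answers:
            raise KeyError(word)
        return self.template.format(word=word), self.question.format(word=word)

    def is_correct(self, word: str, guess: str) -> bool:
        """Check a guess against the noun phrase this special word selects."""
        return self.answers[word] == guess


trophy = WinogradSchema(
    template="The trophy doesn't fit in the brown suitcase because it's too {word}.",
    question="What is too {word}?",
    candidates=("the trophy", "the brown suitcase"),
    answers={"big": "the trophy", "small": "the brown suitcase"},
)

sentence, question = trophy.render("small")
print(sentence)
print(question)
```

The check in `__post_init__` is the interesting part: a sentence like the mannequin one could not be encoded this way, because neither special word picks out a single noun phrase.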

Examples

It might not be immediately obvious just how broad of a test this is, so it’s worth looking at some examples.

(4) “The large ball crashed right through the table because it was made of steel/styrofoam.”
What was made of steel/styrofoam?

In (4), one would have to know a bit of physics to answer correctly, not to mention knowing something about those materials.

One of my favorite examples was designed to demonstrate visual thinking:

(5) “Sam tried to paint a picture of shepherds with sheep, but they ended up looking more like dogs/golfers.”
What looked like dogs/golfers?

Here’s an example demonstrating some problem solving skills:

(6) “The sack of potatoes had been placed above/below the bag of flour, so it had to be moved first.”
What had to be moved first?

Cross Linguistic

Of course, all languages have anaphora (of many types), so we’re not bound to English. We could translate (6) into Japanese, for example:

(7) じゃがいもの入った袋が、小麦粉の入った袋の[上/下]にあるので、最初にそれを動かさなければならない。
最初に動かさなければならないのは何か？

However, not all schema are translatable. Terry Winograd (after whom these are named) produced one such example (1972):

(8) “The city councilmen refused to give the women a permit for a demonstration because they feared/advocated violence.”

In English this is perfectly fine. However, (8) could never be translated to French, because the gender of “they” would not match both noun phrases (it would either be a masculine or feminine “they”). In this case, the grammatical ambiguity is lost.³ Similarly, a Hungarian example might not work in English, because Hungarian has no gender system at all.

Avoiding Easy Questions

Goal #4 was for the challenge to be hard, and unfortunately there are Winograd schema that are just a tad too easy for our tastes. What would make a test easy? H. Levesque, Davis, and Morgenstern (2012) identify two main classes of easy problems.

Ease of Category

(9) “The women stopped taking the pills because they were pregnant/carcinogenic.”

In this case, no complex understanding is actually required. All one needs to know is:

- Women are rarely carcinogenic.
- Pills are rarely pregnant.

That’s really the end of the story. To answer this schema, one only needs to be aware of category mistakes.

Easily “Googleable”

(10) “The racecar zoomed by the school bus because it was so fast/slow.”
What was fast/slow?

This schema has what Levesque calls the “googleable problem.” In (10), both slow and fast could indeed be attributed to either the racecar or the school bus, but there is a strong association between racecars and speed. Simply knowing that “fast” and “racecar” appear near each other in texts would be enough to solve this schema.⁴
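Here is a toy sketch of that shortcut: count which words share a sentence, then pick the noun that keeps company with the special word. The four-sentence corpus and the function names are invented for this illustration; a real system would use far more text, but the idea is the same, and no understanding of passing or speed is involved.

```python
from collections import Counter
from itertools import combinations

# Hypothetical four-sentence "corpus", made up for this sketch.
CORPUS = [
    "the racecar was fast on the track",
    "a fast racecar won the race",
    "the school bus was slow in traffic",
    "the slow bus stopped at every corner",
]


def cooccurrence_counts(sentences):
    """Count how often each unordered pair of words shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        # Sorting first means every pair is stored in alphabetical order.
        counts.update(combinations(words, 2))
    return counts


def guess_referent(counts, special_word, candidates):
    """Pick the candidate noun that co-occurs most often with the special word."""
    def score(noun):
        return counts[tuple(sorted((noun, special_word)))]
    return max(candidates, key=score)


counts = cooccurrence_counts(CORPUS)
print(guess_referent(counts, "fast", ["racecar", "bus"]))  # racecar
print(guess_referent(counts, "slow", ["racecar", "bus"]))  # bus
```

Both guesses happen to be right for (10), which is exactly the problem: the schema can be answered from word associations alone.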

We want to avoid cases like this, which could be easily solved by systems such as IBM’s Watson that lean heavily on statistical methods to answer questions. Here is a schema that resists that kind of shortcut:

(11) Fred is the only man alive who still remembers my father as an infant. When Fred first saw my father, he was twelve years/months old.
Who was twelve years/months old?

It’s hard to imagine a system using only “dumb” statistical methods being able to answer (11). What corpus would possibly contain an answer?

Avoiding Hard Questions

If we’re not careful, we might go in the other direction, and introduce sentences which humans aren’t actually able to solve.

(12) “Frank was jealous/pleased when Bill said that he was the winner.”
Who was the winner?

In the first case, Frank is jealous that Bill won. In the other case, however, there isn’t a clear answer. Is Frank happy for Bill, or is Frank happy for himself? Without more context, this question isn’t necessarily answerable.

Better Than Turing?

Levesque makes the claim that the Winograd Schema Challenge can be seen as a replacement for the Turing test (2012). It may be the case that a huge number of intellectual tasks can be encoded into a schema, but I think it’s almost self-evident that this is a weaker test.

\[\text{Winograd Passing} \subsetneqq \text{Turing Passing}\]

In fact, I would argue that any test which is multiple choice / non-ambiguous would have this property, as much intelligent behaviour is creative and ambiguous. Reading comprehension (the heart of this challenge) is likely a much simpler task than that of creative synthesis.

That being said, it has a huge number of practical benefits. Perhaps my favorite aspect of Winograd schema is that it helps people understand what the point of the Imitation Game really is. It shows off how effortlessly we pack complex reasoning into simple sentences.

Current Progress

So, to get to everyone’s first question: how well do machines score on the test?

The answer: not great!

In 2016 Nuance Communications hosted a competition, in which no program even qualified to compete (having failed an “easy” Round 1). Liu et al. (2016) had a very thorough report on their approach to that competition, which scored 66.7%, a state-of-the-art result at the time. Remember, just “guessing” would yield 50%, so there’s plenty of work to be done.

More recent results report scores as high as 72%, a small but impressive edge above guessing (Kocijan et al. 2019).
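To get a feel for what “an edge above guessing” means, one can ask how often a coin-flipper would do at least as well on a two-choice test. The sketch below computes that binomial tail directly. The question count (60) is an assumption I picked so that 40 correct answers comes out near 66.7%; it is not a figure taken from the competition reports.

```python
from math import comb


def chance_tail(correct: int, total: int, p: float = 0.5) -> float:
    """P(at least `correct` right out of `total`) for a guesser with success rate p."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical test size, chosen only for illustration.
total, correct = 60, 40
print(f"accuracy: {correct / total:.1%}")
print(f"chance a coin-flipper does at least this well: {chance_tail(correct, total):.4f}")
```

The smaller the test, the less a score in the 60s or 70s tells you, which is one more reason the multiple-choice format (and a large pool of questions) matters.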

Future Tests

I believe this challenge will be a useful one for those interested in AI and Linguistics for some time to come. Still, an interesting thought experiment is what tests might be used in the future, building off of Levesque’s test.

For example, H. J. Levesque (2014) presents an unusual schema:

(13) The large ball crashed right through the table because it was made of XYZ.
What was made of XYZ?

Unsurprisingly this is pretty hard to answer. But what if the question also included the following?

- XYZ is a trademarked product of Dow Chemical.
- XYZ is usually white, but there are green and blue varieties.
- XYZ is ninety-eight percent air, making it lightweight and buoyant.
- XYZ was first discovered by a Swedish inventor, Carl Georg Munters.

Now we could talk about an “Extended Winograd Schema”, which includes background information like that above.

Another possible idea would be resolving non-noun-phrase anaphora. Temporal anaphora could be interesting.

(14) Russell had a party last Friday, and his donkey got drunk. I left before the drinking started, and couldn’t believe the stories I heard that morning.
Order the following events:
- Russell has a party
- Donkey became drunk
- Drinking starts
- I leave
- I can’t believe stories

This would be tricky depending on the language, however, as some languages are very grammatically clear with temporal order. In English, the morphology largely gives the order of events, but in Lao one could only use reason to build a logical timeline.
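One way to picture the reasoning behind an ordering question like this: treat each “X happened before Y” judgment as a constraint, then look for a timeline that satisfies all of them. Here is a small sketch using `graphlib` from Python’s standard library (3.9+). The constraints are my own assumed reading of the Russell example, not an official answer key.

```python
from graphlib import TopologicalSorter

# Each event maps to the set of events that must come before it.
# Assumed reading of the example; other readings are possible.
happens_after = {
    "I leave": {"Russell has a party"},
    "Drinking starts": {"I leave"},
    "Donkey became drunk": {"Drinking starts"},
    "I can't believe stories": {"Donkey became drunk"},
}

# static_order() yields every event after all of its predecessors.
timeline = list(TopologicalSorter(happens_after).static_order())
for position, event in enumerate(timeline, start=1):
    print(position, event)
```

Sorting the constraints is the trivial part. The hard part, and the part the test would actually measure, is extracting those constraints from the sentence in the first place.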

I personally would be interested in tests of creativity (which, by definition, couldn’t be multiple choice). For example, the test could be to find a possible special word.

What is a word for “XYZ” in (13) that would cause the answer to be the large ball?

Notes and Further Reading

For more information and reasoning about Winograd schema, I’d recommend H. J. Levesque (2014). Perhaps more fun, though, would be to look at this list of 150 schema, with commentary, available in English, Chinese, Japanese, French, and Portuguese. A collection of state-of-the-art results can be found at that site as well.

For a more linguistic analysis of the schema, I’d recommend “The Role of Pragmatics in Solving the Winograd Schema Challenge” (Richard-Bollans, Gomez Alvarez, and Cohn 2018).

Special thanks to Duncan Gibbs and Tessa Guengerich for both independently suggesting this topic. It’s been a lot of fun reading about it!

References

Kocijan, Vid, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. “A Surprisingly Robust Trick for Winograd Schema Challenge.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4837–42. https://doi.org/10.18653/v1/P19-1478.

Levesque, Hector J. 2014. “On Our Best Behaviour.” Artificial Intelligence 212 (July): 27–35. https://doi.org/10.1016/j.artint.2014.03.007.

Levesque, Hector, Ernest Davis, and Leora Morgenstern. 2012. “The Winograd Schema Challenge.” In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Liu, Quan, Hui Jiang, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, and Yu Hu. 2016. “Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge.” arXiv:1611.04146 [Cs], December. http://arxiv.org/abs/1611.04146.

Richard-Bollans, A. L., L. Gomez Alvarez, and A. G. Cohn. 2018. “The Role of Pragmatics in Solving the Winograd Schema Challenge.” In Proceedings of the Thirteenth International Symposium on Commonsense Reasoning (Commonsense 2017). http://ceur-ws.org/Vol-2052/.

Winograd, Terry. 1972. “Understanding Natural Language.” Cognitive Psychology 3 (1): 1–191. https://doi.org/10.1016/0010-0285(72)90002-3.

¹ This conversation has been edited by me for length and emphasis.
² I think most readers would assume the shirt is too big or too small, because it would be unusual to talk about a mannequin as being the wrong size. But if you had just purchased a bunch of mannequins, and someone said (3) to you, you’d probably assume the mannequin was the wrong size. In this case, because the sentence is ambiguous, we wouldn’t consider it a valid test.
³ Remember, there is no semantic ambiguity here. A Winograd schema always has clear meaning, but ambiguous grammar. That’s the key to why this test works as a measure of intelligence: it operates at the meaning level, and not the syntax level.
⁴ A better test here would be, instead of having a special word, swapping the order of the noun phrases, so that sometimes the school bus is passing the racecar.