The Turing Test is a measure of artificial intelligence or, perhaps more accurately, of linguistic mimicry. The nature of the test is simple: a judge sits at a computer and chats with an unseen partner (as you would on any instant messenger app) for five minutes. At the end of the five minutes, the judge decides whether their conversational partner was a human or a computer. The bar for a “passing” grade was set by the creator of the test, Alan Turing: a machine that fooled 30% of human judges into thinking it was human would pass. In 1950, Turing predicted that the feat would be achieved by the year 2000.

Spoiler alert: It wasn’t.

Also, Turing committed suicide in 1954. #depressing

However, a few weeks ago, for the first time in history, it was announced that the Turing Test had been passed – on the 60th anniversary of Turing’s death, no less! The media went wild! And by wild, I mean it did some seriously bad science reporting. The Independent claimed a supercomputer was responsible and immediately delved into the darkness of future possibilities of cybercrime now that our machine overlords have achieved sentience. On the flip side, WIRED gave the event a grade of “F” and basically said the test doesn’t prove anything at all about artificial intelligence, so everyone should just shut up about the whole thing.

So first of all, a supercomputer is generally one with terabytes of storage, gigahertz of processing, loads of RAM, and what-have-you. Even though our tiny smartphones have more computing power than NASA’s Apollo missions did, modern supercomputers can still take up whole rooms. The program that passed the Turing Test is just a piece of software. It’s an app, like Candy Crush or Microsoft Word, except a lot better. Second, we’re still a really long way off from sentience or “strong AI”. At best, this is “weak AI”, so we really don’t have to worry about HAL taking command of the ship just yet.

As for the WIRED article, I don’t think this should be dismissed with a failing grade. Yes, 30% would be a failing grade in a classroom, but 30% is the benchmark Turing himself set, so in this case 30% counts as passing. And that level of achievement had never been met before! So this event is huge progress for the AI community, regardless of where you set the bar! However, I do think the bar should be set at 50% – or, rather, at a rate not statistically significantly different from 50%. The judges are given two choices: human or computer. That choice is basically a coin flip. If human judges can’t pick the correct answer any better than a coin flip, then that, to me, would be a true “passing” grade for the Turing Test.
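To make the coin-flip idea concrete, here’s a minimal sketch (not the contest’s actual methodology) of how you could check whether judges do better than chance, using a plain exact binomial test. The 10-of-30 count below is purely illustrative, not the real contest data.

```python
# Sketch of the "coin-flip" criterion: treat each judge's verdict as a
# Bernoulli trial and ask whether the proportion of judges fooled differs
# significantly from 50%. Counts below are illustrative only.
from math import comb

def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: total probability of all outcomes at
    least as unlikely as the observed count under chance-level guessing."""
    probs = [comb(trials, k) * p**k * (1 - p)**(trials - k)
             for k in range(trials + 1)]
    observed = probs[successes]
    return min(1.0, sum(pr for pr in probs if pr <= observed + 1e-12))

fooled, judges = 10, 30   # hypothetical: 10 of 30 judges guessed "human"
p_value = binomial_two_sided_p(fooled, judges)
print(f"{fooled}/{judges} fooled -> p = {p_value:.3f}")
# A large p-value means the judges' verdicts look like coin flips (a "true
# pass" by the stricter 50% standard); a small one means the fooled rate is
# reliably different from 50%, i.e. the bot is still detectable.
```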

So what’s next? Well, first of all, “Eugene Goostman”, the character portrayed by this program, is a 13-year-old boy. This allows the programmers to get away with a lot. As the saying goes, “boys will be boys.” If he comes off as rude or aloof, a judge might dismiss that as early teenage angst. If he can’t spell or write very well, a judge might dismiss that as a lack of education – let’s face it, he’s not even in high school yet. And if he can’t carry on an intelligent conversation about the Iraq War or Game of Thrones, well, 13-year-old boys usually aren’t very interested in politics, and they probably aren’t allowed to be watching what is essentially HBO softcore. To create true artificial “intelligence”, we need to up the ante and require intelligent, adult conversation.

But that’s just the personality and the repertoire of stuff to talk about. What about the important bit: the linguistics? In real spoken conversation, each person’s turn is relatively short and is usually broken up into even shorter intonation units. Live conversation requires a lot of cognitive processing, and our short-term memory can generally only handle 7±2 units of information at a time. In writing, sentences can be much longer, because we take our time thinking about what we’re writing, and we can go back and edit it and reread it and think about it some more. Online chat, Twitter, texting, and the like tend to pattern more like speech than like writing, despite the fact that they are written. Take a look at this excerpt from the WIRED article, from a brief online conversation they had with Eugene Goostman:

WIRED: Where are you from?
Goostman: A big Ukrainian city called Odessa on the shores of the Black Sea
WIRED: Oh, I’m from the Ukraine. Have you ever been there?
Goostman: ukraine? I’ve never there. But I do suspect that these crappy robots from the Great Robots Cabal will try to defeat this nice place too.

WIRED’s interviewer, a human, uses short utterances. None of their sentences is more than 7 words long, and in fact they average only 4.6 words each. Meanwhile, Goostman’s sentences average 9.5 words and reach up to 21 words long! The humans at WIRED also include a discourse marker: “oh”. Discourse markers are little words like oh, mhm, yeah, okay, y’know, uh-huh, and hmm that keep a conversation going. A certain amount of polite laughter, such as lol or haha, can help make you seem more human too, especially online, when other cues such as head nods or eye gaze aren’t available.
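Just for fun, here’s a rough sketch of how you could measure those surface features automatically: average sentence length in words, plus a count of discourse markers and typed laughter. The tokenizer and the marker list are simplifications I made up for illustration, not any standard metric.

```python
# Rough sketch: average words per sentence and discourse-marker counts
# for the WIRED/Goostman exchange quoted above.
import re

DISCOURSE_MARKERS = {"oh", "mhm", "yeah", "okay", "y'know", "uh-huh", "hmm",
                     "lol", "haha"}

def utterance_stats(sentences):
    tokens = [re.findall(r"[\w'-]+", s.lower()) for s in sentences]
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    markers = sum(1 for t in tokens for w in t if w in DISCOURSE_MARKERS)
    return round(avg_len, 1), markers

wired = ["Where are you from?", "Oh, I'm from the Ukraine.",
         "Have you ever been there?"]
goostman = [
    "A big Ukrainian city called Odessa on the shores of the Black Sea",
    "ukraine?", "I've never there.",
    "But I do suspect that these crappy robots from the Great Robots Cabal "
    "will try to defeat this nice place too.",
]
print("WIRED:   ", utterance_stats(wired))     # short sentences, one marker ("oh")
print("Goostman:", utterance_stats(goostman))  # much longer sentences, no markers
```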

To sum up: I think this is a huge success, and it’s fantastic that we were able to honor the 60th anniversary of Turing’s death with this achievement. But we’re still a very long way from strong AI, and I think linguists need to dive in and help. Eugene needs to grow up into an adult, use shorter sentences and more discourse markers, and we need to raise the bar to a 50% pass rate.
