As reported in a peer-reviewed paper just published by researchers at Loyola Marymount University, a chatbot avatar based on our Discourse chatbot engine recently passed a modified Turing Test run in the virtual world of Second Life. The highlights from the paper are:
- The first natural language Turing Test conducted in a 3D virtual environment.
- 78% of participants incorrectly judged a chatbot to be human.
- Agency judgments were due to the quality of the AI engine and psychological factors.
- Passage of the Turing Test involves a complex process of human–computer interaction.
The test itself was based on one we set out in another paper back in 2009. There we described how most current Turing Tests were very unbalanced: both the judges and the "hidden humans" knew they were taking part in a test, and often the only party that didn't was the chatbot! We argued that virtual environments such as Second Life offered an ideal level playing field: everyone, human or chatbot, looks like an avatar and can have access to the same information about the environment, and, crucially, most people in the world expect the other avatars they talk to to be other humans. So all the bot has to do is avoid giving away the fact that it is a bot, rather than try to convince an already sceptical judge that it is a human!
In the modified Turing Test that we proposed - a "covert Turing Test" - subjects (aka judges) went into a virtual world supposedly for another purpose and, whilst there, had a conversation with a bot (or a human). Only after the event were they asked whether they thought the avatar they had been talking to was a bot or a human.
This is exactly the test that Loyola Marymount set up. You can read the full details and results in the paper, but the essence was that subjects were invited into Second Life to comment on a new shop layout and design, and were told that there would be a shop assistant there to help them. 50% of the time the assistant was a chatbot, and 50% of the time a human. A nice touch in the double-blind design was that the humans playing the assistant also didn't know that they were taking part in a Turing Test, so they couldn't act "over-human", or even "more robotic" - as often happens in physical Turing Tests.
The very simple avatar used by both the humans and the chatbot.
The bottom line was that 78% of the test subjects who interacted with the chatbot-driven avatar did not recognise it as a chatbot - well in excess of the 30% target set in the full Turing Test, or even a 50% random choice.
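To get a feel for how far 78% sits from the 50% a coin-flipping judge would achieve, a one-sided binomial tail probability is a natural check. The paper's exact sample size isn't quoted here, so the `n` below is a purely hypothetical stand-in for illustration, not the study's real figure.

```python
# Hedged sketch: how unlikely is a 78% "judged human" rate if subjects
# were really just guessing at 50/50? Uses only the standard library.
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k successes."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 50                      # HYPOTHETICAL number of chatbot-condition subjects
k = round(0.78 * n)         # 78% of them judged the chatbot to be human
p_value = binom_tail(n, k)  # probability of a result this extreme under guessing

print(f"{k}/{n} judged human; one-sided p versus 50% chance = {p_value:.2e}")
```

Even with a modest assumed sample, the tail probability is tiny, which is why a 78% rate is so far beyond both the 30% Turing-Test target and random choice.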
So what does this mean?
Well, first, we don't see the Turing Test as a measure of "intelligence" or "sentience" - it is purely a reasonable test of how good a chatbot is at mimicking human conversation. And since having "human-like" conversations with computers could be useful for a whole range of applications (training, health-care, etc.), it is a reasonable test of the state of the art.
The problem has been that most current Turing Test implementations are VERY artificial - I've been a "hidden human" and I know how unnatural the conversations are on both sides. With the covert Turing Test we were trying to create a more "fit for purpose" test: can a human tell the difference between a human and a chatbot in a practical, realistic setting, with no preconceptions? That is the sort of test we need if we want to deploy chatbots in real-world applications. Yes, it is a lower bar than a full-on Turing Test (which itself is almost taking on the artificiality of a linguistic chess game), but for us it is a very valid waymark on the route to the full Turing Test.
PS. You may be able to get a free copy of the article by following the PDF link next to the article's entry on the journal's contents page, rather than from the article page itself!