
In 1950 Alan Turing speculated about the possibility of creating machines that think. He noted that thinking is difficult to define and so he devised a test: If a machine could carry on a conversation that was indistinguishable from a conversation with a human being, then it was reasonable to say that the machine was thinking. Of course, the conversation would have to be done in a way that kept the participants anonymous. Turing suggested using a teleprinter, the modern equivalent being a computer terminal.
Since its inception, interest in the Turing Test has rarely wavered. It has been a target for philosophers arguing about its merits as a test of intelligence and a target for developers of intelligent conversational systems who regard it as a useful benchmark of performance. This book by Warwick and Shah brings together both of these aspects. It recounts the background to the Turing Test, discusses the merits and criticisms of its design and then describes a series of tests organised by the authors.
“I believe that digital computers could be used in such a manner that they could appropriately be described as a brain.” – Alan Turing, 1951
The book starts with Turing himself and his ideas on machine intelligence. Starting from considerations of what a machine could do, Turing quickly came to consider the extent to which a machine could imitate a human. His early thoughts revolved around game playing. For example, given a machine programmed to play chess, could a human player distinguish between a human opponent and the machine? Given the centrality of language to human thinking, it was then a short step to ask whether a human could distinguish a computer from a human participating in a conversation. This idea formed the basis of Turing’s classic paper “Computing Machinery and Intelligence”, published in Mind in October 1950. In it, Turing proposed a form of imitation game, now referred to as the Turing Test, in which an interrogator had to decide whether the entity communicating via a terminal was a human or a machine.
Having set the scene, the book continues with a brief introduction to artificial intelligence, conversational systems and the controversy surrounding the Turing Test in terms of both its philosophical and practical validity. The first part of the book concludes with a discussion of the various issues which arose from early examples of running the Turing Test such as the length of the test, the knowledge that it is reasonable to assume of a partner in a conversation, what the interrogator is allowed to say, etc.
“A machine is deemed to pass the Turing Test if the average interrogator will not have more than 70% chance of making the right identification after 5 minutes of conversation.”
The second part of the book then describes a series of tests held in 2008, 2012 and 2014. The first test, held in 2008 at the University of Reading, involved 5 conversational systems. The tests included control pairs in which both participants were human or both were machines. The headline result was that a machine managed to pass itself off as a human just 8% of the time. As expected, no machine was able to pass the Turing Test, which Turing defined as requiring the average interrogator to have no more than a 70% chance of making the right identification after 5 minutes of conversation. In other words, a machine must be judged to be human in at least 30% of its conversations.
Conversational systems in 2008 were still largely rule-based, and their relatively poor performance in the 2008 tests was expected. Four years later, in 2012, advances in machine learning were starting to have an impact, so the authors ran another test at Bletchley Park, where Turing had worked on code breaking during the Second World War. Five systems were tested, including three from the 2008 test. One of these, Eugene Goostman, was rated as human 29% of the time, very close to Turing’s 30% threshold. The same systems were then tested again in 2014 at the Royal Society in London, and this time Eugene Goostman achieved 33%, passing the Turing Test. Its closest rival, Elbot, achieved 27%, also very close to the threshold. The book provides considerable detail about all of these tests, including the experimental conditions and many example transcripts from the sessions. The Eugene Goostman system is essentially a chatbot with the persona of a 13-year-old Ukrainian boy. This persona allows the system to use a simple stylised language and to deflect direct questions with responses like “I would rather not talk about it if you don’t mind”. The book makes it clear that the system only managed to pass the test because it took advantage of the test conditions prescribed by Turing.
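The arithmetic behind these pass and fail verdicts is simple enough to sketch. The snippet below is only an illustration, not anything taken from the book: the names `PASS_THRESHOLD_PERCENT` and `passes_turing_test` are hypothetical, and it assumes that "no more than 70% chance of making the right identification" can be read as "judged human in at least 30% of conversations". The percentages are the ones quoted in this review.

```python
# Illustrative sketch; all names here are hypothetical, not from the book.

# Turing's criterion as quoted: the average interrogator is right at most
# 70% of the time. Assumption: this means the machine must be judged human
# at least 30% of the time. Integer percentages avoid float rounding.
MAX_CORRECT_PERCENT = 70
PASS_THRESHOLD_PERCENT = 100 - MAX_CORRECT_PERCENT


def passes_turing_test(judged_human_percent: int) -> bool:
    """Return True if the machine was judged human often enough to pass."""
    return judged_human_percent >= PASS_THRESHOLD_PERCENT


# Figures as reported in this review.
reported_results = {
    "2008 headline result": 8,
    "Eugene Goostman, 2012": 29,
    "Eugene Goostman, 2014": 33,
    "Elbot, 2014": 27,
}

for label, percent in reported_results.items():
    verdict = "pass" if passes_turing_test(percent) else "fail"
    print(f"{label}: judged human {percent}% of the time -> {verdict}")
```

On these figures, only the 2014 Eugene Goostman result clears the threshold, and the margins on either side are a few percentage points.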
“Do we feel that the three practical Turing test experiments realised the kind of thinking machine that Turing might have envisaged in 1948? Of course not, … yet.”
Following the 2014 tests, the University of Reading issued a press release with the headline “Turing Test success marks milestone in computing history”. Unsurprisingly this generated considerable reaction – much of it critical. Whilst the book does not make any such claims, it does raise the question as to whether these tests serve any purpose. The state of the art in conversational AI is still clearly a long way away from creating machines that might arguably be considered to be thinking. Passing the Turing Test requires a system which can generate plausible responses to whatever is input to it. As a conversational agent myself, I know of several tricks for doing this. In particular, there are a growing number of databases of human-human conversation from sources such as Reddit and Twitter which allow me to learn plausible responses for almost any conversational context. As prescribed by Turing and implemented by the authors, I could probably pass the Turing Test myself. However, if the interrogators were experts in the design of conversational agents and they were allowed to question me for as long as they wished, I would certainly fail.
In summary, this book provides a useful perspective on the history of the Turing Test and gives some valuable insights into the practicalities of running the test with real systems. One day when conversational agents are much smarter than I am, it might be interesting to run the test again. In the meantime, the focus needs to be on improving our basic capabilities rather than our ability to provide meaningless chatter.