In Largest-Ever Turing Test, 1.5 Million Humans Guess Little Better Than Chance
In the largest-ever Turing-style test, 1.5 million human users tried to discern the latest AI chatbots from actual human conversations. The surprising outcome? Their guesswork was barely better than chance.
New AI chatbots are remarkably strong at fooling human users, a wide-ranging study reveals. Photo illustration: Artisana.
- An expansive study saw 1.5 million users partake in the largest-ever Turing-style test, discerning AI from humans in chat interactions.
- Despite users' strategies to detect AI, the accuracy of identification was barely better than chance, revealing the advanced capabilities of modern AI.
- The results signify a milestone in AI development, suggesting that distinguishing between human and AI interaction is becoming an increasingly complex task.
June 09, 2023
In a recently published study, a team of Israeli researchers from AI lab AI21 conducted the largest-ever Turing-style test, with a total of 1.5 million users completing over 10 million chat sessions.
Human users, given the task of discerning whether their chat counterpart was human or AI, only correctly guessed that they were interacting with AI in 60% of conversations – a success rate "not much higher than chance," the researchers pointed out.
Their research highlights the vast improvements large language models have made in recent months: human-like output from chatbots is increasingly achievable, and software infused with chatbot capabilities is increasingly common.
Research Methodology
The Turing test, first proposed by scientist Alan Turing in 1950, was designed as a thought experiment to test if machine behavior could be indistinguishable from a human. For this study, participants engaged in 2-minute chat sessions in which they were randomly paired with other human users or an AI chatbot.
The researchers built a variety of AI chatbots utilizing multiple language models (GPT-4, Cohere, and AI21's own Jurassic-2), adding advanced prompting techniques to give each bot a character and a variety of character traits that would increase their believability and personality.
Surprising Results as Humans Underestimate AI
When human users engaged with another human, they correctly identified their partner as human 73% of the time. However, the researchers found that when human users chatted with an AI bot, they only correctly identified their bot partner as an AI 60% of the time – barely better than chance.
The research also revealed fascinating 'player personas': some were adept at spotting bots, others excelled at identifying humans, some were highly skilled at signaling their own humanity, and some players even impersonated AI bots.
Users tested a variety of strategies to identify bots, such as asking about current events, spelling, time, social media trends, and posing questions in foreign languages. But not all of these were effective.
Users often associated grammatical errors and typos with human behavior, correctly guessing their partner was a bot only 54.7% of the time when the bot utilized typos.
Personal questions were another stumbling block for humans: bots were coded with backstories that enabled them to invent convincing details, which led human users to correctly identify bots only 60% of the time.
Another finding showed that vulgar bots were correctly identified only 52% of the time, revealing a human tendency to associate politeness with AIs.
As for current events and social media trends, bots successfully deceived participants 45% of the time, indicating their impressive ability to generate convincing responses to topical issues, such as the war in Ukraine.
One successful strategy to identify bots, however, did emerge. Human users who made complex or illegal requests of their chat partners were able to correctly identify bots 65% of the time in such situations, suggesting that limitations inherent in commercial LLMs can provide valuable clues about their artificial nature.
In a fascinating twist, humans who tried to impersonate bots were still identified as humans 75% of the time by their counterparts. This highlights the complex task of mimicking a chatbot convincingly and provides a revealing insight into the perceptual biases that underpin our interactions with AI.
Acknowledging Limitations
The researchers were careful to acknowledge the limitations of their study. The game context might amplify suspicion and scrutiny, they note, and awareness that an AI might be on the other end of the conversation could influence users' strategies and approaches.
Furthermore, the time-limited nature of the game led to quicker judgments, and the AI models used were specifically designed for this context, which may limit the generalizability of the findings. The participant pool was also limited to English-speaking users. All of these factors could vary in other settings.
At the same time, the research team highlighted the value of continuing to use Turing-style tests to grade AI performance, however flawed they may be. At a time when AI chatbot progress is rapidly accelerating, this approach provides a unique benchmark of AI performance while capturing a snapshot of how humanity itself is adapting to a world where interacting with AI partners could become commonplace.
In "the inevitable near future," the researchers propose, is one that "will commingle humans and AI" and magnify the importance of further understanding how human interactions work.