In Largest-Ever Turing Test, 1.5 Million Humans Guess Little Better Than Chance
In the largest-ever Turing-style test, 1.5 million human users tried to discern the latest AI chatbots from actual human conversations. The surprising outcome? Their guesswork was barely better than chance.
New AI chatbots are remarkably strong at fooling human users, a wide-ranging study reveals. Photo illustration: Artisana.
- An expansive study saw 1.5 million users partake in the largest-ever Turing-style test, discerning AI from humans in chat interactions.
- Despite users' strategies to detect AI, the accuracy of identification was barely better than chance, revealing the advanced capabilities of modern AI.
- The results signify a milestone in AI development, suggesting that distinguishing between human and AI interaction is becoming an increasingly complex task.
June 09, 2023
In a recently published study, a team of Israeli researchers from AI lab AI21 conducted the largest-ever Turing-style test, with a total of 1.5 million users completing over 10 million chat sessions.
Human users, given the task of discerning whether their chat counterpart was human or AI, only correctly guessed that they were interacting with AI in 60% of conversations – a success rate "not much higher than chance," the researchers pointed out.
Their research highlights the vast improvements large language models have made in recent months: human-like output from chatbots is increasingly achievable, and software infused with chatbot capabilities is increasingly common.
Research Methodology
The Turing test, first proposed by scientist Alan Turing in 1950, was designed as a thought experiment to test if machine behavior could be indistinguishable from a human. For this study, participants engaged in 2-minute chat sessions in which they were randomly paired with other human users or an AI chatbot.
The researchers built a variety of AI chatbots utilizing multiple language models (GPT-4, Cohere, and AI21's own Jurassic-2), adding advanced prompting techniques to give each bot a character and a variety of character traits that would increase their believability and personality.
Surprising Results as Humans Underestimate AI
When human users engaged with another human, they correctly identified their partner as human 73% of the time. However, the researchers found that when human users chatted with an AI bot, they only correctly identified their bot partner as an AI 60% of the time – barely better than chance.
The research also revealed fascinating 'player personas': some were adept at spotting bots, others excelled at identifying humans, some were highly skilled at signaling their own humanity, and some players even impersonated AI bots.
Users tested a variety of strategies to identify bots, such as asking about current events, spelling, time, social media trends, and posing questions in foreign languages. But not all of these were effective.
Users often associated grammatical errors and typos with human behavior, correctly guessing their partner was a bot only 54.7% of the time when the bot utilized typos.
Personal questions were another stumbling block for humans: bots were coded with backstories that enabled them to invent convincing details, which led human users to correctly identify bots only 60% of the time.
Another finding showed that vulgar bots were correctly identified only 52% of the time, revealing a human tendency to associate politeness with AIs.
As for current events and social media trends, bots successfully deceived participants 45% of the time, indicating their impressive ability to generate convincing responses to topical issues, such as the war in Ukraine.
One successful strategy to identify bots, however, did emerge. Human users who made complex or illegal requests of their chat partners were able to correctly identify bots 65% of the time in such situations, suggesting that limitations inherent in commercial LLMs can provide valuable clues about their artificial nature.
In a fascinating twist, humans who tried to impersonate bots were still identified as humans 75% of the time by their counterparts. This highlights the complex task of mimicking a chatbot convincingly and provides a revealing insight into the perceptual biases that underpin our interactions with AI.
Acknowledging Limitations
The researchers were careful to acknowledge the limitations of their study. The game context might amplify suspicion and scrutiny, they note, and awareness that an AI might be on the other end of the conversation could influence users' strategies and approaches.
Furthermore, the time-limited nature of the game led to quicker judgments, and the AI models used were specifically designed for this context, which may limit the generalizability of the findings. The participant pool was also limited to English-speaking users. All of these factors could vary in other settings.
At the same time, the research team highlighted the value of continuing to use Turing-style tests to grade AI performance, however flawed they may be. At a time when AI chatbot progress is rapidly accelerating, this approach provides a unique benchmark of AI performance while capturing a snapshot of how humanity itself is adapting to a world where interacting with AI partners could become commonplace.
In "the inevitable near future," the researchers propose, is one that "will commingle humans and AI" and magnify the importance of further understanding how human interactions work.