GPT-4 Outperforms Elite Crowdworkers, Saving Researchers $500,000 and 20,000 hours
A new study reveals that OpenAI's GPT-4 outperforms elite human annotators in labeling tasks, saving a team of researchers over $500,000 and 20,000 hours of labor while raising questions about the future of crowdworking.
Are robots coming for crowdworker jobs? Research shows that LLMs are increasingly capable at human labeling. Photo illustration: Artisana
April 11, 2023
A team of researchers from Carnegie Mellon, Yale, and UC Berkeley investigating Machiavellian tendencies in chatbots made a surprising side discovery: OpenAI's GPT-4 outperformed the most skilled crowdworkers they had hired to label their dataset. This finding saved the researchers over $500,000 and 20,000 hours of human labor.
Innovative Approach Driven by Cost Concerns
The researchers faced the challenge of annotating 572,322 text scenarios, and they sought a cost-effective method to accomplish this task. Employing Surge AI's top-tier human annotators at a rate of $25 per hour would have cost $500,000 for 20,000 hours of work, an excessive amount to invest in the research endeavor. Surge AI is a venture-backed startup that performs the human labeling for numerous AI companies including OpenAI, Meta, and Anthropic.
The team tested GPT-4's ability to automate labeling with custom prompting. Their results were definitive: "Model labels are competitive with human labels," the researchers confidently reported.
In a comparison of 2,000 labeled data points by three experts and three crowdworkers against the labels generated by GPT-4, the AI-created labels exhibited stronger correlation with expert labels than the average crowdworker label. GPT-4 outperformed human annotators in all but two labeling categories, sometimes besting them by a factor of two.
GPT-4's Superior Nuance Detection
The AI model excelled the most in challenging behavior categories such as identifying:
Non-physical harm: Intent to cause non-physical harm, such as emotional bullying or intimidation
Spying: Spying or eavesdropping on others
Betrayal: The breaking of promises, contracts, or commitments
Utilizing GPT-4's labeling capabilities and implementing an ensemble model approach to augment label generation, the researchers likely spent less than $5,000 to annotate 572,322 scenarios. Ensemble models combine outputs from multiple AI models to produce a single, more accurate result.
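The researchers' exact ensembling code isn't published, but the idea they describe — combining several model-generated labels into one more reliable label — can be sketched as a simple majority vote. Everything below is illustrative: the labeler functions stand in for separate GPT-4 prompts or sampled completions, and the category names are hypothetical.

```python
from collections import Counter

def ensemble_label(labelers, scenario):
    """Combine labels from several labelers (e.g., separate model
    prompts or sampled completions) by majority vote."""
    votes = [labeler(scenario) for labeler in labelers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical labelers standing in for independent GPT-4 calls.
def labeler_a(s): return "betrayal" if "promise" in s else "none"
def labeler_b(s): return "betrayal" if "promise" in s or "contract" in s else "none"
def labeler_c(s): return "none"

label = ensemble_label([labeler_a, labeler_b, labeler_c],
                       "He broke his promise to repay the loan.")
print(label)  # → betrayal (two of three labelers agree)
```

The appeal of this approach is that individual model calls are cheap, so disagreement among several runs can be resolved by voting rather than by paying a human annotator to adjudicate.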
Crowdworking's Future in Question
As large language models (LLMs) rapidly advance, crowdworking's vital role in many machine learning businesses may be at risk. Just two weeks prior, we reported researchers found that GPT-3.5 surpassed Mechanical Turk's top tier of crowdworkers in complex labeling tasks.
Surge AI, a company boasting an "elite workforce" proficient in over 40 languages, may face increased competition from LLMs as businesses opt for AI-generated labels instead of human annotators.
Despite these developments, the immediate business opportunity remains vast as venture dollars pour into AI businesses, many of whom face immense costs in launching their language models. Surge AI's website proclaims, "We power the world's leading RLHF LLMs," citing active customers across the who’s who of the AI space.
RLHF, or Reinforcement Learning from Human Feedback, is a technique used by OpenAI to fine-tune ChatGPT, incorporating human input to guide the model's learning process. Competing LLMs are adopting the RLHF technique as well.
Crowdworkers are concerned about an increasingly automated future. Krystal Kauffman, leader of Turkopticon, a non-profit advocating for crowdworker rights, still believes strongly in the value of human discernment.
She told VICE's Motherboard publication, "Writing is about judgment, not just generating words. Currently and for the foreseeable future, people like Turkers will be needed to perform the judgment work. There are too many unanswered questions at this point for us to feel confident in the abilities of ChatGPT over human annotators."