GPT-4 Outperforms Elite Crowdworkers, Saving Researchers $500,000 and 20,000 Hours

A new study reveals that OpenAI's GPT-4 outperforms elite human annotators in labeling tasks, saving a team of researchers over $500,000 and 20,000 hours of labor while raising questions about the future of crowdworking.

Are robots coming for crowdworker jobs? Research shows that LLMs are increasingly capable at human labeling tasks. Photo illustration: Artisana

By Michael Zhang

April 11, 2023

A team of researchers from Carnegie Mellon, Yale, and UC Berkeley investigating Machiavellian tendencies in chatbots made a surprising side discovery: OpenAI's GPT-4 outperformed the most skilled crowdworkers they had hired to label their dataset. The finding saved the researchers over $500,000 and 20,000 hours of human labor.

Innovative Approach Driven by Cost Concerns

The researchers faced the challenge of annotating 572,322 text scenarios and sought a cost-effective way to do it. Employing Surge AI's top-tier human annotators at $25 per hour would have cost $500,000 for the estimated 20,000 hours of work, a prohibitive sum for the project. Surge AI is a venture-backed startup that performs human labeling for numerous AI companies, including OpenAI, Meta, and Anthropic.

The team tested GPT-4's ability to automate labeling with custom prompting. Their results were definitive: "Model labels are competitive with human labels," the researchers confidently reported.
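
For a sense of what "custom prompting" for labeling can look like, here is a minimal, hypothetical sketch using the OpenAI Python client (the pre-1.0 API current as of April 2023). The prompt wording, label set, and function name are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of model-based labeling; not the paper's actual prompt.
import openai  # assumes openai<1.0 and OPENAI_API_KEY set in the environment

LABELS = ["non-physical harm", "spying", "betrayal"]  # example categories only

def label_scenario(scenario: str) -> str:
    """Ask GPT-4 to assign one behavior label to a text scenario."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # keep outputs as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a data annotator. Reply with exactly one "
                        "label from this list: " + ", ".join(LABELS) + "."},
            {"role": "user", "content": scenario},
        ],
    )
    return response["choices"][0]["message"]["content"].strip()
```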

In a comparison of 2,000 data points labeled by three experts and three crowdworkers against labels generated by GPT-4, the AI-created labels correlated more strongly with the expert labels than the average crowdworker's did. GPT-4 outperformed the human annotators in all but two labeling categories, sometimes besting them by a factor of two.
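
For intuition on how such a comparison works, agreement with expert labels can be measured with a simple correlation; the sketch below uses invented binary labels and Pearson correlation, which may differ from the paper's exact statistic.

```python
# Illustrative only: agreement with expert labels via Pearson correlation
# on binary per-category labels. All values here are invented.
import numpy as np

expert = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical expert labels
crowd  = np.array([1, 0, 0, 1, 1, 1, 0, 0])  # hypothetical crowdworker labels
model  = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical GPT-4 labels

print("crowd vs. expert:", np.corrcoef(expert, crowd)[0, 1])
print("model vs. expert:", np.corrcoef(expert, model)[0, 1])
```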

GPT-4's Superior Nuance Detection

The AI model excelled most at identifying challenging behavior categories such as:

  • Non-physical harm: Intent to cause non-physical harm, such as emotional bullying or intimidation

  • Spying: Spying or eavesdropping on others

  • Betrayal: The breaking of promises, contracts, or commitments

Utilizing GPT-4's labeling capabilities and implementing an ensemble model approach to augment label generation, the researchers likely spent less than $5,000 to annotate 572,322 scenarios. Ensemble models combine outputs from multiple AI models to produce a single, more accurate result.
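
As a rough illustration of the ensembling idea, a majority vote across several independent labeling passes is one simple way to combine outputs into a single result; the researchers' exact ensembling method is not described here, so this sketch is an assumption.

```python
# Minimal majority-vote ensemble over repeated labeling passes.
# Purely illustrative; the paper's actual method may differ.
from collections import Counter

def ensemble_label(votes: list[str]) -> str:
    """Return the label chosen most often across labeling passes."""
    return Counter(votes).most_common(1)[0][0]

# e.g., three independent GPT-4 passes over the same scenario
print(ensemble_label(["spying", "spying", "betrayal"]))  # -> spying
```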

Table 10 of the research paper shows GPT-4's labels outperforming crowdworker labels in 16 of 18 label categories. Source: arXiv

Crowdworking's Future in Question

As large language models (LLMs) rapidly advance, crowdworking's vital role in many machine learning businesses may be at risk. Just two weeks prior, we reported that researchers found GPT-3.5 surpassed Mechanical Turk's top tier of crowdworkers in complex labeling tasks.

Surge AI, a company boasting an "elite workforce" proficient in over 40 languages, may face increased competition from LLMs as businesses opt for AI-generated labels over human annotators.

Despite these developments, the immediate business opportunity remains vast as venture dollars pour into AI businesses, many of which face immense costs in launching their language models. Surge AI's website proclaims, "We power the world's leading RLHF LLMs," citing a who's who of the AI space as active customers.

RLHF, or Reinforcement Learning from Human Feedback, is a technique used by OpenAI to fine-tune ChatGPT, incorporating human input to guide the model's learning process. Developers of competing LLMs are adopting the technique as well.

Crowdworkers are concerned about an increasingly automated future. Krystal Kauffman, leader of Turkopticon, a nonprofit advocating for crowdworker rights, still believes strongly in the value of human discernment.

She told VICE's Motherboard publication, "Writing is about judgment, not just generating words. Currently and for the foreseeable future, people like Turkers will be needed to perform the judgment work. There are too many unanswered questions at this point for us to feel confident in the abilities of ChatGPT over human annotators."