"Next to Impossible": OpenAI's ChatGPT Faces GDPR Compliance Woes
Amid a temporary ban in Italy, OpenAI's ChatGPT confronts the difficult task of achieving GDPR compliance, with European legal experts deeming the prospect of adhering to the regulations "next to impossible."
OpenAI faces numerous challenges in complying with the EU's GDPR guidelines. Photo illustration: Artisana
- Italy’s temporary ban on OpenAI's ChatGPT leaves OpenAI with a tight deadline to comply with the Italian regulator's requests.
- However, European legal experts predict OpenAI's compliance with GDPR regulations could be "next to impossible."
- The ChatGPT ban highlights broader concerns about data collection practices and the need for AI companies to prioritize data privacy, as regulations rapidly develop around AI technology.
April 20, 2023
OpenAI's ChatGPT, after its temporary ban by Italy, now has less than two weeks to implement corrective measures. However, European legal experts predict it may be "next to impossible" for OpenAI to comply with Italy's regulations and the broader GDPR requirements. Failure to comply may result in severe consequences, from financial penalties to an outright ban of ChatGPT.
AI Model Construction Under Scrutiny
At the heart of the matter is OpenAI's methodology for building their AI models. AI models require vast quantities of data, much of which is publicly scraped and collected without user consent. OpenAI's GPT-2 model utilized 40 GB of text, while GPT-3 used 570 GB. OpenAI has refused to disclose the data used for GPT-4, frustrating researchers.
Italy's data regulator banned ChatGPT on the grounds that it breached GDPR regulations, stating there "appears to be no legal basis underpinning the massive collection and processing of personal data" used to train the algorithms. Italy's decision sparked similar investigations in France, Germany, Ireland, and Canada, prompting the EU's Data Protection Board to establish a task force for coordination and enforcement regarding ChatGPT.
Corrective Measures Demanded by Italian Data Regulator
OpenAI has been asked to implement several corrective measures, including:
- Obtaining consent from individuals to scrape their data, or proving "legitimate interest" in data collection
- Explaining to users how ChatGPT utilizes their data
- Allowing users to correct inaccuracies about them produced by the chatbot
- Enabling users to request data erasure
- Offering users the option to revoke consent for ChatGPT to use their data
Experts consider OpenAI's data scraping the most contentious compliance issue. OpenAI is unlikely to be able to prove consent for the data used to train its AI models, and the "legitimate interest" test poses its own challenge, requiring companies to offer rigorous reasons to justify using or retaining data without consent. The EU data regulator cites scenarios such as fraud prevention, network security, and crime prevention as valid grounds.
Margaret Mitchell, an AI researcher and ethics lead at Hugging Face, asserts that "OpenAI is going to find it near-impossible to identify individuals' data and remove it from its models." Mitchell previously served as Google's AI ethics co-lead.
Messy Data Collection is an AI Industry-Wide Problem
Historically, AI companies have viewed data collection as a means to an end, often neglecting accuracy and labeling. To gather the massive amounts of data needed to train their models, AI companies purchase bulk data from providers, use indiscriminate scrapers, and depend on contractors for basic filtering and error checking.
The Washington Post reported that many technology companies remain unaware of the contents of their training datasets. Even Google's heavily filtered Colossal Clean Crawled Corpus (C4) dataset, used for training various AI models, was found to contain content from white supremacist site Stormfront and unregulated online forum 4chan.
Google researcher Nithya Sambasivan concluded in a study that data practices are "messy, protracted, and opaque." In the end, Sambasivan noted these challenges arise because "everyone wants to do the model work, not the data work."