
Apple Just Got Caught Training AI on YouTube Videos Without Consent

Apple is the latest in a long line of generative AI developers — a list almost as old as the industry itself — to have been caught scraping copyrighted content from social media to train its artificial intelligence systems.

According to a new report from Proof News, Apple allegedly used a dataset containing captions from 173,536 YouTube videos to train its AI. Apple isn’t alone in this breach: despite YouTube’s rules specifically prohibiting the use of this data without permission, other AI heavyweights have also been caught using it, including Anthropic, Nvidia, and Salesforce.

The dataset, known as YouTube Subtitles, contains video transcripts from more than 48,000 YouTube channels, from Khan Academy, MIT, and Harvard to the Wall Street Journal, NPR, and the BBC. Even transcripts from late-night shows like “The Late Show With Stephen Colbert,” “Last Week Tonight with John Oliver,” and “Jimmy Kimmel Live” are part of the YouTube Subtitles database. Videos from YouTube influencers like Marques Brownlee and MrBeast, as well as from a number of conspiracy theorists, were also scraped without permission.

The dataset itself, which was compiled by the nonprofit startup EleutherAI, does not contain any video files, though it does include a number of translations into other languages, including Japanese, German, and Arabic. YouTube Subtitles is part of a larger EleutherAI dataset called the Pile, which pulled its data not only from YouTube but also from sources such as the European Parliament archives and Wikipedia.

Bloomberg, Anthropic, and Databricks have all reportedly trained models on the Pile, though their statements downplay the role of YouTube data in it. “The Pile includes a very small subset of YouTube captions,” Anthropic spokesperson Jennifer Martinez said in a statement to Proof News. “YouTube’s terms of service cover direct use of its platform, which is separate from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we must refer you to The Pile’s authors.”

Beyond the technicalities, AI startups appropriating content from the open internet has been a problem since ChatGPT launched. Stability AI and Midjourney are currently facing lawsuits from content creators who accuse them of ripping off their copyrighted works without permission. Google, which operates YouTube, was hit with a class-action lawsuit last July and another in September, which the company said would “deal a serious blow not only to Google’s services, but to the very idea of generative AI.”

Moreover, these same AI companies have a hard time disclosing where they get their training data. In a March 2024 interview with Joanna Stern of the Wall Street Journal, OpenAI CTO Mira Murati repeatedly stumbled when asked whether her company used videos from YouTube, Facebook, and other social media platforms to train its models. “I’m just not going to get into the details of the data that was used,” Murati said.

Last July, Microsoft AI CEO Mustafa Suleyman argued that an ethereal “social contract” means that everything on the web is fair game.

“I think since the 1990s, the social contract that applies to content that’s already on the open web is that it’s fair use,” Suleyman told CNBC. “Anyone can copy it, recreate it, reproduce it. It’s freeware, if you will, that’s the understanding.”