Tech companies including Apple have been caught using YouTube data to train AI models

Apple, Nvidia, Anthropic, and Salesforce have all been caught using YouTube data to build their AI models.

An investigation conducted by Proof News and co-published with Wired found that YouTube caption data was extracted from the video-sharing platform without permission and used to train AI models. The material taken was subtitle text, not video frames.

The data has been used to train large language models (LLMs), such as those behind ChatGPT, which raises the issue of tech companies stealing YouTube data to train models.

YouTube has expressly stated that using videos to train AI is a violation of the platform’s terms of service. But it’s widely acknowledged that YouTube is a goldmine of data for generative AI as the race for text-to-video models heats up.

About 180,000 YouTube videos were found in the dataset used by Apple et al. The data was compiled by a non-profit organization and is called The Pile. It contains not only YouTube data, but also Wikipedia articles, books, and Enron emails.

“The Pile includes a very small subset of YouTube captions,” Jennifer Martinez, a spokesperson for Anthropic, tells Proof News.

“YouTube’s Terms of Service cover direct use of its platform, which is separate from use of The Pile dataset. For potential violations of YouTube’s Terms of Service, we refer you to the authors of The Pile.”

Apple, Nvidia and others have not commented. Neither has YouTube.

Nobody wants to talk about training data

After an early period of experimentation, tech companies have become reluctant to say where the training data behind their generative AI models comes from.

With OpenAI’s Sora video generator on the horizon, the company’s CTO, Mira Murati, has repeatedly declined to say what data the high-profile app was trained on.

“I’m not going to go into the details of what data was used, but it was publicly available or licensed data,” she told The Wall Street Journal in March.

Google CEO Sundar Pichai told The Verge that using the platform’s video content, including subtitles, to train AI would constitute a violation of its terms of use.

“We have terms and conditions, and we expect people to abide by them when you build a product, so that’s how I felt,” Pichai said.


Image credits: Header photo licensed via Depositphotos.