
AI models fed AI-generated data quickly spew nonsense

Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.

“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong”, he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI.

The researchers began by using an LLM to create Wikipedia-like entries, then trained new iterations of the model on text produced by its predecessor. As the AI-generated information — known as synthetic data — polluted the training set, the model’s outputs became gibberish. The ninth iteration of the model completed a Wikipedia-style article about English church towers with a treatise on the many colours of jackrabbit tails (see ‘AI gibberish’).

More subtly, the study, published in Nature1 on 24 July, showed that even before complete collapse, learning from AI-derived texts caused models to forget the information mentioned least frequently in their data sets as their outputs became more homogeneous.

This is a concern when it comes to making AI models that represent all groups fairly, because low-probability events often relate to marginalized groups, says study co-author Ilia Shumailov, who worked on the project while at the University of Oxford, UK.

“This is a fantastic paper,” says Julia Kempe, a computer scientist at New York University in New York City. Until now, many technology firms have improved their models by feeding them larger and larger amounts of data. But as human-produced content runs out, they are hoping to use synthetic data to keep improving. The study — a version of which first appeared on the arXiv preprint server in May 2023 — has spurred the AI community to try to find solutions to the problem, she says. “It’s been a call to arms.”

You are what you eat

Language models work by building up associations between tokens — words or word parts — in huge swathes of text, often scraped from the Internet. They generate text by spitting out the statistically most probable next word, based on these learned patterns.
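
As a rough illustration — not code from the study — the short Python sketch below builds a toy bigram model: it counts which token follows which in a tiny made-up corpus, then generates text by always emitting the most frequent observed continuation. Real LLMs learn far richer patterns over subword tokens, but the generate-the-likely-next-token loop is the same in spirit.

```python
from collections import Counter, defaultdict

# Toy corpus; real models learn from huge swathes of scraped text.
corpus = ("the church tower is tall and the church bell rings "
          "when the tower clock strikes").split()

# Count, for each token, which tokens follow it and how often.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def generate(start, length=8):
    """Emit the most frequent observed continuation at each step."""
    token, output = start, [start]
    for _ in range(length):
        if not following[token]:
            break
        token = following[token].most_common(1)[0][0]
        output.append(token)
    return " ".join(output)

print(generate("the"))  # e.g. "the church tower is tall and the church tower"
```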

AI gibberish

The study authors trained their large language model on Wikipedia articles and trained successive generations of the model on the text produced by the previous version. Prompted to follow on from a paragraph of text from the Wikipedia entry on Grade I listed buildings in Somerset, the models output the following text. The first output from the model (generation 0) contains some errors, but the ninth generation spews complete gibberish.

Model generation 0

Revival architecture such as St. John’s Cathedral in London. The earliest surviving example of Perpendicular Revival architecture is found in the 18th @-@ century Church of Our Lady of Guernsey, which dates from the late 19th century. There are two types of perpendicular churches: those.

Model generation 9

architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-

To demonstrate model collapse, the researchers took a pre-trained LLM and fine-tuned it on a data set based on Wikipedia entries. They then asked the resulting model to generate its own Wikipedia-style articles. To train the next generation of the model, they started with the same pre-trained LLM, but fine-tuned it on the articles created by its predecessor. They judged the performance of each model by giving it an opening paragraph and asking it to predict the next few sentences, then comparing the output to that of the model trained on real data. The team expected to see errors crop up, says Shumaylov, but were surprised to see “things go wrong very quickly”.
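
The outline below is a toy stand-in for that generational setup, not the authors’ fine-tuning pipeline: a simple bigram ‘model’ is fitted on real text, each generation samples text from its predecessor, and the next generation is fitted only on that sample. The file name and the starting token are placeholders for whatever plain-text corpus you supply.

```python
import random
from collections import Counter, defaultdict

def fit(text):
    """Fit a bigram table: for each token, the counts of tokens that follow it."""
    tokens = text.split()
    table = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        table[a][b] += 1
    return table

def sample(table, start, length=2000):
    """Generate text by drawing each next token in proportion to its count."""
    token, out = start, [start]
    for _ in range(length):
        choices = table.get(token)
        if not choices:
            break
        token = random.choices(list(choices), weights=list(choices.values()))[0]
        out.append(token)
    return " ".join(out)

real_text = open("wiki_sample.txt").read()   # placeholder: any plain-text corpus
model = fit(real_text)
for generation in range(10):
    synthetic = sample(model, start="the")   # the predecessor's output...
    model = fit(synthetic)                   # ...becomes the next training set
    print(f"generation {generation}: {len(model)} distinct tokens survive")
```

Even this crude stand-in tends to lose vocabulary from one generation to the next, because tokens that the predecessor never happened to emit can never reappear.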

Collapse happens because each model necessarily samples only from the data it is trained on. This means that words that were infrequent in the original data are less likely to be reproduced, and the probability of common ones being regurgitated is boosted. Complete collapse eventually occurs because each model learns not from reality, but from the previous model’s prediction of reality, with errors getting amplified in each iteration. “Over time, those errors end up stacking up on top of each other, to the point where the model basically only learns errors and nothing else,” says Shumailov.
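
A minimal numerical sketch — again, not the paper’s analysis — makes that narrowing concrete: start from a Zipf-like word-frequency distribution, let each ‘generation’ re-estimate frequencies from a finite sample drawn from the previous estimate, and watch rare words disappear while the most common word takes over.

```python
import random
from collections import Counter

random.seed(0)
vocab = list(range(1000))
# Zipf-like starting frequencies: a few common "words", a long tail of rare ones.
weights = [1.0 / (rank + 1) for rank in range(len(vocab))]

for generation in range(10):
    sample = random.choices(vocab, weights=weights, k=5000)
    counts = Counter(sample)
    surviving = len(counts)
    top_share = counts.most_common(1)[0][1] / len(sample)
    print(f"gen {generation}: {surviving} words survive, "
          f"top word covers {top_share:.0%} of the text")
    # The next generation only ever sees this finite sample.
    weights = [counts.get(w, 0) for w in vocab]
```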

The problem is analogous to inbreeding in a species, says Hany Farid, a computer scientist at the University of California, Berkeley. “If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid, whose work has demonstrated the same effect in image models, producing eerie distortions of reality2.

Synthetic data problems

Model collapse does not mean that LLMs will stop working, but the cost of making them will increase, says Shumailov.

As synthetic data build up on the web, the scaling laws that state that models should get better the more data they train on are likely to break — because training data will lose the richness and variety that comes with human-generated content, says Kempe.

How much synthetic data is used in training matters. When Shumailov and his team fine-tuned each model on 10% real data, alongside synthetic data, collapse occurred more slowly. And model collapse has not yet been seen in the ‘wild’, says Matthias Gerstgrasser, an AI researcher at Stanford University in California. A study by Gerstgrasser’s team found that when synthetic data didn’t replace real data, but instead accumulated alongside them, catastrophic model collapse was unlikely3. It is unclear what happens when a model trains on data produced by a different AI, rather than its own.
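
A variation on the same toy resampling sketch hints at why mixing helps: if a fixed share of every generation’s sample is drawn from the original distribution, the tail erodes much more slowly. The 10% share mirrors the fraction used in the study; the rest of the sketch is purely illustrative.

```python
import random
from collections import Counter

random.seed(0)
vocab = list(range(1000))
real_weights = [1.0 / (rank + 1) for rank in range(len(vocab))]
REAL_FRACTION = 0.10   # share of each generation's data drawn from "real" text
SAMPLE_SIZE = 5000

weights = real_weights[:]
for generation in range(10):
    real_part = random.choices(vocab, weights=real_weights,
                                k=int(SAMPLE_SIZE * REAL_FRACTION))
    synthetic_part = random.choices(vocab, weights=weights,
                                    k=SAMPLE_SIZE - len(real_part))
    counts = Counter(real_part + synthetic_part)
    print(f"gen {generation}: {len(counts)} words survive")
    weights = [counts.get(w, 0) for w in vocab]
```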

Developers might need to find ways, such as watermarking, to keep AI-generated data separate from real data, which would require unprecedented coordination by big-tech firms, says Shumailov. And society might need to find incentives for human creators to keep producing content. Filtering is likely to become important, too — for example, humans could curate AI-generated text before it goes back into the data pool, says Kempe. “Our work4 shows that if you can prune it properly, the phenomenon can be partly or maybe fully avoided,” she says.