[ExI] AI Model Collapse
Stuart LaForge
avant at sollegro.com
Fri May 30 15:27:56 UTC 2025
Since we have been talking on the list about AIs recursively self-improving, I
thought this article was pertinent and could affect the timeline to AGI. There
is a phenomenon called AI model collapse, which occurs when AIs are trained on
their own output. This produces an echo-chamber effect that reinforces
hallucinations, biases, and misinformation, degrading output quality. Because
much of the output of AI now gets posted to the Internet, and new AI models
are then trained on data scraped from the Internet, their training data
becomes contaminated with AI output, which can lead to model collapse. This is
the informatic equivalent of biological inbreeding, where deleterious
mutations get amplified and reinforced in a genetic lineage, resulting in all
sorts of pathologies.
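The core mechanism is easy to sketch in code. Below is a toy illustration (my own construction, not the paper's LLM/VAE/GMM experiments): a "model" is just a categorical distribution over tokens, and each generation refits it by counting a finite sample drawn from the previous generation's model. Rare tokens that happen to get zero draws vanish permanently, so the distribution's tails disappear, much as the abstract describes.

```python
# Toy sketch of model collapse via finite sampling (illustrative only).
# A "model" is a categorical distribution over tokens; each generation
# refits it from a finite sample of its own output. Once a rare token
# draws zero samples, its probability hits zero and it can never return.
import random
from collections import Counter

random.seed(42)

# Long-tailed "human" distribution over 20 tokens: p(k) proportional to 1/(k+1)
tokens = list(range(20))
weights = [1.0 / (k + 1) for k in tokens]
total = sum(weights)
probs = [w / total for w in weights]

SAMPLE_SIZE = 100
support_sizes = []
for generation in range(15):
    support_sizes.append(sum(p > 0 for p in probs))  # tokens still "alive"
    sample = random.choices(tokens, weights=probs, k=SAMPLE_SIZE)
    counts = Counter(sample)
    probs = [counts[k] / SAMPLE_SIZE for k in tokens]  # refit on own output

print("support per generation:", support_sizes)
# Support can only shrink: a zero-probability token is never sampled
# again, so tail mass lost to sampling noise is gone for good.
assert all(a >= b for a, b in zip(support_sizes, support_sizes[1:]))
```

The one-way ratchet in the assertion is the point: sampling noise can delete a tail token in one generation but can never resurrect it, which is why the paper calls the defects irreversible.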
https://www.nature.com/articles/s41586-024-07566-y
Abstract
Stable diffusion revolutionized image creation from descriptive text.
GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high
performance across a variety of language tasks. ChatGPT introduced such
language models to the public. It is now clear that generative
artificial intelligence (AI) such as large language models (LLMs) is
here to stay and will substantially change the ecosystem of online text
and images. Here we consider what may happen to GPT-{n} once LLMs
contribute much of the text found online. We find that indiscriminate
use of model-generated content in training causes irreversible defects
in the resulting models, in which tails of the original content
distribution disappear. We refer to this effect as ‘model collapse’ and
show that it can occur in LLMs as well as in variational autoencoders
(VAEs) and Gaussian mixture models (GMMs). We build theoretical
intuition behind the phenomenon and portray its ubiquity among all
learned generative models. We demonstrate that it must be taken
seriously if we are to sustain the benefits of training from large-scale
data scraped from the web. Indeed, data collected about genuine human
interactions with systems will be increasingly valuable in the presence of
LLM-generated content in data crawled from the Internet.
---------------------------------
Stuart LaForge