<p dir="ltr">I think the word "indiscriminate" is doing a lot of heavy lifting here. Afaik, all of the major AI companies are now using highly curated text based for training, which do include text generated by other LLMs and possibly their own, but not in an indiscriminate manner. The paper describes a degenerate case where an LLMs output becomes the majority of its own training corpus, and this is not something that has been observed in the real world yet.</p>
On Fri, May 30, 2025, 9:30 AM Stuart LaForge via extropy-chat <extropy-chat@lists.extropy.org> wrote:

> Since we have been talking about AIs recursively self-improving on the
> lists, I thought this was a pertinent article that could affect the
> timeline to AGI. There is a phenomenon called AI model collapse, which
> occurs when AIs are trained on their own output. This produces an echo
> chamber effect that reinforces hallucinations, biases, and
> misinformation, resulting in a degradation of output quality. Because
> much of the output of AI these days gets posted to the Internet, and new
> AI models are then trained on the Internet, their training data becomes
> contaminated with AI output, which can lead to AI model collapse. This
> is the informatic equivalent of biological inbreeding, where deleterious
> mutations get amplified and reinforced in a genetic lineage, resulting
> in all sorts of pathologies.
>
> https://www.nature.com/articles/s41586-024-07566-y
>
> Abstract
> Stable diffusion revolutionized image creation from descriptive text.
> GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high
> performance across a variety of language tasks. ChatGPT introduced such
> language models to the public. It is now clear that generative
> artificial intelligence (AI) such as large language models (LLMs) is
> here to stay and will substantially change the ecosystem of online text
> and images. Here we consider what may happen to GPT-{n} once LLMs
> contribute much of the text found online. We find that indiscriminate
> use of model-generated content in training causes irreversible defects
> in the resulting models, in which tails of the original content
> distribution disappear. We refer to this effect as ‘model collapse’ and
> show that it can occur in LLMs as well as in variational autoencoders
> (VAEs) and Gaussian mixture models (GMMs). We build theoretical
> intuition behind the phenomenon and portray its ubiquity among all
> learned generative models. We demonstrate that it must be taken
> seriously if we are to sustain the benefits of training from large-scale
> data scraped from the web. Indeed, the value of data collected about
> genuine human interactions with systems will be increasingly valuable in
> the presence of LLM-generated content in data crawled from the Internet.
> ---------------------------------
>
> Stuart LaForge