
highplainsdem

(49,378 posts)
Wed Aug 2, 2023, 07:53 PM Aug 2023

When AI Is Trained on AI-Generated Data, Strange Things Start to Happen (Futurism)

This new article in Futurism is about a fundamental problem with generative AI like ChatGPT and Midjourney - a problem I posted about on June 21 - https://www.democraticunderground.com/100218028323 - but that post was about an Atlantic article and different researchers.

This new article has a long interview with the Rice University researchers.

https://futurism.com/ai-trained-ai-generated-data-interview

-snip-

To understand this phenomenon better, we spoke to machine learning researchers Sina Alemohammad and Josue Casco-Rodriguez, both PhD students in Rice University's Electrical and Computer Engineering department, and their supervising professor, Richard G. Baraniuk. In collaboration with researchers at Stanford, they recently published a fascinating — though yet to be peer-reviewed — paper on the subject, titled "Self-Consuming Generative Models Go MAD."

MAD, which stands for Model Autophagy Disorder, is the term that they've coined for AI's apparent self-allergy. In their research, it took only five cycles of training on synthetic data for an AI model's outputs to, in the words of Baraniuk, "blow up."

-snip-

We summarize this in what we call an autophageous loop. That's a technical term that basically just means self-consuming. Maybe think of an animal, not just chasing his tail, but eating his tail. We also like the analogy to Mad Cow Disease — feeding cows to other young cows in an ever-repeating cycle that leads to brain-destroying pathogens. So basically, our work has been studying these self-consuming loops, and understanding when models go MAD, if you will. When bad things happen, and what to do if you don't want bad things to happen.

-snip-

Baraniuk: This is a really important long-term question. There's no question that MADness has the potential to significantly reduce the quality of the data on the internet. Just the quality of the data. And our work in this particular paper hasn't really dealt with the kind of AI systems used, let's say, in search engines. So it's a bit too early to tell. But there is some other work out there that actually shows that if you train a different, non-generative AI system — some other kind of AI system like the kind used in search engines — if you train that AI system using synthetic data in addition to real data, performance actually goes down.

-snip-


Much more at the link.
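To make the "self-consuming loop" from the excerpt concrete, here's a toy sketch of my own (not the researchers' code) of the simplest possible version: fit a one-dimensional Gaussian to some real data, sample a fresh "synthetic" training set from the fitted model, refit on that, and repeat. Because each generation is estimated from a finite sample of the previous generation's output, the estimated mean and spread drift over the generations, with nothing to pull them back toward the real distribution - a crude analogue of the loss of quality and diversity the paper describes. The sample sizes and generation counts below are illustrative, not from the paper.

import numpy as np

rng = np.random.default_rng(0)

def self_consuming_loop(n_samples=50, generations=30):
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    history = []
    for gen in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu, sigma = data.mean(), data.std()
        history.append((gen, mu, sigma))
        # Replace the training set entirely with synthetic samples from the
        # fitted model - a fully self-consuming loop with no fresh real data.
        data = rng.normal(loc=mu, scale=sigma, size=n_samples)
    return history

for gen, mu, sigma in self_consuming_loop():
    print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")

In this fully synthetic loop nothing corrects the estimation error from one generation to the next, so the errors compound. The real models and datasets involved are vastly more complex, but that compounding is the basic mechanism behind the "blow up" Baraniuk describes.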

So generative AI results are poisoning the internet, and the problem will get worse as more AI-generated text and images are uploaded. And, as pointed out in that last paragraph of the excerpt, there's already evidence that non-generative AI like basic search engines will be harmed by this, with performance going down.
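On that last point - non-generative systems also degrading when synthetic data gets into their training sets - here is a second hypothetical sketch, again mine and not from the paper: train an ordinary scikit-learn classifier once on real data alone and once on real data diluted with samples from a deliberately crude generative stand-in (class-conditional Gaussians fit to the real data, with inflated spread), then compare accuracy on an untouched test set. How much the score drops depends on how faithful the synthetic data is; the point is only to show the comparison.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# "Real" data, plus a held-out test set that is never contaminated.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

def crude_synthetic(X_real, y_real, n_per_class=2000):
    # Sample labeled points from class-conditional Gaussians fit to the real
    # data, with the spread inflated so the classes blur - an imperfect copy.
    Xs, ys = [], []
    for label in np.unique(y_real):
        Xc = X_real[y_real == label]
        mean, std = Xc.mean(axis=0), Xc.std(axis=0)
        Xs.append(rng.normal(mean, std * 1.5, size=(n_per_class, X_real.shape[1])))
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)

# Baseline: train on real data only.
real_only = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Contaminated: the same real data plus a larger volume of crude synthetic data.
X_syn, y_syn = crude_synthetic(X_train, y_train)
mixed = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_train, X_syn]), np.concatenate([y_train, y_syn]))

print("accuracy, real only :", real_only.score(X_test, y_test))
print("accuracy, real+synth:", mixed.score(X_test, y_test))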

That Atlantic article I posted about in June ended by pointing out that even if feeding generative AI synthetic data doesn't end in complete "model collapse" - which they likened to dementia - the more subtle deterioration will be like lead poisoning:

https://www.theatlantic.com/technology/archive/2023/06/generative-ai-future-training-models/674478/

The chatbots might not eat themselves so much as leach undetectable traces of cybernetic lead that accumulate across the internet with time, poisoning not just their own food and water supply, but humanity’s.


This new Futurism article ends with the lead researcher saying their study began with an offhand remark by a researcher in another field who predicted there would soon be more AI-generated websites than real websites online.
When AI Is Trained on AI-Generated Data, Strange Things Start to Happen (Futurism) (Original Post) highplainsdem Aug 2023 OP
Feedback loops and signal degradation masmdu Aug 2023 #1
K&R Alice Kramden Aug 2023 #2
Thanks! highplainsdem Aug 2023 #4
This is fantastick LearnedHand Aug 2023 #3
So, the defense against cognitive AI is to use more AI ? NotASurfer Aug 2023 #5
The defense is to invest in the education of our children Ms. Toad Aug 2023 #7
This is really not qualitatively different from what I have observed Ms. Toad Aug 2023 #6

LearnedHand

(3,433 posts)
3. This is fantastick
Wed Aug 2, 2023, 08:12 PM
Aug 2023

You instinctively know the shit is only a couple of inches beneath the surface in ai's "intelligence" because it's trained on data scraped from the open internet. Now they've named a syndrome (!!) for what happens when ai eats its own shit?? By logical extension, what does that say about people who consume ai data as if it were meaningful?

NotASurfer

(2,176 posts)
5. So, the defense against cognitive AI is to use more AI ?
Wed Aug 2, 2023, 09:41 PM
Aug 2023

We can reduce the product of decades of research and development to babbling incoherent gibberish that easily?

Ms. Toad

(34,423 posts)
7. The defense is to invest in the education of our children
Wed Aug 2, 2023, 10:24 PM
Aug 2023

with a focus on source/fact verification in the digital age.

This is not a new problem - I have been aware of it since the late 90s (at least). The proliferation of garbage data has been increasing both with near-universal access to the internet (without the skills or inclination to fact check), and with the increasing echo chambers we all fall into absent a conscious effort to expose ourselves to divergent thinking.

Everyone needs to be able to run a source to the ground - and to recognize when the interpretation of the source modifies the content. Our schools used to teach that (I learned it in the 70s) but - by and large - they are not teaching it anymore (my daughter, who graduated from high school in 2008, was not taught it). We need to reinvest in digital literacy, and we need to find a way to pull in the lost generations who missed out on it (or perhaps learned it and forgot it).

Ms. Toad

(34,423 posts)
6. This is really not qualitatively different from what I have observed
Wed Aug 2, 2023, 10:19 PM
Aug 2023

with ordinary search engines/social media - or even court opinions when fact-checking is lax.

In court opinions, it is common to cite to a prior case and rephrase the legal concept. In my time drafting opinions for my judge's signature, I ALWAYS tracked the citation back to the original case, and cited the original case.

In more incidents than I can count on both hands, the proposition in the case I wanted to cite (since it was on point) bore little resemblance to the original case that generated it. Sometimes I traced back through a half-dozen cases to find the original case. The intermediate cases each cited the prior case as citing the original (so I was able to track back to the original). Paraphrases were inserted in quotation marks when they weren't actual quotations - and, in some instances, the proposition I wanted to cite the original case for had been perverted enough that the original case actually contradicted the most recent case.


The discussion we had a few days ago about ChatGPT and the bar exam is another perfect example. The original research, although poorly designed, did accurately describe what it did: administer an exam which simulated the UBE, with half of the questions being those used in July 2022 - and the other half being a set of 200 questions which were never included as part of the exam. It described the score they assigned to ChatGPT - and compared the score to passing scores in various jurisdictions. That morphed through several reports into headlines that ChatGPT actually passed the bar exam, with no mention that the exam it took was not the real exam, and no mention that passing is jurisdiction-specific (the UBE is not a direct admission ticket to the bar; it is merely the exam which generates a score which some jurisdictions use for admission to the bar).

In both of these examples, there is original - real - data. The intermediate steps involve synthetic data - data generated by someone or something (in this case all human) misinterpreting the original - which leads to a final step that is several steps removed from the original data, connected to it only through that intermediate synthetic data.

It happens much faster with AI, but it is not fundamentally different from the examples I've described - or (for example) the MAGA echo chamber recycling and embellishing its lies as synthetic data, on which new MAGA lies are built.

It also happens daily on DU - when someone posts a headline (even one with a link) and no one even bothers to follow the link, let alone verify the content at the link - and the thread devolves into a conversation that goes off the rails in ways that are completely unrelated to the original source.

That's why I keep harping on the need to focus - early in the education of our children - on the process of source verification, whether the most recent source is AI, a search engine, Facebook, or a court case. If the source isn't verified (as I verified both the court cases and the actual research), and all we rely on for our next steps is intermediate synthetic data, the output will be unrecognizable (as compared to the actual data).

I have been specifically watching this deterioration of the ability (and willingness) to verify the content we are using since 1998.
