
highplainsdem

(52,139 posts)
Thu Sep 19, 2024, 09:33 AM

Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data'

Source: 404 Media

The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility.

Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project’s GitHub, creator Robyn Speer wrote that the project “will not be updated anymore.”

“Generative AI has polluted the data,” she wrote. “I don’t think anyone has reliable information about post-2021 language usage by humans.”

She said that open web scraping was an important part of the project’s data sources and “now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.”

-snip-

Read more: https://www.404media.co/project-analyzing-human-language-usage-shuts-down-because-generative-ai-has-polluted-the-data/



Generative AI pollutes our information ecosystem, whether it's something like what this article describes, AI search returning inaccurate and often false and defamatory results, or search results for an artist's work burying their real art under AI slop copying their style. And the problem will only worsen as genAI is used more and more.

It's as damaging to our information ecosystem as pollutants and climate change are to our planet, but instead of the harm being done over decades, it's already causing enormous damage in just a couple of years.
14 replies
Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data' (Original Post) highplainsdem Sep 19 OP
The AI Event Horizon usonian Sep 19 #1
"Damn near everything is AI-produced?" Very doubtful. Take a deep breath and get a grip. Martin68 Sep 19 #3
Interesting side effect/unforeseen consequences of AI. Martin68 Sep 19 #2
More harmful than interesting, and it was perfectly foreseeable. This has always been badly flawed highplainsdem Sep 19 #4
This article has nothing to do with either stolen intellectual property or inaccuracy. You have an axe to grind, but it Martin68 Sep 19 #5
The article, which I posted, is about the internet's pollution by AI slop "generated by large language models, written highplainsdem Sep 19 #7
Good point. The authors of the article are grinding the same axe you are. But you're missing the point. The research was Martin68 Sep 19 #14
Hardly unforeseen. The AI enthusiasts dismiss this as the "dead internet" conspiracy theory. Eugene Sep 19 #10
I believe it. My college professor friends all gripe about AI written papers. LisaM Sep 19 #6
GenAI has been terrible for teachers' morale. highplainsdem Sep 19 #8
There's a whole new crop of professors who have gone all "How I Learned to Stop Worrying and Love AI" Prairie Gates Sep 19 #9
And when they are caught, they argue and their parents back them up. LisaM Sep 19 #12
Better to spend the time devising new activities for "participation equity" Prairie Gates Sep 19 #13
My son says they're trying to crack down on AI-generated writing at his community college. He is adamantly LauraInLA Sep 19 #11

Martin68

(24,517 posts)
3. "Damn near everything is AI-produced?" Very doubtful. Take a deep breath and get a grip.
Thu Sep 19, 2024, 10:15 AM

Yes, there is enough AI-produced content on the internet to contaminate research into changing patterns in language usage, but that doesn't mean everything - or even a majority of things - is AI-produced. It wouldn't take much to skew research like this. Even 10% would be an issue.
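
A rough back-of-the-envelope sketch of why even 10% could matter. This is not taken from Wordfreq itself; the function name `mixed_frequency` and every number below are hypothetical, chosen only to illustrate the arithmetic. It assumes the scraped corpus is a simple blend of human and AI text, and that some word is far more common in AI output than in human writing.

```python
def mixed_frequency(human_freq: float, ai_freq: float, ai_share: float) -> float:
    """Frequency of a word in a corpus that blends human and AI text.

    human_freq and ai_freq are the word's relative frequencies in each source;
    ai_share is the fraction of the corpus that is AI-generated (0.0 to 1.0).
    """
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0 and 1")
    return (1.0 - ai_share) * human_freq + ai_share * ai_freq


# Hypothetical numbers: a word people use once per million words,
# but which AI text uses fifty times per million.
human = 1e-6
ai = 50e-6

observed = mixed_frequency(human, ai, ai_share=0.10)
inflation = observed / human

print(f"observed frequency: {observed:.2e}")
print(f"apparent increase over real human usage: {inflation:.1f}x")
```

Under those assumed numbers, a corpus that is only one-tenth AI text would show the word as roughly six times more common than it really is in human writing, and there is no way to tell from the counts alone how much of that is real change.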

highplainsdem

(52,139 posts)
4. More harmful than interesting, and it was perfectly foreseeable. This has always been badly flawed
Thu Sep 19, 2024, 10:29 AM

tech that

1) should never have been developed in the first place because of the vast theft of intellectual property to train the genAI models, and

2) should never have been released to the public and hyped as artificial intelligence because of the unsolvable inaccuracy of LLMs.

On top of which it's bad for the environment - our physical environment, not just our information ecosystem: https://www.democraticunderground.com/1127176439 .

Martin68

(24,517 posts)
5. This article has nothing to do with either stolen intellectual property or inaccuracy. You have an axe to grind, but it
Thu Sep 19, 2024, 10:47 AM

isn't relevant to the article that was posted. The subject is a study of how language changes over time, and the inclusion of any AI-generated material contaminates the significance of the research by definition. The use of intellectual material in training AI and any inaccuracy of the information posted is not related to the topic of language change over time.

highplainsdem

(52,139 posts)
7. The article, which I posted, is about the internet's pollution by AI slop "generated by large language models, written
Thu Sep 19, 2024, 11:40 AM

by no one to communicate nothing."

This AI slop polluting the internet was anything but unforeseen, as you suggested - the AI companies have been pushing people to use AI for writing everywhere, including social media posts - and you trivialized it by just referring to it as "interesting."

It's one of the reasons genAI is harmful. I simply mentioned a couple of others.

And btw, in the article both Robyn Speer, the creator of Wordfreq, and 404 Media editor Jason Koebler, who wrote the article, refer to the theft of intellectual property for AI companies' "plagiarism machines" - their wording. Maybe you should contact them to lecture them and tell them that what they said about it "is not related to the topic of language change over time." They're probably unaware of the rules you've set for what can be mentioned.

Martin68

(24,517 posts)
14. Good point. The authors of the article are grinding the same axe you are. But you're missing the point. The research was
Thu Sep 19, 2024, 12:48 PM

not compromised by plagiarism. It was compromised because AI-generated text cannot be included in a study of how the use of language by human beings changes over time. It has nothing to do with the theft of intellectual property.

Eugene

(62,627 posts)
10. Hardly unforeseen. The AI enthusiasts dismiss this as the "dead internet" conspiracy theory.
Thu Sep 19, 2024, 12:03 PM

The real fun begins when generative AI models train on AI bot-generated content.

LisaM

(28,558 posts)
6. I believe it. My college professor friends all gripe about AI written papers.
Thu Sep 19, 2024, 10:57 AM

This is probably the kind of thing that clues them in: the lack of colloquial language, the rambling logic, and the weird genericness that such writing has. Just the other day, a friend was steeling herself to grade papers because she was so tired of reading papers written by computers.

Prairie Gates

(2,865 posts)
9. There's a whole new crop of professors who have gone all "How I Learned to Stop Worrying and Love AI"
Thu Sep 19, 2024, 11:55 AM

Their arguments are...not great.

Universities have gone all "Learn how to teach with AI. Join our Teaching and Learning Center brown bag series." There's a committee in the department working on it, believe that.

It's just the latest step in "Learn how to teach students who don't want to be here, aren't interested, engage them engage them engage them. Sure, let them use AI to their heart's content. Never suggest that they are perhaps doing something wrong or unethical, that's doing harm! Harm! Protect them!" The whole thing is bananas.

LisaM

(28,558 posts)
12. And when they are caught, they argue and their parents back them up.
Thu Sep 19, 2024, 12:13 PM

One of them said she dreads doing anything about it because then she's in for at least 25 hours of arbitration.

Prairie Gates

(2,865 posts)
13. Better to spend the time devising new activities for "participation equity"
Thu Sep 19, 2024, 12:29 PM


This week's brown bag: "How can we get students who show up to the Zoom late, never turn their cameras on, never say anything, and don't appear to even be at their computer to feel more welcome in the class?"

LauraInLA

(1,280 posts)
11. My son says they're trying to crack down on AI-generated writing at his community college. He is adamantly
Thu Sep 19, 2024, 12:13 PM

opposed to it, as he feels it prevents students from learning to write. I heartily agree, but find it hopeful that he and other kids might feel this way. He’s a true kid of tech, but I’m beginning to wonder if this generation will produce a non-Luddite but anti-tech and anti-social media movement.
