Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data'
Source: 404 Media
The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility.
Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project's GitHub, creator Robyn Speer wrote that the project will not be updated anymore.
"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans."
She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing." Including this slop in the data skews the word frequencies.
-snip-
Read more: https://www.404media.co/project-analyzing-human-language-usage-shuts-down-because-generative-ai-has-polluted-the-data/
Generative AI pollutes our information ecosystem, whether it's something like what this article describes, AI search returning inaccurate and often false and defamatory results, or search results for an artist's work burying their real art under AI slop copying their style. And the problem will only worsen as genAI is used more and more.
It's as damaging to our information ecosystem as pollutants and climate change are to our planet, but instead of the harm being done over decades, it's already causing enormous damage in just a couple of years.
usonian
(13,575 posts)When damn near everything is AI-produced ...
I just dunno!
Martin68
(24,517 posts)Yes, there is enough AI-produced content on the internet to contaminate research into changing patterns in language usage, but that doesn't mean everything - or even a majority of things - is AI-produced. It wouldn't take much to skew research like this. Even 10% would be an issue.
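To put rough numbers on the 10% point, here is a minimal Python sketch. Everything in it is an assumption for illustration: the frequencies are made up (not Wordfreq data), `mix_freq` and `rare_word` are hypothetical names, and this is not how Wordfreq itself computes anything. It only shows the arithmetic of blending two sources, where a word that machine text overuses gets inflated even when machine text is a small share of the corpus.

```python
def mix_freq(human, synthetic, share):
    """Blend two relative-frequency tables.

    `share` is the fraction of tokens in the combined corpus that come
    from the synthetic source (0.10 means 10%). Words missing from one
    table are treated as having frequency 0 there.
    """
    words = set(human) | set(synthetic)
    return {
        w: (1 - share) * human.get(w, 0.0) + share * synthetic.get(w, 0.0)
        for w in words
    }


# Made-up relative frequencies, purely illustrative.
human = {"rare_word": 0.00001, "the": 0.05}      # how often people use each word
synthetic = {"rare_word": 0.0005, "the": 0.05}   # machine text overuses rare_word 50x

mixed = mix_freq(human, synthetic, share=0.10)

# How much the measured frequency overstates real human usage.
ratio = mixed["rare_word"] / human["rare_word"]
print(f"rare_word looks {ratio:.1f}x more common than it really is")
```

With these toy numbers, a 10% share of machine text makes the rare word look several times more common than people actually use it, while a common word like "the" barely moves. That is why a small contamination rate can still matter for frequency research.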
Martin68
(24,517 posts)
highplainsdem
(52,139 posts)tech that
1) should never have been developed in the first place because of the vast theft of intellectual property to train the genAI models, and
2) should never have been released to the public and hyped as artificial intelligence because of the unsolvable inaccuracy of LLMs.
On top of which it's bad for the environment - our physical environment, not just our information ecosystem: https://www.democraticunderground.com/1127176439 .
Martin68
(24,517 posts)isn't relevant to the article that was posted. The subject is a study of how language changes over time, and the inclusion of any AI-generated material contaminates the significance of the research by definition. The use of intellectual material in training AI and any inaccuracy of the information posted is not related to the topic of language change over time.
highplainsdem
(52,139 posts)by no one to communicate nothing."
This AI slop polluting the internet was anything but unforeseen, as you suggested - the AI companies have been pushing people to use AI for writing everywhere, including social media posts - and you trivialized it by just referring to it as "interesting."
It's one of the reasons genAI is harmful. I simply mentioned a couple of others.
And btw, in the article both Robyn Speer, the creator of Wordfreq, and 404 Media editor Jason Koebler, who wrote the article, refer to the theft of intellectual property for AI companies' "plagiarism machines" - their wording. Maybe you should contact them to lecture them and tell them that what they said about it "is not related to the topic of language change over time." They're probably unaware of rules you've set for what can be mentioned.
Martin68
(24,517 posts)not compromised by plagiarism. It was compromised because AI-generated text cannot be included in a study of how the use of language by human beings changes over time. It has nothing to do with the theft of intellectual property.
Eugene
(62,627 posts)The real fun begins when generative AI models train on AI bot-generated content.
LisaM
(28,558 posts)This is probably the kind of thing that clues them in: the lack of colloquial language, the rambling logic, and the weird genericness that such writing has. Just the other day, a friend was steeling herself to grade papers because she was so tired of reading papers written by computers.
highplainsdem
(52,139 posts)
Prairie Gates
(2,865 posts)Their arguments are...not great.
Universities have gone all "Learn how to teach with AI. Join our Teaching and Learning Center brown bag series." There's a committee in the department working on it, believe that.
It's just the latest step in "Learn how to teach students who don't want to be here, aren't interested, engage them engage them engage them. Sure, let them use AI to their heart's content. Never suggest that they are perhaps doing something wrong or unethical, that's doing harm! Harm! Protect them!" The whole thing is bananas.
LisaM
(28,558 posts)One of them said she dreads doing anything about it because then she's in for at least 25 hours of arbitration.
Prairie Gates
(2,865 posts)This week's brown bag: "How can we get students who show up to the Zoom late, never turn their cameras on, never say anything, and don't appear to even be at their computer to feel more welcome in the class?"
LauraInLA
(1,280 posts)opposed to it, as he feels it prevents students from learning to write. I heartily agree, but find it hopeful that he and other kids might feel this way. He's a true kid of tech, but I'm beginning to wonder if this generation will produce a non-Luddite but anti-tech and anti-social media movement.