
Latest Breaking News


highplainsdem

(52,139 posts)
Thu Sep 19, 2024, 09:33 AM

Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data'

Source: 404 Media

The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility.

Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project’s GitHub, creator Robyn Speer wrote that the project “will not be updated anymore.”

“Generative AI has polluted the data,” she wrote. “I don’t think anyone has reliable information about post-2021 language usage by humans.”

She said that open web scraping was an important part of the project’s data sources and “now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.”

-snip-

Read more: https://www.404media.co/project-analyzing-human-language-usage-shuts-down-because-generative-ai-has-polluted-the-data/



Generative AI pollutes our information ecosystem, whether it's something like what this article describes, AI search returning inaccurate and often false and defamatory results, or search results for an artist's work burying their real art under AI slop copying their style. And the problem will only worsen as genAI is used more and more.

It's as damaging to our information ecosystem as pollutants and climate change are to our planet, but instead of taking decades, the harm is already enormous after just a couple of years.
14 replies
The AI Event Horizon usonian Sep 19 #1
"Damn near everything is AI-produced?" Very doubtful. Take a deep breath and get a grip. Martin68 Sep 19 #3
Interesting side effect/unforeseen consequences of AI. Martin68 Sep 19 #2
More harmful than interesting, and it was perfectly foreseeable. This has always been badly flawed highplainsdem Sep 19 #4
This article has nothing to do with either stolen intellectual property or inaccuracy. You have an axe to grind, but it Martin68 Sep 19 #5
The article, which I posted, is about the internet's pollution by AI slop "generated by large language models, written highplainsdem Sep 19 #7
Good point. The authors of the article are grinding the same axe you are. But you're missing the point. The research was Martin68 Sep 19 #14
Hardly unforeseen. The AI enthusiasts dismiss this as the "dead internet" conspiracy theory. Eugene Sep 19 #10
I believe it. My college professor friends all gripe about AI written papers. LisaM Sep 19 #6
GenAI has been terrible for teachers' morale. highplainsdem Sep 19 #8
There's a whole new crop of professors who have gone all "How I Learned to Stop Worrying and Love AI" Prairie Gates Sep 19 #9
And when they are caught, they argue and their parents back them up. LisaM Sep 19 #12
Better to spend the time devising new activities for "participation equity" Prairie Gates Sep 19 #13
My son says they're trying to crack down on AI-generated writing at his community college. He is adamantly LauraInLA Sep 19 #11