
highplainsdem

(49,140 posts)
Sat Apr 6, 2024, 04:39 PM

How Tech Giants Cut Corners to Harvest Data for A.I. (NYT)

https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html?smid=nytcore-ios-share&referringSource=articleShare&sgrp=c-cb

How Tech Giants Cut Corners to Harvest Data for A.I.
OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems.

By Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A. Thompson and Nico Grant
Reporting from San Francisco, Washington and New York
April 6, 2024


-snip-

So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an A.I. system smarter.

Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.

Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI’s president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful A.I. models and was the basis of the latest version of the ChatGPT chatbot.

-snip-

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

-snip-



Much, much more at the link. The story of how AI companies, including Google, trained the AI they now hope to make billions, even trillions, from is one of grand theft that they all knew was grand theft.

As Justine Bateman said (the quote is in the article), "This is the largest theft in the United States, period."

It's a very long article, but well worth reading in its entirety to understand that these companies were well aware they were breaking laws and violating intellectual property rights, but they chose to do so anyway as they've engaged in a crazy AI arms race to have the biggest and best AI models.

They had meetings about this being unethical and illegal. They chose to do it anyway.

I've said repeatedly in posts here that GenAI, generative AI, is FUNDAMENTALLY unethical.

Big Tech essentially set out to steal as much of our culture and knowledge as possible, for the purpose of selling it back to us, with no real intention of ever compensating all the people they stole from.

And IF you use GenAI, tools like ChatGPT and Midjourney and Copilot - with the exception of a few GenAI models with legally licensed datasets (and there's dispute about whether some of those are truly legal) - you're basically saying you're okay with that theft.

erronis

(15,486 posts)
1. Of course they lied, denied, stole, cheated - that's how the big companies roll
Sat Apr 6, 2024, 04:52 PM

If they get caught, they pay a few million dollars in a fine that implies no malfeasance. What's a few fewer lattes for the jet set?

My understanding and some knowledge of these companies (and governments) is that once they get the data, they never, ever, delete it completely. They may remove some of it from some caches or some local storage. It is backed up for millennia and will be recalled and reused whenever it serves their purposes.

highplainsdem

(49,140 posts)
3. That does not make it okay for anyone to go along with their theft. To choose to accept it.
Sat Apr 6, 2024, 05:45 PM

The companies need much more widespread adoption of their AI tools by paying customers to make them profitable. Those tools should be rejected.

If someone is forced to use illegally trained GenAI in their job, and they can't immediately find a more ethical job, that might be some excuse.

But there's no such excuse for people making individual choices to use it, for amusement or profit, if they're aware of the theft it's based on.

highplainsdem

(49,140 posts)
2. Verge article about this NYT expose:
Sat Apr 6, 2024, 04:53 PM
https://www.theverge.com/2024/4/6/24122915/openai-youtube-transcripts-gpt-4-training-data-google

The Times writes that Google’s legal department asked the company’s privacy team to tweak its policy language to expand what it could do with consumer data, such as its office tools like Google Docs. The new policy was reportedly intentionally released on July 1st to take advantage of the distraction of the Independence Day holiday weekend.

snot

(10,549 posts)
4. If YouTube's & other movies' auto-generated subtitles are any indication,
Sat Apr 6, 2024, 07:06 PM

Whisper is an idiot and no wonder AI is, too.
