2 authors say OpenAI 'ingested' their books to train ChatGPT. Now they're suing, and a 'wave' of similar court cases may follow.

L4sBot@lemmy.world · 2 years ago


OldGreyTroll@kbin.social · 2 years ago

If I read a book to inform myself, put my notes in a database, and then write articles, it is called “research”. If I write a computer program to read a book to put the notes in my database, it is called “copyright infringement”. Is the problem that there just isn’t a meatware component? Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?

ash@lemmy.fmhy.ml · 2 years ago

I honestly do not care whether it is or is not copyright infringement, just hope to see “AI” burn :3

Dav@kbin.social · 2 years ago

AI isn’t a bogeyman, it’s a set of tools. No chance it’s going away even if OpenAI suddenly disappeared.

ash@lemmy.fmhy.ml · 2 years ago

I understand, but I will continue to stubbornly dislike LLMs.

Lydia_K@lemmy.world · 2 years ago

Can I ask why you feel that way?

ash@lemmy.fmhy.ml · 2 years ago

I dislike general artificial intelligence. I understand that it can be a useful tool, but at the same time the thought of being in a world where people’s jobs can be replaced with robots for the sake of profit and you won’t be able to tell whether you are talking with a real person or not repulses me.

Lydia_K@lemmy.world · 2 years ago

Well, while I do agree that it sucks that some jobs may get replaced, history has shown that it always leads to creating more jobs in their place. The weavers lost their jobs when the power loom came about, but far more jobs were created because of it. The same goes for the printing press and every other advancement: the nature of advancing technology is to replace the old with the new.

Ugh, the robot phone calls are going to get a hundred times worse, that one is true. I’m not sure if it’ll make the standard corporate phone maze better or worse. Maybe better, because at least you can screw with the robot while you wait instead of having the same 30 seconds of highly compressed garbage elevator music blasted into your ear on repeat.

bioemerl@kbin.social · 2 years ago

Yeah. There are valid copyright claims, because there are times that ChatGPT will reproduce stuff like code line for line over 10, 20, or 30 lines, which is really obviously a violation of copyright.

However, just pulling in a story from context and then summarizing it? That’s not a copyright violation, that’s a book report.

Wander@kbin.social · edit-2 2 years ago

Say I see a book that sells well. It’s in a language I don’t understand, but I use a thesaurus to replace lots of words with synonyms. I switch some sentences around, and maybe even mix pages from similar books into it. I then go and sell this book (still not knowing what the book actually says).

I would call that copyright infringement. The original book didn’t inspire me, it didn’t teach me anything, and I didn’t add any of my own knowledge into it. I didn’t produce any original work, I simply mixed a bunch of things I don’t understand.

That’s what these language models do.

nlogn@lemmy.world · 2 years ago

Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?

This is exactly the problem. Months ago I read that AI could have free access to all public source codes on GitHub without respecting their licenses.

So many developers have decided to abandon GitHub for other alternatives, not realizing that in the end AI training can just as easily access their public repos on other platforms.

What should be done is to regulate this training, which however is not convenient for companies because the more data the AI ingests, the more its knowledge expands and “helps” the people who ask for information.

bioemerl@kbin.social · 2 years ago

It’s incredibly convenient for companies.

Big companies like OpenAI can easily afford to download big data sets from companies like Reddit and DeviantArt, which already have permission to freely use whatever work you upload to their websites.

Individual creators do not have that ability, and this kind of regulation will only force AI into the domain of these big companies even more than it already is.

Regulation would be a hideously bad idea that would lock these powerful tools behind the shitty web APIs that nobody has control over but the company in question.

Imagine a future world full of magical new-age technology, and Facebook owns all of it.

Do not allow that to happen.

Kilamaos@lemmy.world · 2 years ago

Plus, any regulation to limit this now means that anyone not already in the game will never break through. It’s going to be the domain of the current players for years, if not decades. So I’m not sure what’s better: the current wild west where everyone can make something, or it being exclusive to the already big players, with them closing the door behind them.

SirGolan@lemmy.sdf.org · 2 years ago

My concern here is that OpenAI didn’t have to share GPT with the world. These lawsuits are going to discourage companies from doing that in the future, which means well-funded companies will just keep it under wraps. Once one of them eventually figures out AGI, they’ll just use it internally until they dominate everything. Suddenly, Mark Zuckerberg is supreme leader and we all have to pledge allegiance to Facebook.

mydataisplain@lemmy.world · 2 years ago

Is it practically feasible to regulate the training? Is it even necessary? Perhaps it would be better to regulate the output instead.

It will be hard to know whether any particular GET request is ultimately used to train an AI or to train a human. It’s currently easy to see if a particular output is plagiarized (https://plagiarismdetector.net/). Checking the output is also much easier to enforce. We don’t need to care if or how any particular model plagiarized work. We can just check if plagiarized work was produced.

That could be implemented directly in the software, so that it doesn’t even output plagiarized material. The legal framework around it is also clear and fairly established. Instead of creating regulations around training, we can use the existing regulations around the human who tries to disseminate copyrighted work.

That’s also consistent with how we enforce copyright for humans. There’s no law against looking at other people’s work and memorizing entire sections. It’s also generally legal to reproduce other people’s work (e.g., for backups). It only potentially becomes illegal if someone distributes it, and it’s only plagiarism if they claim it as their own.
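As a rough illustration of checking output rather than training, here is a minimal sketch of an overlap filter. It is only one simple way to do this, not how any real product or the linked detector works. The function names, the 5-word window, and the 0.5 threshold are all assumptions made up for this example, and a real system would need a far larger reference corpus and a more careful notion of "too similar".

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also appear in the reference."""
    cand = word_ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(cand & word_ngrams(reference, n)) / len(cand)


def should_block(candidate: str, references: list[str],
                 n: int = 5, threshold: float = 0.5) -> bool:
    """Hypothetical policy: block output that overlaps any reference too heavily."""
    return any(overlap_ratio(candidate, ref, n) >= threshold for ref in references)
```

A model wrapper could call `should_block(generated_text, known_works)` before returning a response and regenerate or refuse when it returns `True`. Note that this only catches near-verbatim copying; a paraphrase or summary would pass, which matches the "book report" distinction made earlier in the thread.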

Grandwolf319@sh.itjust.works · 2 years ago

This makes perfect sense. Why aren’t they going about it this way then?

My best guess is that maybe they just see OpenAI being very successful and want a piece of that pie? Cause if someone produces something via ChatGPT (let’s say for a book) and uses it, what are the chances they made any significant amount of money that you can sue for?

mydataisplain@lemmy.world · 2 years ago

It’s hard to guess what the internal motivation is for these particular people.

Right now it’s hard to know who is disseminating AI-generated material. Some people are explicit when they post it but others aren’t. The AI companies are easily identified and there’s at least the perception that regulating them can solve the problem of copyright infringement at the source. I doubt that’s true. More and more actors are able to train AI models and some of them aren’t even under US jurisdiction.

I predict that we’ll eventually have people vying to get their work used as training data. Think about what that means. If you write something and an AI is trained on it, the AI considers it “true”. Going forward when people send prompts to that model it will return a response based on what it considers “true”. Clever people can and will use that to influence public opinion. Consider how effective it’s been to manipulate public thought with existing information technologies. Now imagine large segments of the population relying on AIs as trusted advisors for their daily lives and how effective it would be to influence the training of those AIs.

ThoughtGoblin@lemm.ee · 2 years ago

AI could have free access to all public source codes on GitHub without respecting their licenses.

IANAL, but aren’t their licenses being respected up until they are put into a codebase? At least insomuch as Google is allowed to display code snippets in the preview when you look up a file in a GitHub repo, or you are allowed to copy a snippet to a StackOverflow discussion or ticket comment.

I do agree regulation is a very good idea, in more ways than just citation given the potential economic impacts that we seem clearly unprepared for.

qwertyqwertyqwerty@lemmy.one · 2 years ago

I’d say the main difference is that AI companies are profiting off of the training material, which seems unethical/illegal.

magic_lobster_party@kbin.social · 2 years ago

The fear is that the books are in one way or another encoded into the machine learning model, and that the model can somehow retrieve excerpts of these books.

Part of the training process of the model is to learn how to plagiarize the text word for word. The training input is basically “guess the next word of this excerpt”. This is quite different compared to how humans do research.
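To make the “guess the next word” idea concrete, here is a toy sketch. It is a simple word-count model, not a neural network, and it says nothing about how OpenAI’s models are actually built; all names and the two-word context size are assumptions for this example. What it does show is why the worry exists: a model trained only to predict the next word of a text can, in the extreme case of very little training data, replay that text verbatim.

```python
from collections import Counter, defaultdict


def training_pairs(text: str, context_size: int = 2):
    """Yield (context, next_word) pairs: the 'guess the next word' examples."""
    words = text.split()
    for i in range(context_size, len(words)):
        yield tuple(words[i - context_size:i]), words[i]


def train(text: str, context_size: int = 2):
    """Count which word follows each context in the training text."""
    counts = defaultdict(Counter)
    for context, next_word in training_pairs(text, context_size):
        counts[context][next_word] += 1
    return counts


def generate(counts, prompt: str, max_words: int = 20, context_size: int = 2) -> str:
    """Greedy decoding: repeatedly append the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        context = tuple(words[-context_size:])
        if context not in counts:
            break  # The model never saw this context, so it stops.
        words.append(counts[context].most_common(1)[0][0])
    return " ".join(words)
```

Trained on a single short excerpt and prompted with its first two words, this toy model reproduces the whole excerpt. Trained on a large and varied corpus, the same context is followed by many different words, so exact replay becomes less likely, which is roughly why the extent of memorization in real models is an open question.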

To what extent the books are encoded in the model is difficult to know. OpenAI isn’t exactly open about their models. Can you make ChatGPT print out entire excerpts of a book?

It’s quite a legal gray zone. I think it’s good that this is tried in court, but I’m afraid the court might have too little technical competence to make a ruling.

nyakojiru@lemmy.dbzer0.com · edit-2 2 years ago

What about… they are making billions from that “read” and “storage” of copyrighted information from other people. They need to at least give royalties. This is like Google’s behavior, using people’s data from “free” products to make billions. I would say they also need to pay people for the free data they crawled and monetized.