TECH NEWS – In the face of a growing number of lawsuits, ChatGPT developer OpenAI insists that using copyrighted content to train LLMs is fair use.
Just weeks after being sued by the New York Times for copying and using “millions” of copyrighted news articles to train large language models such as ChatGPT, OpenAI told the British House of Lords’ Communications and Digital Select Committee (as reported by The Guardian) that it has to use copyrighted material to build its systems, or they simply won’t work. In other words: “That’s it, deal with it.”
Large language models (LLMs), the technology behind AI systems such as OpenAI’s ChatGPT chatbot, ingest huge amounts of text collected from online sources and “learn” the patterns in it – the basic idea is sketched below.
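To make that “learning” a little more concrete, here is a deliberately toy sketch in Python. It is not how a real LLM works internally – real models are neural networks trained on billions of documents – but it illustrates the point the lawsuits turn on: the system ingests example text and extracts reusable statistical patterns from it. The three corpus sentences are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Toy stand-in for "training data" collected from the web (invented sentences).
corpus = [
    "the court ruled that the use was fair",
    "the model was trained on copyrighted text",
    "the publisher sued the developer over the training data",
]

# "Training": count, for every word, which words follow it (a bigram model).
# Real LLMs learn far richer patterns, but the principle is the same -- the
# statistics come entirely from whatever text was fed in.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the continuation seen most often after `word` in the corpus."""
    counts = follow_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("trained"))  # -> 'on'
print(predict_next("court"))    # -> 'ruled'
```

Whatever such a model later produces is derived from the text it was fed – which is exactly why the provenance of that text matters.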
This becomes a problem when copyright issues come into play. The Times’ lawsuit, for example, says that Microsoft and OpenAI “seek to free-ride on The Times’ massive investment in its journalism by using it to build substitutive products without permission or payment.”
They are not the only ones who object to this approach. A group of 17 authors, including John Grisham and George R.R. Martin, filed a lawsuit against OpenAI in 2023, accusing it of “systematic theft on a mass scale”.
In its submission to the House of Lords, OpenAI does not even deny using copyrighted material. On the contrary, the company claims that it is all fair use – and that, in any case, it simply has no other choice!
“Because copyright today covers virtually every sort of human expression—including blog posts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” they wrote.
“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”
It is questionable how convincing this argument is. If someone were arrested for robbing a bank, I don’t think it would carry much weight with the police to explain that robbery was the only way to get the money they needed. That is admittedly a simplistic analogy, and it is entirely possible that OpenAI’s lawyers could successfully argue that the unlicensed use of copyrighted material to train LLMs falls within the bounds of fair use. Still, the justification for using copyrighted works without the original creators’ green light ultimately boils down to “but we really, really wanted to!”
Central to the fair use claim is the ChatGPT developer’s position that using copyrighted material does not actually break any rules. In its submission to the upper house, the company stated that “OpenAI complies with the requirements of all applicable laws, including copyright laws,” and it expanded on that point in an update released today.
“Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents,” OpenAI wrote. “We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”
“The principle that training AI models is permitted as a fair use is supported by a wide range of academics, library associations, civil society groups, startups, leading US companies, creators, authors, and others that recently submitted comments to the US Copyright Office. Other regions and countries, including the European Union, Japan, Singapore, and Israel also have laws that permit training models on copyrighted content—an advantage for AI innovation, advancement, and investment.”
OpenAI said in its House of Lords filing that it is “continuing to develop additional mechanisms to empower rightsholders to opt out of training,” and that it has struck deals with various organizations – such as the agreement it signed with the Associated Press in 2023 – which it hopes will “yield additional partnerships soon.”
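One opt-out mechanism already exists: OpenAI’s GPTBot web crawler, which the company says respects robots.txt directives. A publisher that does not want its pages collected for future training can block the crawler site-wide with an entry like the following (a generic example, not any particular site’s file):

```
# robots.txt -- tells OpenAI's GPTBot crawler not to fetch any page on this site
User-agent: GPTBot
Disallow: /
```

That only affects future crawling, of course; it does not remove anything already used for training.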
Let’s be honest: this sounds a lot like the “better to ask forgiveness than permission” principle. Perhaps agencies and publishers would be wiser to sign some kind of agreement now, rather than waiting for a court to decide whether AI companies can keep doing whatever they want…
Source: OpenAI, The Guardian
We build AI to empower people, including journalists.
Our position on the @nytimes lawsuit:
• Training is fair use, but we provide an opt-out
• "Regurgitation" is a rare bug we're driving to zero
• The New York Times is not telling the full story https://t.co/S6fSaDsfKb
— OpenAI (@OpenAI) January 8, 2024