Tuesday, July 23, 2024

The Generative AI Battle Has a Fundamental Flaw

Last week, the Authors Guild sent an open letter to the leaders of some of the world’s biggest generative AI companies. Signed by more than 9,000 writers, including prominent authors like George Saunders and Margaret Atwood, it asked the likes of Alphabet, OpenAI, Meta, and Microsoft “to obtain consent, credit, and fairly compensate writers for the use of copyrighted materials in training AI.” The plea is just the latest in a series of efforts by creatives to secure credit and compensation for the role they claim their work has played in training generative AI systems.

The training data used for large language models, or LLMs, and other generative AI systems has been kept secret. But the more these systems are used, the more writers and visual artists are noticing similarities between their work and these systems’ output. Many have called on generative AI companies to reveal their data sources, and—as with the Authors Guild—to compensate those whose works were used. Some of the pleas are open letters and social media posts, but an increasing number are lawsuits.

It’s here that copyright law plays a major role. Yet it is a tool that is ill equipped to tackle the full scope of artists’ anxieties, whether these be long-standing worries over employment and compensation in a world upended by the internet, or new concerns about privacy and personal—and uncopyrightable—characteristics. For many of these, copyright can offer only limited answers. “There are a lot of questions that AI creates for almost every aspect of society,” says Mike Masnick, editor of the technology blog Techdirt. “But this narrow focus on copyright as the tool to deal with it, I think, is really misplaced.”

The most high-profile of these recent lawsuits came earlier this month when comedian Sarah Silverman, alongside four other authors in two separate filings, sued OpenAI, claiming the company trained its wildly popular ChatGPT system on their works without permission. Both class-action lawsuits were filed by the Joseph Saveri Law Firm, which specializes in antitrust litigation. The firm is also representing the artists suing Stability AI, Midjourney, and DeviantArt for similar reasons. Last week, during a hearing in that case, US district court judge William Orrick indicated he might dismiss most of the suit, stating that, since these systems had been trained on “five billion compressed images,” the artists involved needed to “provide more facts” for their copyright infringement claims.

The Silverman case alleges, among other things, that OpenAI may have scraped the comedian’s memoir, The Bedwetter, via “shadow libraries” that host troves of pirated ebooks and academic papers. If the court finds in favor of Silverman and her fellow plaintiffs, the ruling could set new precedent for how the law views the data sets used to train AI models, says Matthew Sag, a law professor at Emory University. Specifically, it could help determine whether companies can claim fair use when their models scrape copyrighted material. “I'm not going to call the outcome on this question,” Sag says of Silverman’s lawsuit. “But it seems to be the most compelling of all of the cases that have been filed.” OpenAI did not respond to requests for comment.

At the core of these cases, explains Sag, is the same general theory: that LLMs “copied” authors’ protected works. Yet, as Sag explained in testimony to a US Senate subcommittee hearing earlier this month, models like GPT-3.5 and GPT-4 do not “copy” work in the traditional sense. Digest would be a more appropriate verb—digesting training data to carry out their function: predicting the best next word in a sequence. “Rather than thinking of an LLM as copying the training data like a scribe in a monastery,” Sag said in his Senate testimony, “it makes more sense to think of it as learning from the training data like a student.”

This is pertinent to fair use, the part of US copyright law that generally protects the unlicensed use of copyrighted works for things like scholarship and research. Because if the analogy is correct, then what’s going on here is akin to how a search engine builds its index—and there’s a long history of Google using exactly this argument to defend its business model against claims of theft. In 2006 the company defeated a suit from Perfect 10, an adult entertainment site, for providing hyperlinks and thumbnails of subscriber-only porn in its search results. In 2013 it convinced a New York court that scanning millions of books, and making snippets of them available online, constituted fair use. “In my view, Google Books provides significant public benefits,” US circuit judge Denny Chin wrote in his ruling. In 2014, a judge found in favor of HathiTrust Digital Library, a spinoff of Google Books, in a similar case.

Sag reckons that defendants in similar generative AI lawsuits will use a similar argument: Yes, data goes in, but what comes out is something quite different. Therefore, while it might seem commonsensical that a human reading and a machine “reading” are inherently different activities, it’s not clear the courts will see it that way. And there’s another question mark lingering over whether a machine can make a derivative work at all, says Daniel Gervais, a professor of intellectual property and AI law at Vanderbilt University in Nashville, Tennessee: The US Copyright Office maintains that only humans can produce “works.”

If the arguments from the defense hold, then there’s the matter of where those books came from. Several of the experts WIRED spoke to agree that one of the more compelling arguments against OpenAI centers on the secretive data sets the company allegedly used to train its models. The claim, appearing verbatim in both of the recent lawsuits, is that the Books2 data set, which the lawsuits estimate contains 294,000 books, must, by its very size, hold pirated material. “The only internet-based books corpora that has ever offered that much material are notorious ‘shadow library’ websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik,” the lawsuits claim.

The reason OpenAI would plunder pirated data is simple: These sites contain a bounty of the highest-quality writing, on a massive range of subjects, produced by a diverse range of authors. Sag argues that the use of copyrighted works such as books may have helped make LLMs “more well-rounded,” something that may have been difficult if, say, they were only trained on Reddit posts and Wikipedia articles.

There's no precedent in the US that directly links fair use with whether the copyrighted works were obtained legally or not. But, says Sag, there's also no stipulation that unlawful access is irrelevant in such cases. (In the European Union, it's stipulated that data-mining operations must get legal access to the information they use.)

One way to look at this problem is to claim that lawful access is irrelevant to inspiration, an argument Masnick recently made on Techdirt. “If a musician were inspired to create music in a certain genre after hearing pirated songs in that genre, would that make the songs they created infringing?” he wrote.

Earlier this year, the US Copyright Office launched an initiative to investigate AI issues. Masnick’s worry is that some stricter imagining of copyright infringement, aiming to rein in generative AI, could have an unintended chilling effect on creativity. “I fear that saying ‘we can’t learn from these other artists without compensating them,’ creates really big problems for the way that that art is created and the way that content creators learn,” he says. “The normal way that content creators of all stripes become their own content creators is they see someone else and they are inspired by them.”

On the other hand, if someone spends years writing a novel, shouldn't copyright ensure that they are compensated if someone else uses their works for commercial purposes? “You could frame this as undermining the incentives of the copyright system,” says Sag. Simply put, if generative AI systems can scrape copyrighted works without compensating writers and churn out something in a similar style, does that lower the incentives for people to create such works in the first place?

These lawsuits, even if they are unsuccessful, are likely to provoke generative AI companies into taking steps to avoid them. These steps are unlikely to make happy reading for artists. These firms could, for example, obtain licensing agreements to use copyrighted works in their training data. This would be analogous to how, say, Spotify licenses music—albeit on controversial terms—in a way the original version of Napster didn’t. Drake, for example, could license out his discography so fans can conjure Drake-like AI croonings of their own.

Another possible future sees artists asked to opt in to allowing their work to be used as training data. Roblox, which has been cautious with its in-house tools, is considering a model like this for content made by its users, while Adobe has been similarly careful with Firefly, training it on Adobe Stock images and licensed and public domain content. The Associated Press also recently announced a deal to license its news stories to OpenAI.

Ultimately, though, the technology is not going away, and copyright can only remedy some of its consequences. As Stephanie Bell, a research fellow at the nonprofit Partnership on AI, notes, setting a precedent where creative works can be treated like uncredited data is “very concerning.” To fully address a problem like this, the regulations AI needs aren't yet on the books.
