Press "Enter" to skip to content
WhatsApp Group Join Now
Telegram Group Join Now

Anthropic destroyed millions of physical books to train its AI, court documents reveal

WTF?! Generative AI has already faced sharp criticism for its well-known issues with reliability, its massive energy consumption, and the unauthorized use of copyrighted material. Now, a recent court case reveals that training these AI models has also involved the large-scale destruction of physical books.

Buried in the details of the recent split ruling in the copyright case against Anthropic is a surprising revelation: the generative AI company destroyed millions of physical books by cutting off their bindings and discarding the remains, all to train its AI assistant. Notably, this destruction was cited as a factor that tipped the court's fair-use analysis in Anthropic's favor.

To build Claude, its language model and ChatGPT competitor, Anthropic trained on as many books as it could acquire. The company purchased millions of physical volumes and digitized them by tearing out and scanning the pages, permanently destroying the books in the process.

Furthermore, Anthropic has no plans to make the resulting digital copies publicly available. This detail helped convince the judge that digitizing the purchased books constituted a sufficiently transformative use to qualify as fair use. While Claude presumably uses the digitized library to generate original content, critics have shown that large language models can sometimes reproduce verbatim material from their training data.

Anthropic's partial legal victory now allows it to train AI models on lawfully purchased copyrighted books without notifying the original publishers or authors, potentially removing one of the biggest hurdles facing the generative AI industry. A former Meta executive recently admitted that AI would "die overnight" if required to comply with copyright law, likely because developers wouldn't have access to the vast data troves needed to train large language models.

Still, ongoing copyright battles continue to pose a major threat to the technology. Earlier this month, the CEO of Getty Images acknowledged the company couldn’t afford to fight every AI-related copyright violation. Meanwhile, Disney’s lawsuit against Midjourney – where the company demonstrated the image generator’s ability to replicate copyrighted content – could have significant consequences for the broader generative AI ecosystem.

That said, the judge in the Anthropic case did rule against the company for partially relying on libraries of pirated books to train Claude. Anthropic must still face a copyright trial in December, where it could be ordered to pay up to $150,000 per pirated work.
