Today, we’re looking at a striking metaphor for how the AI industry is aggressively damaging the arts: the way Anthropic collected data to train its Claude AI model.
According to a report from Ars Technica, the Google-backed startup didn’t just ingest the text of millions of copyrighted books, which is already a fraught legal question.
It physically cut the pages out of books, scanned them into digital files, and then threw the originals away. Saying that the AI “devoured” these books is not just a figure of speech; it’s close to literal.

This method was disclosed in a recent copyright ruling, which turned out to be a significant victory for Anthropic and the tech industry that craves data.
The judge, William Alsup, determined that Anthropic is allowed to train its large language models using books they legally purchased, even without getting permission from the authors.
This decision is partly due to Anthropic’s destructive scanning method, which, while not new, is remarkable for its scale.
It leans on a legal principle known as the first-sale doctrine, which lets the owner of a lawfully purchased copy resell, lend, or even destroy it without the copyright holder’s permission.
This rule is what makes the secondhand market possible; without it, publishers could block resale or demand a cut of the proceeds.
AI companies, however, seem to misuse this principle. Anthropic hired Tom Turvey, who previously led Google’s book-scanning project, to help them acquire “all the books in the world” without facing legal hurdles.
Turvey devised a strategy: by purchasing physical books, Anthropic could rely on the first-sale doctrine and avoid needing licenses. By tearing out the pages, they made scanning cheaper and easier.
Because Anthropic kept only the scans and discarded the originals, the judge characterized the destruction as a space-saving format change, a conversion he found transformative and therefore lawful.
This workaround is questionable and hypocritical. Initially, Anthropic resorted to downloading millions of pirated books to feed its AI. Meta also engaged in similar practices and is currently facing lawsuits from authors over this issue.
Moreover, this approach is careless. As The Atlantic points out, archivists have long since developed methods to scan books at scale without damaging the originals, as the Internet Archive and Google Books have done, even if Google Books faced copyright challenges of its own not long ago.
In pursuit of saving money and obtaining valuable training data, the AI industry seems to be running low on quality sources. This has led to the unfortunate decision to harm authors and destroy books, which, for Big Tech, appears to be a minor cost.