Sunday, May 12, 2024
Courthouse News Service

Microsoft, Meta and Bloomberg accused of using pirated books in AI development

The tech giants used a dataset containing thousands of copyrighted books to train their artificial intelligence programs, a class of authors says.

MANHATTAN (CN) — A group of authors sued Microsoft, Meta and Bloomberg, joining a raft of legal actions that seek to stop authors' works from being used in the development of AI technology.

The class action, filed late Tuesday in the U.S. District Court for the Southern District of New York, is led by several authors, including former Arkansas Governor and presidential candidate Mike Huckabee, who say their copyrighted work was included without their permission in datasets used for AI technology development.

According to the complaint, the technology and media companies trained their AI systems on a plaintext dataset known as Books3, which contained data scraped from approximately 197,000 nonfiction books and novels published in the last 20 years.

The dataset was originally compiled by independent developer Shawn Presser, along with a team of other developers, to allow any developer to create generative-AI tools.

This is often done through Large Language Models, or LLMs, which are AI systems designed to understand and generate human language.

The authors’ attorneys say LLMs are a popular tool for companies developing AI technology.

“Developing LLMs not only increases profits by allowing companies to make new and personalized offerings to their customers, but they also save companies money by reducing their reliance on a human workforce,” the authors say in the complaint.

These models are trained on vast datasets containing text from the internet, books, articles and other sources.

Books3 was later incorporated into a larger dataset called “The Pile,” which was then hosted on the internet by EleutherAI, a self-described grassroots collective of natural language processing researchers.

EleutherAI, also named in the complaint, hosted the dataset as a “free, open-source data set for the training of LLMs.”

Meta’s own tool, known as Large Language Model Meta AI, or LLaMa, used “the Pile” in its original dataset but, according to the complaint, the company did not acknowledge it contained copyrighted works.

The original release of LLaMa was powerful, according to the complaint, and Meta announced in July 2023 that it would partner with Microsoft specifically to compete with OpenAI’s ChatGPT.

In August 2023, Meta released LLaMa 2, a new version of the model with several improvements, but gave no indication that “the Pile” dataset had been left out.

“Microsoft and Meta benefitted greatly from prior iterations of the LLM using Books3 and information from Pile because they did not have to spend additional time, money, and resources to train an LLM from scratch with their own content or properly licensed content,” the authors’ attorneys say in the complaint.

After the release of OpenAI’s GPT-3, Bloomberg similarly began working on its own LLM, which it announced this past March.

The authors say Bloomberg also used “Books3” to train its LLM to “recognize, parse, and respond in natural language.”

Shortly after the announcement, the authors say, Bloomberg said it would not use the dataset in future versions of “BloombergGPT,” a statement the authors called hollow.

“Plaintiffs’ copyright-protected works have been baked into Bloomberg’s LLM, and all subsequent versions: it cannot simply weed out the benefit it has illegally gained from scanning the text of the copyright-protected works,” they said.

Because the dataset was used in the initial development of these LLMs, the plaintiffs say, the copyright-protected works “now serve as a baseline for all future LLM models.”

They add: "None of the defendants sought or obtained licenses to use the copyrighted works from Books3, and the defendants knew that the parties responsible for assembling Books3 were not licensed, or otherwise legally permitted, to disseminate those works. Microsoft, Meta, and Bloomberg chose to train their LLMs using pirated and stolen works for the purpose of making a profit. Accordingly, plaintiffs and the class were injured, and are entitled to damages."

The authors' claims include direct and vicarious copyright infringement, conversion and negligence. They seek class certification, an order barring the defendants from using their works to train AI, and damages.

They are represented by Greg Gutzler of the New York firm DiCello Levitt as well as attorneys from the Arkansas firms RMP and Poynter Law Group.

Follow @NikaSchoonover
Categories / Courts, Technology
