John Carreyrou Sues OpenAI, Google, xAI Over LLM Training

The short version

On December 22, 2025, New York Times investigative reporter and Bad Blood author John Carreyrou filed a copyright lawsuit in the U.S. District Court for the Northern District of California against Anthropic, Google, OpenAI, Meta, Elon Musk’s xAI, and Perplexity. The six named plaintiffs say the companies trained and/or optimized their large language models using “pirated” copies of books from shadow libraries like LibGen and Z-Library, without permission or payment. The case—Carreyrou v. Anthropic PBC et al., No. 3:25‑cv‑10897—seeks statutory damages and an injunction, and notably is not a class action. Reuters | Complaint (PDF)

$1.5B

Largest AI‑training copyright settlement. Source: Reuters, Dec 18, 2025

What’s new—and why it matters

First suit to name xAI: Coverage of the filing emphasizes this is the first copyright case to list xAI as a defendant, reflecting how quickly newcomers are being pulled into the same legal thicket as incumbents. Reuters
Not a class action: Carreyrou and five other writers (Lisa Barretta, Philip Shishkin, Jane Adams, Matthew Sacks, and Michael Kochin) are pursuing individual claims, arguing that class deals undervalue authors’ rights. They point to an estimated $3,000 per work in the Anthropic class settlement, which they describe as “about 2%” of the $150,000 per‑work statutory maximum for willful infringement. Complaint | Reuters
The allegation: defendants downloaded and copied “gold‑standard” book content from shadow libraries (LibGen, Z‑Library, OceanofPDF, and Books3) to train or tune chatbots and related products. Complaint

The legal landscape the case drops into

Two Northern District of California decisions from mid‑2025 set important (if contested) waypoints:

Kadrey v. Meta (June 25, 2025): Judge Vince Chhabria found Meta’s book‑based training of Llama to be “transformative” fair use on the record before him, while stressing the ruling didn’t bless AI training in general and leaving other claims (like alleged torrenting) unresolved. CNBC | The Verge
Bartz v. Anthropic (ruling June 2025; settlement proceedings Oct–Dec 2025): The court indicated that while some training can be fair use, downloading from pirate sources could still matter for liability or damages. The parties later reached a $1.5B class settlement, which has since been scrutinized over fees. Reuters

At the same time, publishers’ cases continue. The New York Times’ lawsuit against OpenAI and Microsoft survived a motion to dismiss largely intact in March 2025, keeping fair‑use questions headed toward trial in New York. Associated Press via Inquirer | OpenAI’s case page

What the companies are likely to argue

Most AI developers maintain that training on publicly available materials is fair use, often highlighting opt‑outs or licenses:

OpenAI says training on publicly available data is fair use and points to publisher opt‑outs it honors. OpenAI
Google highlights its Google‑Extended robots.txt control to let sites opt out of AI training for Gemini/Vertex. Google Blog
Meta has asserted a fair‑use defense in author suits and won a key ruling in June 2025. CNBC
Perplexity told Reuters it “doesn’t index books,” while facing separate publisher suits over news content and RAG‑style outputs. Reuters | Loeb & Loeb case note
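
As a rough illustration of how a publisher-side opt-out such as Google‑Extended reads to a compliant crawler, the sketch below parses a hypothetical robots.txt with Python's standard urllib.robotparser module. The file contents, the example.com URL, and the "SomeOtherBot" agent name are assumptions made for illustration. The sketch shows only how the rule is interpreted; it says nothing about how any defendant actually handled such signals.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google-Extended, allow everyone else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/books/chapter-1"  # illustrative URL, not a real page

# True when the Google-Extended token is told not to use this path.
blocked_for_ai = not parser.can_fetch("Google-Extended", url)
# True when an unrelated (made-up) crawler is still allowed.
open_to_others = parser.can_fetch("SomeOtherBot", url)

print(blocked_for_ai, open_to_others)
```

A control like this is voluntary on the crawler's side, which is one reason it figures in the companies' public positions rather than settling the fair-use question.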

Case details at a glance

Carreyrou v. Anthropic PBC et al. — snapshot

Field	Detail
Court	U.S. District Court, Northern District of California
Case	Carreyrou v. Anthropic PBC et al.
Number	3:25‑cv‑10897
Filed	December 22, 2025
Plaintiffs	John Carreyrou; Lisa Barretta; Philip Shishkin; Jane Adams; Matthew Sacks; Michael Kochin
Defendants	Anthropic; Google; OpenAI (and affiliates); Meta; xAI; Perplexity
Core claim	Direct copyright infringement (17 U.S.C. §501) for copying/using books from shadow libraries to train or optimize LLMs
Relief sought	Statutory damages (up to $150,000 per work, per defendant, for willful infringement), injunction, fees
Notable	Plaintiffs declined class treatment, citing low per‑work payouts in other settlements
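
To see why the plaintiffs prefer individual claims, it can help to work through the ceiling implied by the relief sought. The sketch below assumes the complaint's per-work, per-defendant theory holds and that every defendant is found liable for willful infringement at the maximum, neither of which is certain. The function name and the work counts are illustrative and not taken from the filing, and OpenAI and its affiliates are counted as one defendant.

```python
STATUTORY_MAX_WILLFUL = 150_000  # dollars per work, per the relief sought
NUM_DEFENDANTS = 6  # Anthropic, Google, OpenAI, Meta, xAI, Perplexity


def max_exposure(num_works: int, num_defendants: int = NUM_DEFENDANTS) -> int:
    """Upper bound on statutory damages under the plaintiffs' theory."""
    return num_works * num_defendants * STATUTORY_MAX_WILLFUL


print(max_exposure(1))   # 900000
print(max_exposure(10))  # 9000000
```

These are ceilings, not predictions: statutory awards are discretionary within the range, and a fair-use finding would eliminate them for the conduct it covers.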

What this could mean for AI builders and buyers

Beyond the headline, the complaint challenges the “data supply chain” that underpins generative AI. If courts accept that sourcing from shadow libraries reflects willful infringement—even where certain training might be deemed “transformative”—exposure could multiply across models, checkpoints, and products. That risk profile encourages:

Verified licensing and provenance for book‑length text, not just web crawl data.
Documentation of training inputs, fine‑tunes, and retrieval indexes separate from “core” pretraining.
Product‑level mitigations against regurgitation and quotation of long passages, which can weigh against fair use.
Broader adoption of machine‑readable control/consent signals (e.g., Google‑Extended) and emerging licensing standards.
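
One way to picture the documentation point above is a per-dataset provenance record that keeps licensing status and pipeline stage together. This is a minimal sketch with hypothetical field names and made-up example values; it does not reflect any defendant's tooling or any established standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    PRETRAINING = "pretraining"
    FINE_TUNE = "fine_tune"
    RETRIEVAL_INDEX = "retrieval_index"


@dataclass(frozen=True)
class ProvenanceRecord:
    dataset: str                # internal dataset name (hypothetical)
    source: str                 # where the text was obtained
    license_ref: Optional[str]  # license or agreement ID; None if unverified
    stage: Stage                # where in the pipeline the data is used

    @property
    def verified(self) -> bool:
        return self.license_ref is not None


def unverified(records: List[ProvenanceRecord]) -> List[ProvenanceRecord]:
    """Return the records that lack a documented license."""
    return [r for r in records if not r.verified]


# Made-up example entries.
records = [
    ProvenanceRecord("licensed-books-v1", "publisher agreement", "LIC-0001", Stage.PRETRAINING),
    ProvenanceRecord("misc-books-v0", "unknown mirror", None, Stage.FINE_TUNE),
]
flagged = unverified(records)
print([r.dataset for r in flagged])  # ['misc-books-v0']
```

Tracking the stage alongside the source matters here because the complaint targets optimization steps as well as core training.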

Who’s sued and what they say (so far)

Defendants and public positions

Company	Named products	Allegations in complaint	Public response in this case
OpenAI	GPT‑4/4o, ChatGPT, API	Copied books from shadow libraries for training/optimization	No immediate comment on the filing; says training on public materials is fair use and offers publisher opt‑outs. Reuters
Google	Gemini/Vertex	Use of Z‑Library/OceanofPDF‑derived datasets	No immediate comment; points to Google‑Extended opt‑out control for publishers. Google Blog
Meta	Llama	Trained on “shadow library” books	No immediate comment; previously prevailed on a fair‑use ruling in author suit. CNBC
Anthropic	Claude	Datasets with “hundreds of thousands” of pirated books	Settlement in separate class case; denies wrongdoing. Reuters
xAI	Grok	Large‑scale ingestion allegedly includes books	No immediate comment; says training data are filtered and not referenced post‑training. xAI FAQ
Perplexity	Answer engine (RAG)	Reliance on pirated books; output substitutes for works	Spokesperson: “doesn’t index books”; faces other news‑publisher suits. Reuters

The open questions a jury could decide

Does the source of training data matter? Early rulings suggest some training may be fair use, but sourcing from pirate sites could still affect liability and any enhanced damages for willful infringement.
How much regurgitation is too much? Academic work continues to probe when models memorize and emit protectable text, a factor that can weigh against fair use. (See, e.g., emerging memorization studies from 2025.)
What counts as “optimization” vs. “training”? The complaint also targets ingestion steps like preprocessing and retrieval‑augmented generation caches.
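
One crude way to reason about "how much regurgitation" is to measure the longest run of consecutive words that a model output shares with a reference text. The sketch below is an illustrative heuristic only: the function name is made up, it is not the method of any particular memorization study, and no word count it produces corresponds to a legal threshold. The sample text is a public-domain line from Dickens.

```python
def longest_shared_run(output: str, reference: str) -> int:
    """Length, in words, of the longest contiguous word sequence in both texts."""
    a = output.lower().split()
    b = reference.lower().split()
    best = 0
    # prev[j] = length of the shared run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


reference = "it was the best of times it was the worst of times"
output = "the model wrote that it was the best of times indeed"
run = longest_shared_run(output, reference)
print(run)  # 6  ("it was the best of times")
```

A product-level filter might compare such a score against an internal limit before returning text, but where to set that limit is exactly the kind of question the litigation leaves open.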

What to watch next

Motions to dismiss/transfer: Expect threshold challenges and early fights over discovery scope and dataset disclosures.
Interplay with other cases: The Northern District of California’s Meta and Anthropic rulings will loom large; the New York Times case will shape the narrative on news content.
The market response: More publishers are deploying robots.txt controls (e.g., Google‑Extended) and exploring licensing frameworks, while model providers weigh re‑training costs versus settlement risk. Google Blog

Sources

Reuters: “New York Times reporter sues Google, xAI, OpenAI over chatbot training” (Dec 22, 2025).
Complaint, Carreyrou v. Anthropic PBC et al., 3:25‑cv‑10897 (N.D. Cal. Dec 22, 2025).
Reuters: “Anthropic asks judge to slash legal fees in $1.5 billion settlement” (Dec 18, 2025).
CNBC: “Judge rules Meta’s use of books to train AI is fair use” (June 25, 2025).
The Verge: Analysis of Meta ruling and fair use caveats (June 2025).
OpenAI: “OpenAI and journalism” (policy/stance on training and opt‑outs).
Google Blog: “An update on web publisher controls” (Google‑Extended).
Loeb & Loeb: Dow Jones & Co. v. Perplexity AI (case summary).