A few months back, a group of authors filed a class action lawsuit against Salesforce in federal court in the Northern District of California, alleging that the company used pirated copies of copyrighted books to train its large language models and power commercial AI products, including Agentforce. The complaint highlights a growing legal risk for companies developing or deploying generative AI: decisions about training data sources can carry significant legal and business consequences.
For business leaders, the case underscores a central point. Intellectual property law still applies, even when innovation moves faster than regulation.
What the Lawsuit Claims
The lead plaintiff, author Tasha Alexander, alleges that Salesforce willfully infringed registered copyrights by downloading, copying, storing, and using books to train its CodeGen and xGen series models. According to the complaint, Salesforce relied on datasets that included “shadow library” material, such as the Books3 dataset, which has been widely associated with pirated books sourced from unauthorized repositories.
The authors allege that Salesforce neither sought permission nor paid licensing fees for the use of their works. They further claim that Salesforce’s AI products can generate content that displaces work authors would otherwise be paid to create, diminishing the market for original writing and disrupting emerging licensing markets for AI training data.
Why Training Data Is the Legal Flashpoint
At the heart of this case is a question that appears in many AI-related copyright disputes: what copies were made, and how were they obtained?
The complaint emphasizes that training large language models requires copying entire works or substantial portions of them. The plaintiffs argue that Salesforce could have lawfully licensed or purchased the books, but instead obtained them through datasets derived from pirated sources. That distinction matters because courts have increasingly suggested that acquiring training data through piracy weakens any later claim that the use is fair or transformative.
The complaint also points to Salesforce’s public descriptions of its training data, alleging that references to specific datasets were later replaced with broader language such as “publicly available sources.” Whether that shift becomes legally significant will depend on what evidence emerges during discovery.
Fair Use and Market Harm
Salesforce is likely to argue that using copyrighted works to train AI models is transformative and qualifies as fair use. The plaintiffs strongly dispute that characterization. They contend that copying entire books to build a centralized internal library is a commercial act that is not transformative, particularly when the books were obtained from pirated sources.
The plaintiffs also focus on market harm. Their theory is not limited to whether AI outputs reproduce copyrighted text verbatim. Instead, they argue that generative AI systems can produce large volumes of substitute content at a fraction of the cost, thereby displacing demand for human-authored works and undermining legitimate licensing markets.
Practical Takeaways for Teams Building or Buying AI
This lawsuit offers several concrete lessons for companies working with generative AI:
Data provenance is a legal and business issue. Companies should understand exactly where their training data originated and whether it was lawfully obtained.
Licensing strategies matter. Courts and plaintiffs are paying close attention to whether licensing options were available and ignored.
Transparency creates accountability. Public statements about training data and AI practices can become evidence in litigation.
Upstream decisions have downstream consequences. Even if models do not store readable copies of books, the training process itself may involve legally significant reproduction.
Are the AI Winds Shifting?
The Salesforce class action reflects a broader shift in how courts, creators, and regulators are approaching AI development. As generative AI becomes embedded in commercial products, intellectual property compliance is no longer a background concern. It is a core business risk that requires deliberate strategy, governance, and oversight.
For companies investing in AI, the message is clear: innovation and compliance must move together. Treating copyrighted content responsibly is not just about avoiding lawsuits. It is about building sustainable, defensible products in a rapidly evolving legal landscape.
Have you crafted a sustainable AI strategy? We can help.
