When it comes to hot-button issues at the intersection of technology and intellectual property, artificial intelligence has taken the cake in the past few years. This blog has already featured several news stories about lawsuits against OpenAI over ChatGPT and Copilot. Now, we add another lawsuit to the pile: the Center for Investigative Reporting (CIR) and Mother Jones have sued OpenAI and Microsoft for copyright infringement.
The Claims & the Dangers
Not surprisingly for a journalistic nonprofit, CIR’s complaint lays out in great detail both its copyright objections and its broader social concerns about OpenAI’s training methods. CIR takes issue with OpenAI training its large language models (LLMs) on CIR’s copyrighted work without permission.
OpenAI feeds large volumes of publicly accessible online text to the LLMs that power ChatGPT, training them to predict the most likely next word in a sequence given the words that came before. For example, if you tell ChatGPT to write a poem in the style of Emily Dickinson, it draws on the word patterns it learned from her works during training and puts out a string of words that would be likely to occur in one of her poems. Over time, the results have become startlingly resonant.
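To make the "predict the next word" idea concrete, here is a deliberately tiny sketch in Python. It is not how ChatGPT works internally: real LLMs use neural networks trained on enormous amounts of text, while this toy simply counts which word most often follows another. The corpus and the function names `train_bigram_model` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus, invented for this example.
corpus = "the court heard the case and the court ruled on the motion"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))        # "court" follows "the" most often
print(predict_next(model, "plaintiff"))  # None: the word never appeared in training
```

Even at this scale, the sketch shows why training data matters to the dispute: the model can only produce word sequences that reflect the text it was fed.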
However, news outlets and authors suing OpenAI have pointed to what they describe as rampant plagiarism in ChatGPT’s outputs, as evidenced by Exhibit J in the New York Times lawsuit, titled “ONE HUNDRED EXAMPLES OF GPT-4 MEMORIZING CONTENT FROM THE NEW YORK TIMES.”
In the case of Mother Jones, the magazine began printing in 1976, moved into online publishing in 1993, and merged with CIR in early 2024. Taking up the mantle of this lawsuit for both organizations, CIR alleges that OpenAI included Mother Jones’s online articles in its training data sets, which OpenAI assembled under the moniker WebText. To voice CIR’s central claim straight from the complaint:
“The OpenAI Defendants have published a list of the top 1,000 web domains present in the WebText training set and their frequency. According to that list, 16,793 distinct URLs from Mother Jones’s web domain appear in WebText.”
OpenAI did not license these articles from CIR or Mother Jones.
CIR draws a clear connection between intellectual property and its ability to continue supporting the center’s endeavors. Monika Bauerlein, the CEO of CIR, voiced deep concern that, without the exclusivity of copyright protection or even overt citation of sources in ChatGPT’s answers, OpenAI could “cut the entire foundation of our existence as an independent newsroom out from under us.”
In other words, publications need relationships with their readers to keep the lights on through subscriptions, advertising, and donations. ChatGPT intercepts and obscures that relationship.
In contrast, some news organizations, like Time and the Associated Press, have decided to negotiate deals with OpenAI, licensing their archives in exchange for compensation.
OpenAI Calls Fair Use
OpenAI’s response to this specific lawsuit has yet to surface, but its general rebuttal to lawsuits from news publications and authors has leaned on the Fair Use Doctrine.
Fair Use allows some copyrighted works to be used without payment or permission, typically in non-commercial circumstances that serve the wider good. For instance, Fair Use can allow teachers to photocopy pages from a novel to share the prose with their classrooms. Copying that would otherwise infringe is permitted because it serves an educational purpose.
Fair Use also protects commentary and parody, so that creators can reference and discuss copyrighted works without needing to acquire licenses. As one can imagine, copyright holders might not readily agree to a license if they know the purpose is to criticize or satirize their work.
Just as Fair Use softens the rigidity of copyright law, the doctrine itself is open to interpretation. Judges use their own professional judgment to determine if Fair Use claims are legitimate or not, assessing four factors:
- Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes: Is the defendant making money from their infringing activities?
- Nature of the copyrighted work: How much is the infringed-upon work a unique, creative expression?
- Amount and substantiality of the portion used in relation to the copyrighted work as a whole: Did the infringer use just a smidge of the work to get their point across, or are they quoting so much that it begins to border on plagiarism?
- Effect of the use upon the potential market for or value of the copyrighted work: Does the infringing work draw consumers away from the copyrighted work, reducing its sales?
The sticking point for OpenAI may be the commercial nature of ChatGPT. While it is true that OpenAI began as a nonprofit in 2015, it created OpenAI LP in March 2019, which is undeniably a for-profit institution, with revenues reported in the billions.
The Future Relationship of AI and Publications
From these facts and claims, several questions hang in the balance: Will the judge in the Southern District of New York see OpenAI’s early research as protected by Fair Use? Or will the company’s later profitability eclipse the original nonprofit’s character of use?
And could authors and news outlets find ways to collaborate with AI tech companies rather than fight them – or have their relationships been too strained by tech’s propensity to ask for forgiveness rather than permission?
I cannot claim to know all the answers to these important questions, but I will be keeping a sharp eye on the courts as multiple AI cases play out.