CAN AI GO TO JAIL? THE NEW YORK TIMES AND AI COPYRIGHT

YOU’VE BEEN SERVED

On December 27th, amid the lull between the holidays, the New York Times announced they were suing OpenAI (the creators of ChatGPT) and Microsoft (which has a complex partnership agreement with OpenAI as well as its own AI, Copilot). The alleged crime? Well, the Gray Lady contends that the defendants used their articles to train generative AI that now directly competes with the NYT, and that as a result they have suffered billions of dollars in damages. In this article, we’ll go through this landmark case and the questions surrounding AI and copyright.

Note: unless otherwise noted, quotes come from the NYT’s complaint, accessible here.

BACKGROUND: HOW DO LLMS WORK?

The NYT’s allegations have to do with the nature of LLMs—large language models, the category that both ChatGPT and Copilot belong to. LLMs are “trained” on massive bodies of text, from which they absorb both information and an understanding of how real people write. Once content enters the training data, certain prompts can cause the model to reproduce it nearly verbatim, in a phenomenon called memorization. Most training datasets are now proprietary, meaning that their exact contents are confidential to everyone except the companies that own them.
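To make “memorization” concrete, here is a toy sketch. Real LLMs are neural networks with billions of parameters, and this tiny lookup-table model is nothing like them internally, but it illustrates the same surface behavior the complaint describes: a model that learns which text follows which context will, when prompted with a familiar prefix, happily regurgitate its training text word for word. All names and the sample text here are invented for illustration.

```python
from collections import defaultdict

def train(text, context_len=8):
    """Map each context of `context_len` characters to the characters
    that followed it in the training text."""
    model = defaultdict(list)
    for i in range(len(text) - context_len):
        context = text[i:i + context_len]
        model[context].append(text[i + context_len])
    return model

def complete(model, prompt, length=40, context_len=8):
    """Greedily extend the prompt, always picking the most common
    next character seen during training."""
    out = prompt
    for _ in range(length):
        candidates = model.get(out[-context_len:])
        if not candidates:
            break
        out += max(set(candidates), key=candidates.count)
    return out

# A stand-in "training corpus": one repeated slogan.
article = "All the News That's Fit to Print. " * 3
model = train(article)

# Prompted with a familiar prefix, the model reproduces its
# training text verbatim rather than writing anything new.
print(complete(model, "All the "))
```

The point of the sketch is that memorization is not a deliberate feature: it falls out of any system that predicts continuations from training data, which is why specific prompts (like the paywall trick described below) can surface training material.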

Now, NYT articles are easily accessible with a subscription. It would be simple to include them in a dataset, and doing so would probably be a boon to any AI: the model would get content written to professional standards plus a lot of recent news. The complaint alleges not only that NYT articles were included in the training sets, but that developers “gave Times content particular emphasis when building their LLMs.”

FACTS OF THE CASE

This case is built on the idea that OpenAI scraped NYT content at a high level. One piece of evidence comes from what is publicly known about ChatGPT’s training sets. For example, GPT-3 was built from five datasets, several of which contain a high volume of NYT articles and related links. More evidence comes from memorization: with the right prompting, GPT-4 could be made to recite large portions of NYT articles without any attribution. In another case, by telling the AI that they were blocked by the paywall on an NYT article, a user was able to get it to provide direct excerpts from that article. All of this strongly suggests that ChatGPT was trained on a large quantity of NYT content.

What could be worse than directly retrieving NYT material? Retrieving false NYT information. In other cases, when users requested excerpts from specific articles, Bing Chat made up information, quotes, and entire paragraphs that do not exist. Aside from the fact that the NYT never published this material, some of it is demonstrably false, such as the claim that orange juice is linked to non-Hodgkin’s lymphoma.

LEGALESE AND CONSEQUENCES

As you might expect, the NYT is not happy about their works being retrieved for free (or false works being attributed to them). Articles are locked behind a subscription paywall, and third parties pay even greater fees to host Times content (sometimes thousands of dollars for a year of a single article!). OpenAI and Microsoft, if the allegations in the complaint are true, have bypassed copyright and created applications that both draw on the Times’s works and provide them to other people.

One element of the damages is proof that OpenAI and Microsoft have profited from the possible infringement. Both companies have seen massive revenues from their AI efforts and are on track to make even more as time goes on. Attributing all of that to NYT content alone would be hard, but there’s a case to be made that these articles are a key part of the models’ informational background and writing style. The complaint also stresses that OpenAI and Microsoft are partners in this enterprise, a relationship too complex to go into here but interesting nonetheless.

What does the NYT want to happen? Right now, the damages they’re seeking are unspecified, but given the players involved, we can imagine it would be a lot of money. What’s more interesting are the other requests: not only do they want a permanent ban on the unauthorized use of their work for LLMs, they also demand the “destruction” of all LLMs that use their content. This would be, to say the least, earth-shattering. Most of the prominent LLMs would have to shut down or vastly change their datasets to eliminate any trace of NYT information.

This case will likely not be resolved for a long time, but the decision will become some of the first case law on AI and copyright infringement. You’ll want to pay attention: one day, kids might be learning about this in AP Gov!