AI Training and Fair Use: The Debate
A balanced view of the debate, the law, and where this could all lead
Fair use is a famously vague area of copyright law, with circuit judges calling the area “unprincipled and unpredictable.”1 Generally, fair use is judged by a four-factor test comprising: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion taken, and (4) the effect of the use upon the potential market. This test was originally formulated by Justice Joseph Story in Folsom v. Marsh, 9 F. Cas. 342 (C.C.D. Mass. 1841), but has since been codified in the Copyright Act of 1976 under 17 U.S.C. § 107, with a mandate from Congress that fair use should “adapt” to “rapid technological change.” As you can see, all of these factors are subjective, speculative, or both. This can lead to surprising results: collage is not inherently fair use, while fan fiction usually is. To add to the complexity, certain types of fair use, like parodies and databases, have developed their own doctrine.
The result has been a large amount of litigation against AI companies. Sarah Silverman has sued Meta, the New York Times has sued OpenAI, and many other lawsuits of similar consequence have been filed even if they failed to get mainstream attention. The companies behind these systems are concerned enough that some, like Microsoft, are even starting to indemnify customers against any copyright risk associated with the technology. So this week in Nonobvious, we are going to dive into these cases and give a balanced overview of how the AI fair use issue could turn out, given the unpredictability of fair use.2
Is Learning Fair?
When it comes to training and fair use, there are three distinct issues:
Whether the copying of certain texts to train on is infringement
Whether the training and use (“inference”) of models are infringing uses
Whether the entity responsible is the foundational model provider or the user
Much ink has been spilled on fair use doctrine generally. My personal favorite article on the topic as applied to machine learning is Fair Learning by Mark Lemley and Bryan Casey in the Texas Law Review, which analyzes the fair use landscape and argues not only that machine learning is likely fair use under existing law but also that this is the better policy. I won’t go into the four factors and write a mock opinion, or review a particular case (like Cecilia Ziniti did for NYT v. OpenAI). Instead, this week’s Nonobvious will dive into these three issues, highlighting relevant cases and analyzing the fair use factors that are most likely to matter for each.3
Issue 1: Training
The first issue is whether literal copying for the purpose of training a model is copyright infringement. Courts have held in the past that literal copying, even of substantial portions of copyrighted material, is permissible if it is an intermediate step towards the creation of a transformative product. The core holding here is Sega Enterprises, Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992), which held that reverse-engineering code was not copyright infringement even though it meant that Accolade copied significant parts of Sega’s code in order to create software compatible with Sega’s console. Notably, this case is from the Ninth Circuit, where a majority of cases involving large language models, or LLMs, have been filed. Similarly, changing the format of a work, like ripping a CD to an MP3, has been held to be “space shifting” and not infringing, as in Recording Indus. Ass’n of Am. v. Diamond Multimedia Sys., Inc., 180 F.3d 1072 (9th Cir. 1999).
Note here that, to me, it is exceedingly obvious that training AI models is a transformative use, so I am not going to dwell on whether training itself infringes. The main question for this issue is whether a license was required to access or use the training material in the first place.
There are also other doctrines that apply here. For example, most major AI models have trained on publicly available material, text the companies themselves own, and public databases containing legally obtained works, for which the right of first sale likely applies (or where, at the very least, the tech companies are not the original infringer); in other words, for much of the training material, there is no copyright infringement to be found separate from the question of fair use. The fundamental point is that copyright is about controlling the copying and distribution of content, not its use once a copy has been sold. Controlling use is more like the negative right of a patent, and while copyright holders like news outlets may dislike those policy implications, that is what copyright law has been about for hundreds of years.
Of all the issues, this is likely the one with the least commercial relevance. Training is a one-off event with respect to any given piece of training data. Once a machine learning model is trained, it is then used for inference. Furthermore, actual damages would likely be quite low given the low cost of the relevant copyrighted works, though statutory damages could be severe. Given the existing case law, there are almost certainly other ways of obtaining the same data that would be permissible, even if it is held that using a dataset of available internet materials like Common Crawl is impermissible.
Issue 2: Inference
The second question is whether the training and inference of the models is an infringing use. There are two flavors to this question.
The first version is the contention that all inference, because it is based on training data, is an infringing, non-transformative use. This is difficult to square with existing caselaw. The most damning case for copyright holders is Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), also known as the Google Books case, where the Second Circuit held that Google’s book-search product did not engage in copyright infringement, neither for the literal copying of books to create its database nor for its “snippet” feature that allowed users to search a book for any passage and see that snippet in the context of the book without recreating the entire book. It is difficult to see how inference is any less transformative than this use; if it is at least as transformative, something resembling the transitive property suggests it is noninfringing as well.7
Although Google Books is possibly claim-ending for the New York Times, especially since both cases sit in the Second Circuit, it does not bind the Ninth Circuit. Even so, this argument seems to be losing in court there as well. The Silverman lawsuit against Meta over its LLaMA system, for example, was largely dismissed on the grounds that no substantial similarity had been shown between the works and the outputs. Similarly, a less buzzy lawsuit against text-to-image providers like Stability AI was dismissed almost in its entirety, leaving only a single copyright infringement claim. If a use is transformative and stands on its own, that is often enough to find fair use. In Warner Bros. Entertainment, Inc. v. RDR Books, 575 F. Supp. 2d 513 (S.D.N.Y. 2008), for example, the court recognized that a Harry Potter encyclopedia served a transformative purpose even though the characters and storylines are, obviously, copyrighted (though that particular encyclopedia copied too much verbatim text to ultimately qualify as fair use).
Although much of the public debate among lawyers has treated transformative use as determinative, it is not actually dispositive. In its most recent term, the Supreme Court held in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023), that transformativeness is not a catch-all that can defeat any copyright infringement claim: even though Warhol’s silkscreen added new expression to Lynn Goldsmith’s photograph of Prince, the accused use served substantially the same commercial purpose as the photograph, and so could still infringe. Foundational model providers therefore should not assume that a determination that inference is transformative will guarantee a finding that they did not infringe. In copyright law, no case, even one as important as Google Books, is the final word. One way this could go against the LLM providers would be for courts to say that while the LLM itself is a transformative use, individual instances of inference are not.
The commercial impact is often the most important factor in fair use analyses, and that will likely be true here as well. It was the instrumental factor in Hustler Magazine, Inc. v. Moral Majority, Inc., 606 F. Supp. 1526 (C.D. Cal. 1985), for example, which involved not only copying an article from Hustler Magazine but distributing it as part of a political fundraising campaign. For copyright holders like the New York Times, it is highly unlikely that ChatGPT will have any impact on the market for back issues of the newspaper. But plaintiffs suing image generators have a stronger argument that tools like DALL-E reduce the demand for humans who create, for example, marketing images. More broadly, there are fair use cases that focus on the demand for an individual work, like Hustler, and cases that focus on overall demand in a market, like Google Books. It is unclear which standard a judge would apply in this context, but because AI is a broad enabling technology, it seems more likely that the latter would be applied. Even then, it is not clear which way a court would hold, and this is one of the biggest risks for foundational model providers.
And indeed, the second flavor of this question is “regurgitation,” or the ability to cause an AI system to reproduce part of its training set. The now-famous Exhibit J of the New York Times lawsuit was by far the most stunning part of any AI lawsuit to date: the Times showed that it could cause ChatGPT to reproduce sections of published Times articles, in sharp contrast to the largely dismissed Stable Diffusion case, where federal Judge William Orrick wrote that “none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data.”4
One complication is the specific kind of prompting required to produce regurgitation. Producing these regurgitations required the New York Times to enter several sentences of its own articles, some of which may be accessible without a paywall. On the one hand, this is itself a form of copying, even if it uses parts of the article that are visible to non-subscribers. On the other hand, this is a very contrived type of prompt, unlike the ways that someone would actually use the product. And indeed, OpenAI responded to the Times lawsuit this week, alleging that Exhibit J was deceptive: that it placed ellipses in (in)convenient places to make OpenAI’s output appear sequential despite not being a literal copy, and that the New York Times cherry-picked embarrassing results out of “tens of thousands of attempts,” among other issues. The first question is whether OpenAI’s claims are true, but the second question is whether this even matters. Copyright is a strict liability regime, and the standard is “substantial” similarity. More on this below.
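To make the mechanics concrete, here is a minimal sketch of the kind of prefix-prompting at issue. It is purely illustrative, and everything in it is a stand-in: it uses the small, open GPT-2 model via the Hugging Face transformers library rather than ChatGPT, and a public-domain opening line rather than a Times article; it is not a reconstruction of the prompts in Exhibit J.

```python
# Illustrative sketch of the "regurgitation" technique: prompt a model with the
# opening of a text it may have seen in training and ask for a greedy continuation.
# GPT-2 and the public-domain prompt are stand-ins, not the actual parties' systems.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Several sentences of the source text serve as the prompt, mirroring the
# technique described in the complaint (here, the opening of A Tale of Two Cities).
prompt = (
    "It was the best of times, it was the worst of times, it was the age of "
    "wisdom, it was the age of foolishness, it was the epoch of belief, it was "
    "the epoch of"
)

# Greedy decoding (do_sample=False) picks the single most likely next token at each
# step; the more strongly a passage is memorized, the more verbatim the continuation.
result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```

Whether the continuation comes out verbatim depends on how strongly the passage was memorized, which is exactly what the parties are fighting about in characterizing Exhibit J.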
There is the additional, sci-fi question of whether these copies are even being stored at all. LLMs do not store literal copies of works in a database; rather, training adjusts networks of statistical weights that connect certain words together,5 with words represented internally as numerical vectors, or “embeddings.” Though this is speculative, a court could decide that this is an important element, because infringement requires access in addition to substantial similarity; while the latter is not much in doubt, the technical nature of these representations may mean that there is no access.
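For the technically inclined, here is a minimal sketch of why that is. It uses PyTorch and an invented toy model (a stand-in, obviously, for any production LLM): after training, the only artifact that persists is a set of floating-point weight tensors, and the training text itself appears nowhere in them.

```python
# Toy illustration (hypothetical model, not any real LLM): after training,
# the saved artifact is only numeric weight tensors, with no strings anywhere.
import torch
import torch.nn as nn

text = "the quick brown fox jumps over the lazy dog"
words = text.split()
vocab = sorted(set(words))
stoi = {w: i for i, w in enumerate(vocab)}  # word -> integer id

class TinyLM(nn.Module):
    """A next-word predictor: an embedding table plus a linear output layer."""
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # words as vectors ("embeddings")
        self.head = nn.Linear(dim, vocab_size)      # vectors -> next-word scores

    def forward(self, ids):
        return self.head(self.embed(ids))

model = TinyLM(len(vocab))
ids = torch.tensor([stoi[w] for w in words])
opt = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(200):  # fit the toy model to predict each next word in the sentence
    loss = nn.functional.cross_entropy(model(ids[:-1]), ids[1:])
    opt.zero_grad(); loss.backward(); opt.step()

# What gets stored: tensors of floats; the sentence itself is nowhere in the artifact.
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape), tensor.dtype)
```

The legal tension is visible even in this toy: nothing in the saved weights resembles the sentence, yet the weights were fit to its statistics, which is how a sufficiently large model can come to regurgitate passages it saw often enough.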
Adding to the potential of an infringement finding, Google Books is not even the last word on transformative use in the Second Circuit. Associated Press v. Meltwater U.S. Holdings, Inc., 931 F. Supp. 2d 537 (S.D.N.Y. 2013) held that a product that sent out snippets of AP newswire stories was infringing. The main difference from Google Books seems to have been the commercial impact factor: Meltwater attempted to compete directly with AP’s own service. Similarly, Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018) held that a product that allowed customers to share clips from TV stations was not fair use even though its search function was, focusing on market impact as the “most important” factor, in part because TVEyes produced 10-minute clips that could be stitched together to, effectively, reproduce an entire TV program without any safeguards, and indexed each show for users to easily find. These cases can be read as either helpful or harmful for companies like OpenAI: on the one hand, it is clearly possible to use these models to create infringing content, like images of copyrighted cartoon characters. On the other hand, most LLM providers have protections in place and do not offer a selection of copyrighted material on tap the way TVEyes and Meltwater did. This speaks to the unpredictability of fair use, and of how transformative a work has to be to receive protection, which is likely the other biggest risk for tech companies.
Issue 3: Responsibility
This leads us to our last issue: who is responsible. This final question may also be the most important even though it has been the least discussed. The iconic Supreme Court case on infringement-enabling technologies is Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984), also known as the Betamax case, which held that Sony was not liable for infringement by its Betamax technology, primarily because the machine was overwhelmingly put to personal, noninfringing uses like time-shifting. Yet in A&M Records, Inc. v. Napster, Inc., 239 F.3d 1004 (9th Cir. 2001), the Ninth Circuit held that Betamax did not protect Napster because the principal purpose of Napster as a service was copyright evasion; even then, however, the Ninth Circuit did not hold that peer-to-peer file sharing was itself an infringing technology, even though it involves copying. In other words, the courts have drawn a distinction between infringing technologies and infringing services.
Although copyright is a strict liability regime, in practice, enabling technologies may enjoy protection for want of volitional conduct in the chain of infringement. The key to this issue is that even if the courts hold that there can be infringement in the case of inference, they may determine that the responsibility lies not with foundational model providers, who are tool providers, but rather with the customer who uses the tool for an infringing purpose. In that case, the courts would be holding that foundational models are more like Betamax; if not, they would have to decide that foundational models are more like Napster.
The foundational concept here, as mentioned above, is “volitional conduct.” Courts have generally held that to be found liable for copyright infringement, there must, essentially, be an affirmative act taken by the alleged infringer. As the court in Cartoon Network, LP v. CSC Holdings, Inc., 536 F.3d 121 (2d Cir. 2008) put it, “when there is a dispute as to the author of an allegedly infringing instance of reproduction, Netcom and its progeny direct our attention to the volitional conduct that causes the copy to be made.” And although copyright infringement does not formally include intent as a factor, in a practical sense this was the main difference between Betamax and Napster. Sony created a technology to let people enjoy content in their homes; Napster proudly made it easy for people to avoid buying CDs. The Ninth Circuit even specifically pointed to Napster’s conduct and blatant disregard for copyright in its opinion.
This particular factor is likely to play in OpenAI’s favor. It has stated publicly that it treats regurgitation as a bug it is working to “drive to zero.” The fact that the bug is “rare” may also support a de minimis defense, under which an alleged infringer finds relief because it displays only a small portion of the work, as in Sandoval v. New Line Cinema Corp., 147 F.3d 215 (2d Cir. 1998); this was also a key factor in the Google Books case. Although foundational model providers undoubtedly engage in volitional conduct when they select training materials, it is far less clear that they engage in volitional conduct in the inference, or use, of their products, especially because the user must specifically prompt a system to create an infringing result.
This issue is also quite pivotal because the tenor of any ruling here will have a great impact on what the world of LLMs looks like, regardless of which party wins. If LLM providers are held to be totally off the hook, even for infringing uses of their technology, the result may be tools with fewer safeguards. After all, you are allowed to draw a picture of Mario at home for your personal use; you just can’t create a Mario video game in order to sell it. On the other hand, if courts say that LLM technology isn’t inherently infringing but that providers are on the hook to minimize these errors, we may see commercial LLM services become even more restrictive of the use of copyrighted (and trademarked!) content, even for non-infringing use cases like fan fiction. The DMCA causes platforms like YouTube to be so overly cautious that they will initially comply with almost any takedown notice; similarly, AI companies may become excessively cautious, for example refusing to identify a character from a copyrighted movie.6
As a practical matter, there will likely be an international escape hatch if any company is found liable. Japan’s government recently declared that AI training is not copyright infringement, while in some countries, like the UK, the question remains unsettled. Even if American courts decide to be hostile to machine learning and training, it may be possible simply to train a model on servers located in Japan or another friendly jurisdiction.
Weekly Novelties
The head of the EPO pleaded with the European Commission to “press pause” on their new SEP rules (Euro News)
At the same time, the EU is pursuing a Unitary Patent, which the head of the EPO called a “game changer” (Euro News)
In Purdue Pharma, L.P. v. Collegium Pharmaceutical, Inc., Case No. 22-1482 (Fed. Cir. 2023), the Federal Circuit held that the PTAB retains its authority to issue a Final Written Decision even after the statutory deadline has passed, and that decision will be issued soon (Morgan Lewis)
In Canada, the first decision has come down applying the “due care” standard that allows patentees, in limited circumstances, to revive a patent where they failed to pay maintenance fees. The bar is high, but it is there (Norton Rose Fulbright)
PTAB denied institution of a petition for IPR on the basis that information on the Wayback Machine wasn’t “publicly accessible” even though the Wayback Machine is on the internet (JD Supra)
A profile of TD Commons, through which inventors from over 150 organizations open-source inventions without a patent, publishing them as prior art to fight patent trolls. Interestingly, it is funded primarily by Google (Wired Magazine)
To be fair to fair use, it does have its defenders!
In the interest of full disclosure, I have criticized these suits before as unlikely to succeed. My personal view on the policy of this topic is close to Vinod Khosla’s (but slightly less extreme). However, like anyone with a JD, I am capable of putting my views aside to assess a legal matter objectively. That is the purpose of this piece: to share my view of what the law says and which claims are most likely to succeed. There has also been a lot of non-expert opinion, for example this piece by Gary Marcus. My aim is to put out a useful article.
In the interest of further disclosure, you should also know that I lean towards legal realism.
This is somewhat contradicted by the Times later remarking that inaccuracies and hallucinations could tarnish their brand, but I digress.
There is one case dealing with this topic in another category: thumbnails. Thumbnails were held to constitute fair use because they were intended to make the original works easier to find and were of lower quality. Kelly v. Arriba Soft Corp., 336 F.3d 811 (9th Cir. 2003). The situation is similar in the sense that it addresses the fair use status of compression, which produces a statistical representation of a copyrighted work (in this case, an image). It differs, however, in the use case: a thumbnail algorithm yields an artifact much closer to the original file, yet one that cannot possibly be a literal copy. In other words, there are facts that cut for and against both sides, as well as facts that distinguish these cases, and the results can be quite unpredictable. Even loading software into RAM was once held to be a form of copying, in MAI Systems Corp. v. Peak Computer, Inc., 991 F.2d 511 (9th Cir. 1993), until Congress overrode that holding by amending 17 U.S.C. § 117.
It is very non-obvious what the right policy goal is here. How much do we want AI systems to comply with requests? How many assumptions do we want AI models to make in deciding when to apply a guardrail? How much do we want large language models to just be tools, as opposed to tools that regulate us? How do we decide what’s allowed and what’s “safe?” My purpose in raising these implications is this: although courts are not supposed to consider policy implications like these, in blockbuster cases they often do. One recent article observed that courts may find a way to split the baby simply because there is now so much commercial value in large language models. They may also determine that the policy implications are so great that the matter should be left to Congress. Regardless, it seems to me that the question of responsibility is actually the issue with the most real-world implications for model development and guardrails going forward.
"The Second Circuit held that Google’s book product did not engage in copyright infringement, neither for the literal copying of books to create its database nor for its “snippet” feature that allowed users to search a book for any passage and display that snippet in the context of the book without recreating the entire book. It is difficult to see how inference is not more transformative than this use."
This doesn't seem that difficult to me. A search engine is a means for finding new books, clearly different from the original. In contrast, generative models can be used to create works that compete directly with the works they were trained on; some image models even allow you to ask for works "in the style of" their training data. It's easy to imagine a court holding that this cuts against a finding of transformativeness and, even more importantly, that it counts against the defendants on the "effect on the market" factor.