While LLMs have been riding a unicorn hype wave, the participants in the training set have started to sue our future robot overlords. Sarah Silverman sued OpenAI for possibly using her book to train GPT, on the heels of a class action lawsuit filed against Github; both allege copyright infringement. So far neither case has been dismissed, but in the end, sorry guys, they are going to get crushed.
Paintings, jokes, and other creative acts are protected by an intellectual property regime called copyright. Copyright is the easiest type of intellectual property to get, and in exchange it offers the weakest protection. The basic outline of copyright is this: if you have a creative work, people cannot copy it, distribute it, and so on. You get the right to make the sequels, translations, and so on, which are called derivative works. You can even stop people from adapting it from one format to another or using your work in certain noncommercial contexts like political rallies.
The thing about copyright is that, as the name suggests, it is all about stopping people from making copies of the work; importantly, you cannot copyright an idea. Therefore, you can’t stop people from creating their own creative works, like lists that mention your works or analyses of your creations, which is called transformative work.1 Parodies, for example, are transformative because even though they often involve the use of copyrighted material, they transform the material in a permitted way.
The AI companies are going to wipe the floor with these litigants using copyright law as their towel because it’s basically impossible to argue that machine learning isn’t transformative use. Somehow, you can mix Bo Burnham jokes with Picasso paintings and out can come ideas to help you with your new microfluidics project. If that isn’t transformative, I don’t know what is, and it’s the reason neither complaint even mentions the issue. And as for making copies of the works for training sets, copying alone isn’t always illegal, and any sane court is going to grant that this is fair use. Authors Guild v. Google, the Google Books case, held that the book scanning was fair use.2 Google Books to this day displays literal snippet views of books, which was held to be transformative; LLMs produce original, nondeterministic text, which is clearly even more so. This is, of course, a very simplified analysis (and not legal advice!) that is nowhere near the level of a legal brief, so take it with a grain of salt.
It is deeply ironic that this lawsuit was filed by Sarah Silverman, because no category of creator is more dependent on fair use than comedians, who rely on parody, a type of transformative use, to have a job. Make no mistake, this class action suit is an attempt to shoehorn having a copy of her book, which they are allowed to do, into a legal argument that LLMs are derivative, rather than transformative, works. It looks both weak and a little insane.
What all of these people really want is a patent. Patents have a totally different set of rights intended to incentivize different behaviors. They are designed for the field of engineering, where people discover and invent technologies, sometimes at great expense, and would otherwise keep their inventions secret. The societal trade with patents is the absolute right to stop anyone from commercially using your idea in exchange for teaching the public how it works in excruciating detail.3 Copyrights are porous because the exceptions are legion, but if you are within the four corners of a patent, the only question is how much you’ll be paying.
Most pertinently, patents have no concept of transformative use. If you invent something and patent it, you can “block” people from using your idea even if it is part of their own completely new inventions. We allow this with patents because lots of important inventions that deserve compensation aren’t products per se. But copyright is brittle because creativity is part of the human condition. We want people to be able to express themselves, and references are an important part of that, which means that the bounds of the protection are deliberately limited. The whole point of the OpenAI lawsuits is that the litigants don’t believe that spirit should apply to machine learning.
Is there a way out of this for Sarah Silverman? Sure. Intellectual property is entirely created by legislatures. There is no common law concept of IP, at all, in any country, and no Constitution anywhere imposes restrictions on what those rules must look like. The divisions between copyright and patent are conventions we have developed because it is commonly accepted that the creative arts and engineering both deserve IP protection, but have different contexts and incentives that merit different regimes.4 This has been the balance for centuries, but what Congress has created, Congress can change.
Copyright could get yet another exception, one that carves machine learning use of copyrighted material out of fair use. It would just require another law. Congress can govern that exception any way it likes; it could even make the exception look more patent-like if it wanted. The basic premise would be that if you are using a copyrighted work to create a software product, or a machine learning product, or something like that, that use would not get the transformative use or fair use get-out-of-jail-free card it would normally enjoy under copyright.
I don’t see this happening, frankly. There is likely not much appetite for changes to copyright in the United States; the last change to the Copyright Act was nicknamed the Mickey Mouse Protection Act, and Congress hates Hollywood right now, so you do the math. I don’t think anywhere else would do it, either. Most Asian countries are pro-AI, so they would likely not implement changes to copyright that discourage machine learning. The EU is very willing to go its own way on regulating AI, and it does have an expanded version of copyright called moral rights, so it is clearly not averse to a stricter notion of copyright. But modifying copyright in that way would undermine over a century of settled business models and hurt homegrown EU data companies, which they would be loath to do.
The alternative is just regulating machine learning directly, which might be what Eliezer Yudkowsky wants for other reasons, but that isn’t happening, either. How do you determine the licensing value of a single work? So many features now use machine learning; who would be subject to the regulation? What if it just makes it impossible to offer those features in certain places? That’s what happened with link taxes and almost what happened with GDPR. Can you get around the rules by just doing the training elsewhere? It quickly becomes unworkable.
And frankly, it would be undesirable. We want machines to be able to learn from the world, and creative works are part of it. The point of all IP is to allow you to protect the products that come out of your mind, not to stake a claim on the inventiveness of society as a whole. Creating a machine-learning creative-rights patent would be anathema to the idea of creating new things, and to learning itself. It would be madness.
There will ultimately be some commercial arrangement that enables creators to contribute to the AI revolution and get paid, but being part of a large corpus that enables LLMs to work well isn’t it. Most likely, as in everything else relating to Hollywood, it will have to do with official branding and merch.
2. It was so clear that it was granted on summary judgment and unanimously affirmed on appeal. That means the judge thought the case so obviously favored Google that he did not want to waste a jury’s time, and that a panel of the judge’s seniors agreed. Of course, the Google Books case was in the Second Circuit and these cases were filed in the Ninth Circuit, so the Google Books case doesn’t bind. The Github and Silverman cases were both filed by the same attorney, who is no doubt aware of this case and likely trying to avoid its precedential power.
3. Unlike copyrights, where your creation is presumed to be copyrightable, inventions are presumed to be unpatentable, and you have to earn your patent, again at great expense. Patent examiners may disagree with you, and it can take years just to get a patent. The main reason for this is the long history of inventors keeping things secret. Until the Venetians invented patents, most inventions were kept as trade secrets among craftsmen. They often died with their workshops or guilds; we still don’t know how most Roman inventions worked. This was terrible for science, and it meant the only way to protect an invention was as a trade secret. There are many issues with patents, but the alternative is worse.
4. It is also a convention that IP is one size fits all. If you have a patent, for example, it is just a patent. There is no such thing as a drug patent vs. a semiconductor patent. People have proposed versions of this over the years, most notably different patent terms, but the fundamental problem is that if there is ever a gray area, people will fight like hell to argue that they are in the more favorable category. So if helicopter patents were easier to get than plane patents, you can bet Boeing would suddenly start saying it makes helicopters (at least when talking to the USPTO). And the patent office backlog is getting worse every year. Do we want examiners to have one more thing to fight about with patent applicants?