
Transcript: How AI Models Steal Creative Work — and What to Do About It: Ed Newton-Rex

Read the full transcript of AI expert Ed Newton-Rex’s talk titled “How AI Models Steal Creative Work — and What to Do About It” at TEDAI San Francisco on October 22, 2024.

TRANSCRIPT:

The Problem with AI Training Data

ED NEWTON-REX: The technology and vision behind generative AI is amazing, but stealing the work of the world’s creators to build it is not. There are three key resources that AI companies need to build their models: people, compute, and data. That is engineers to build the models, GPUs to run the training process, and data to train the models on. AI companies spend vast sums on the first two, sometimes a million dollars per engineer and up to a billion dollars per model, but they expect to take the third resource, training data, for free.

Right now, many AI companies train on creative work they haven’t paid for or even asked permission to use. This is unfair and unsustainable. But if we reset and license our training data, we can build a better generative AI ecosystem that works for everyone, both the AI companies themselves and the creators, without whose work these models would not exist.

Most AI companies today do not license the majority of their training data. They use web scrapers to find, download, and train on as much content as they can gather. They’re often pretty secretive about what they do train on, but what’s clear is that training on copyrighted work without a license is rife. For instance, when the Mozilla Foundation looked at 47 large language models released between 2019 and 2023, they found that 64% of them were trained in part on Common Crawl, a dataset that includes copyrighted works, such as newspaper articles from major publications, and a further 21% didn’t reveal enough information to know either way.

AI Competes With Its Training Data

Training on copyrighted work without a license has rapidly become standard across much of the generative AI industry. But this unlicensed training on creative work has serious negative consequences for the people behind that work, and this is for the simple reason that generative AI competes with its training data. This is not the narrative that AI companies like to portray. We like to talk about democratization, about letting more people be creative, but the fact that AI competes with its training data is inescapable.

A large language model trained on short stories can create competing short stories. An AI image model trained on stock images can create competing stock images. An AI music model trained on music that’s licensed to TV shows can create competing music to license to TV shows. These models, however imperfect, are so quick and easy to use that this competition is inevitable.

And this isn’t just theoretical. Generative AI is still pretty new, but we’re already seeing exactly the sort of effects you’d expect in a world in which generative AI competes with its training data. For instance, the well-known filmmaker Ram Gopal Varma recently said that he’ll use AI music in all his projects going forward. Indeed, there are multiple reports of people starting to listen to AI music in place of human-produced music, and recently an AI song hit number 48 in the German charts. In all these cases, AI music is competing with the songs it was trained on.

Real-World Impacts on Creators

Or take Kelly McKernan. Kelly is an artist from Nashville. For 10 years, they made enough money selling their work that art was their full-time income. But in 2022, a data set that included their works was used to train a popular AI image model. Their name was one of many used by huge numbers of people to create art in the style of specific human artists. Kelly’s income fell by 33% almost overnight. Illustrators around the world report similar stories of being out-competed by AI models they have reason to believe were trained on their work.

The freelance platform Upwork wrote a white paper examining the effects of generative AI on the job market. They looked at how job postings on their platform have changed since the introduction of ChatGPT, and sure enough, they found exactly what you’d expect: generative AI has reduced the demand for freelance writing tasks by 8%, a figure that rises to 18% if you look only at what they term lower-value tasks.

So the initial data we have, plus the individual stories we hear, all align with the logical assumption: generative AI competes with the work it’s trained on. It’s so quick and easy to use, it’s inevitable, and it competes with the people behind that work.

The Legal Question of Copyright

Now, creators argue this training is illegal. The legal framework of copyright affords creators the exclusive right to authorize copies of their work, and AI training involves copying. Here in the US, many AI companies argue that training AI falls under the fair use copyright exception, which allows unlicensed copying in a limited set of circumstances, such as creating parodies of a work.

Creators and rights holders strongly disagree, saying there’s no way this narrow exception can be used to legitimize the mass exploitation of creative work to create automated competitors to that work. And for the record, I entirely agree. Of course, this question has not previously been tested in the courts, and there are currently around 30 ongoing lawsuits brought by rights holders against AI companies, which will help to address it. But this will take time, and creators are suffering from what they see as unjust competition right now.

Licensing as a Solution

So they propose a solution that has been used and worked before: licensing. If a commercial entity wants to use copyrighted work, be it for merchandise manufacturing or building a streaming service, they license that work.

Now, AI companies have a bunch of reasons why this shouldn’t apply to them. There’s the fair use legal exception that I’ve already mentioned. There’s also the argument that since humans can train on copyrighted work without a license, AI should be allowed to too.