Transcript: How AI Models Steal Creative Work — and What to Do About It: Ed Newton-Rex

Read the full transcript of AI expert Ed Newton-Rex’s talk titled “How AI Models Steal Creative Work — and What to Do About It” at TEDAI San Francisco on October 22, 2024.

TRANSCRIPT:

The Problem with AI Training Data

ED NEWTON-REX: The technology and vision behind generative AI is amazing, but stealing the work of the world’s creators to build it is not. There are three key resources that AI companies need to build their models: people, compute, and data. That is, engineers to build the models, GPUs to run the training process, and data to train the models on. AI companies spend vast sums on the first two, sometimes a million dollars per engineer and up to a billion dollars per model, but they expect to take the third resource, training data, for free.

Right now, many AI companies train on creative work they haven’t paid for or even asked permission to use. This is unfair and unsustainable. But if we reset and license our training data, we can build a better generative AI ecosystem that works for everyone, both the AI companies themselves and the creators, without whose work these models would not exist.

Most AI companies today do not license the majority of their training data. They use web scrapers to find, download, and train on as much content as they can gather. They’re often pretty secretive about what they do train on, but what’s clear is that training on copyrighted work without a license is rife. For instance, when the Mozilla Foundation looked at 47 large language models released between 2019 and 2023, they found that 64% of them were trained in part on Common Crawl, a dataset that includes copyrighted works, such as newspaper articles from major publications, and a further 21% didn’t reveal enough information to know either way.

AI Competes With Its Training Data

Training on copyrighted work without a license has rapidly become standard across much of the generative AI industry.