Here is the full transcript of AI pioneer Fei-Fei Li’s talk titled “With Spatial Intelligence, AI Will Understand The Real World,” delivered at the TED 2024 conference.
TRANSCRIPT:
The World 540 Million Years Ago
Let me show you something. To be precise, I’m going to show you nothing. This was the world 540 million years ago. Pure, endless darkness.
It wasn’t dark due to a lack of light. It was dark because of a lack of sight. Although sunshine did filter 1,000 meters beneath the surface of the ocean, and light permeated from hydrothermal vents to a seafloor brimming with life, there was not a single eye to be found in these ancient waters. No retinas, no corneas, no lenses.
So all this light, all this life went unseen. There was a time when the very idea of seeing didn’t exist. It had simply never been done before. Until it was.
The Emergence of Trilobites
So for reasons we’re only beginning to understand, trilobites, the first organisms that could sense light, emerged. They’re the first inhabitants of this reality that we take for granted. First to discover that there is something other than oneself. A world of many selves.
The ability to see is thought to have ushered in the Cambrian explosion, a period in which a huge variety of animal species entered the fossil record. What began as a passive experience, the simple act of letting light in, soon became far more active. The nervous system began to evolve. Sight turning to insight.
Understanding led to actions. And all these gave rise to intelligence. Today, we’re no longer satisfied with just nature’s gift of visual intelligence. Curiosity urges us to create machines to see just as intelligently as we can, if not better.
The Convergence of Neural Networks, GPUs, and Big Data
Nine years ago, on this stage, I delivered an early progress report on computer vision, a subfield of artificial intelligence. Three powerful forces converged for the first time: a family of algorithms called neural networks; fast, specialized hardware called graphics processing units, or GPUs; and big data.
Like the 15 million images my lab spent years curating, called ImageNet. Together, they ushered in the age of modern AI. We’ve come a long way. Back then, just putting labels on images was a big breakthrough.
But the speed and accuracy of these algorithms improved rapidly. The annual ImageNet challenge, led by my lab, gauged this progress. And on this plot, you’re seeing the annual improvement and the milestone models. We went a step further and created algorithms that can segment objects or predict the dynamic relationships among them, in these works done by my students and collaborators.
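To make the idea of “putting labels on images” concrete, here is a minimal sketch of ImageNet-style classification using an off-the-shelf pretrained ResNet from the open-source torchvision library. The model choice, preprocessing, and file name are illustrative assumptions, not the pipeline used in the ImageNet challenge itself.

```python
# Illustrative sketch: label an image with a ResNet pretrained on ImageNet.
# The model, library, and input file are assumptions for illustration only.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg")            # hypothetical input photo
batch = preprocess(image).unsqueeze(0)   # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)

label_id = logits.argmax(dim=1).item()   # index into the 1,000 ImageNet classes
print(label_id)
```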
From Image Captioning to Text-to-Image Generation
And there’s more. Recall last time I showed you the first computer-vision algorithm that could describe a photo in natural human language. That was work done with my brilliant former student, Andrej Karpathy. At that time, I pushed my luck and said, “Andrej, can we make computers do the reverse?”
And Andrej said, “Ha ha, that’s impossible.” Well, as you can see from this post, recently the impossible has become possible. That’s thanks to a family of diffusion models that power today’s generative AI algorithms, which can take human-prompted sentences and turn them into photos and videos of something entirely new. Many of you have seen the recent impressive results of Sora by OpenAI.
But even without an enormous number of GPUs, my student and our collaborators developed a generative video model called Walt months before Sora. And you’re seeing some of these results. There is room for improvement. I mean, look at that cat’s eye, and the way it goes under the wave without ever getting wet. What a cat-astrophe.
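For readers curious how the text-to-image side of this works in practice, here is a hedged sketch using the open-source Hugging Face diffusers library and a publicly available Stable Diffusion checkpoint. This is a stand-in illustration of the diffusion-model family mentioned in the talk, not Sora or Walt, and the checkpoint name and prompt are assumptions for the example.

```python
# Illustrative text-to-image sketch with a diffusion model (Stable Diffusion
# here is a stand-in for the diffusion-model family described in the talk).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",   # publicly available checkpoint (assumed)
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                    # a GPU is strongly recommended

prompt = "a cat riding an ocean wave at sunset"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("cat_wave.png")
```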
The Virtuous Cycle of Seeing and Doing
And if past is prologue, we will learn from these mistakes and create a future we imagine. And in this future, we want AI to do everything it can for us, or to help us. For years I have been saying that taking a picture is not the same as seeing and understanding. Today, I would like to add to that. Simply seeing is not enough.
Seeing is for doing and learning. When we act upon this world in 3D space and time, we learn, and we learn to see and do better. Nature has created this virtuous cycle of seeing and doing powered by “spatial intelligence.” To illustrate to you what your spatial intelligence is doing constantly, look at this picture. Raise your hand if you feel like you want to do something.
The Emergence of Spatial Intelligence in AI
In the last split second, your brain looked at the geometry of this glass, its place in 3D space, its relationship with the table, the cat and everything else. And you can predict what’s going to happen next. The urge to act is innate to all beings with spatial intelligence, which links perception with action. And if we want to advance AI beyond its current capabilities, we want more than AI that can see and talk.
We want AI that can do. Indeed, we’re making exciting progress. The recent milestones in spatial intelligence are teaching computers to see, learn, do, and learn to see and do better. This is not easy.
It took nature millions of years to evolve spatial intelligence, which depends on the eye taking in light, projecting 2D images onto the retina, and the brain translating these data into 3D information. Only recently has a group of researchers from Google been able to develop an algorithm that takes a bunch of photos and translates them into 3D space, like the examples we’re showing here. My student and our collaborators have taken a step further and created an algorithm that takes one input image and turns it into a 3D shape. Here are more examples.
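As a rough illustration of how a “bunch of photos” can become 3D, here is a minimal sketch of the volume-rendering step at the heart of NeRF-style photos-to-3D methods: colors and densities sampled along a camera ray are alpha-composited into a pixel, and comparing rendered pixels against the input photos is what drives the 3D reconstruction. This simplified formulation is an assumption for illustration, not the specific algorithm from the Google researchers or from my lab.

```python
# Illustrative sketch of NeRF-style volume rendering along a single camera ray.
# A simplified formulation for intuition, not any particular published system.
import numpy as np

def composite_ray(colors, densities, deltas):
    """Alpha-composite samples along one ray into a single pixel color.

    colors:    (N, 3) RGB predicted at each sample point along the ray
    densities: (N,)   volume density (sigma) at each sample point
    deltas:    (N,)   distance between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)                        # opacity of each segment
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))    # light surviving to each sample
    weights = trans * alphas                                          # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                    # final pixel color

# Tiny usage example with made-up samples along one ray.
rgb = composite_ray(
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    densities=np.array([0.1, 2.0, 5.0]),
    deltas=np.array([0.5, 0.5, 0.5]),
)
print(rgb)
```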
From Text to 3D Scene Generation
Recall, we talked about computer programs that can take a human sentence and turn it into videos.