Here is the full transcript of AI pioneer Fei-Fei Li’s talk titled “With Spatial Intelligence, AI Will Understand The Real World,” delivered at the TED 2024 conference.
TRANSCRIPT:
The World 540 Million Years Ago
Let me show you something. To be precise, I’m going to show you nothing. This was the world 540 million years ago. Pure, endless darkness.
It wasn’t dark due to a lack of light. It was dark because of a lack of sight. Although sunshine did filter 1,000 meters beneath the surface of the ocean, and light permeated from hydrothermal vents to a seafloor brimming with life, there was not a single eye to be found in these ancient waters. No retinas, no corneas, no lenses.
So all this light, all this life went unseen. There was a time when the very idea of seeing didn’t exist. It had simply never been done before. Until it was.
The Emergence of Trilobites
So for reasons we’re only beginning to understand, trilobites, the first organisms that could sense light, emerged. They’re the first inhabitants of this reality that we take for granted. First to discover that there is something other than oneself. A world of many selves.
The ability to see is thought to have ushered in the Cambrian explosion, a period in which a huge variety of animal species entered the fossil record. What began as a passive experience, the simple act of letting light in, soon became far more active. The nervous system began to evolve. Sight turning to insight.
Understanding led to actions. And all these gave rise to intelligence. Today, we’re no longer satisfied with just nature’s gift of visual intelligence. Curiosity urges us to create machines to see just as intelligently as we can, if not better.
The Convergence of Neural Networks, GPUs, and Big Data
Nine years ago, on this stage, I delivered an early progress report on computer vision, a subfield of artificial intelligence. Three powerful forces converged for the first time: a family of algorithms called neural networks; fast, specialized hardware called graphics processing units, or GPUs; and big data.
Like the 15 million images my lab spent years curating, called ImageNet. Together, they ushered in the age of modern AI. We’ve come a long way. Back then, just putting labels on images was a big breakthrough.
But the speed and accuracy of these algorithms improved rapidly. The annual ImageNet challenge, led by my lab, gauged this progress. And on this plot, you’re seeing the annual improvement and the milestone models. We went a step further and created algorithms that can segment objects or predict the dynamic relationships among them, in these works done by my students and collaborators.
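To make the idea of “putting labels on images” concrete, here is a minimal sketch of ImageNet-style classification using an off-the-shelf pretrained ResNet from the open-source torchvision library. The model choice, preprocessing, and file name are illustrative assumptions, not the pipeline used in the ImageNet challenge itself.

```python
# Illustrative sketch: label an image with a ResNet pretrained on ImageNet.
# The model, library, and input file are assumptions for illustration only.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg")            # hypothetical input photo
batch = preprocess(image).unsqueeze(0)   # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)

label_id = logits.argmax(dim=1).item()   # index into the 1,000 ImageNet classes
print(label_id)
```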
From Image Captioning to Text-to-Image Generation
And there’s more. Recall last time I showed you the first computer-vision algorithm that could describe a photo in natural human language. That was work done with my brilliant former student, Andrej Karpathy. At that time, I pushed my luck and said, “Andrej, can we make computers do the reverse?”
And Andrej said, “Ha ha, that’s impossible.” Well, as you can see from this post, recently the impossible has become possible. That’s thanks to a family of diffusion models that power today’s generative AI algorithms, which can take human-prompted sentences and turn them into photos and videos of something entirely new. Many of you have seen the recent impressive results of Sora by OpenAI.
But even without an enormous number of GPUs, my student and our collaborators developed a generative video model called Walt months before Sora. And you’re seeing some of these results. There is room for improvement. I mean, look at that cat’s eye, and the way it goes under the wave without ever getting wet. What a cat-astrophe.
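For readers curious how the text-to-image side of this works in practice, here is a hedged sketch using the open-source Hugging Face diffusers library and a publicly available Stable Diffusion checkpoint. This is a stand-in illustration of the diffusion-model family mentioned in the talk, not Sora or Walt, and the checkpoint name and prompt are assumptions for the example.

```python
# Illustrative text-to-image sketch with a diffusion model (Stable Diffusion
# here is a stand-in for the diffusion-model family described in the talk).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",   # publicly available checkpoint (assumed)
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                    # a GPU is strongly recommended

prompt = "a cat riding an ocean wave at sunset"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("cat_wave.png")
```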
The Virtuous Cycle of Seeing and Doing
And if past is prologue, we will learn from these mistakes and create a future we imagine. And in this future, we want AI to do everything it can for us, or to help us. For years I have been saying that taking a picture is not the same as seeing and understanding. Today, I would like to add to that. Simply seeing is not enough.
Seeing is for doing and learning. When we act upon this world in 3D space and time, we learn, and we learn to see and do better. Nature has created this virtuous cycle of seeing and doing powered by “spatial intelligence.” To illustrate to you what your spatial intelligence is doing constantly, look at this picture. Raise your hand if you feel like you want to do something.
The Emergence of Spatial Intelligence in AI
In the last split second, your brain looked at the geometry of this glass, its place in 3D space, its relationship with the table, the cat and everything else. And you can predict what’s going to happen next. The urge to act is innate to all beings with spatial intelligence, which links perception with action. And if we want to advance AI beyond its current capabilities, we want more than AI that can see and talk.
We want AI that can do. Indeed, we’re making exciting progress. The recent milestones in spatial intelligence are teaching computers to see, learn, do, and learn to see and do better. This is not easy.
It took nature millions of years to evolve spatial intelligence, which depends on the eye taking in light, projecting 2D images onto the retina, and the brain translating these data into 3D information. Only recently has a group of researchers from Google been able to develop an algorithm that takes a bunch of photos and translates them into 3D space, like the examples we’re showing here. My student and our collaborators have taken a step further and created an algorithm that takes one input image and turns it into a 3D shape. Here are more examples.
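As a rough illustration of how a “bunch of photos” can become 3D, here is a minimal sketch of the volume-rendering step at the heart of NeRF-style photos-to-3D methods: colors and densities sampled along a camera ray are alpha-composited into a pixel, and comparing rendered pixels against the input photos is what drives the 3D reconstruction. This simplified formulation is an assumption for illustration, not the specific algorithm from the Google researchers or from my lab.

```python
# Illustrative sketch of NeRF-style volume rendering along a single camera ray.
# A simplified formulation for intuition, not any particular published system.
import numpy as np

def composite_ray(colors, densities, deltas):
    """Alpha-composite samples along one ray into a single pixel color.

    colors:    (N, 3) RGB predicted at each sample point along the ray
    densities: (N,)   volume density (sigma) at each sample point
    deltas:    (N,)   distance between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)                        # opacity of each segment
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))    # light surviving to each sample
    weights = trans * alphas                                          # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                    # final pixel color

# Tiny usage example with made-up samples along one ray.
rgb = composite_ray(
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    densities=np.array([0.1, 2.0, 5.0]),
    deltas=np.array([0.5, 0.5, 0.5]),
)
print(rgb)
```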
From Text to 3D Scene Generation
Recall, we talked about computer programs that can take a human sentence and turn it into videos.