Read the full transcript of Geoffrey Hinton’s Nobel Prize lecture “Boltzmann Machines” on 8 December 2024 at the Aula Magna, Stockholm University. He was introduced by Professor Ellen Moons, Chair of the Nobel Committee for Physics. The Nobel Prize in Physics 2024 was awarded jointly to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks”.
TRANSCRIPT:
Introduction by Professor Ellen Moons
[PROFESSOR ELLEN MOONS:] It is now my pleasure and great honor to introduce our second speaker, Geoffrey Hinton. Geoffrey Hinton was born in London, UK, in 1947. He received a bachelor’s degree in experimental psychology from Cambridge University in 1970.
In 1978 he was awarded a PhD in artificial intelligence from the University of Edinburgh. After postdoctoral research, he worked for five years as a faculty member in computer science at Carnegie Mellon University in Pittsburgh. In 1987 he was appointed professor of computer science at the University of Toronto, Canada, where he is presently an emeritus professor.
Between 2013 and 2023, he shared his time between academic research and Google Brain. Please join me in welcoming Geoffrey Hinton to the stage to tell us about the developments that led to this year’s Nobel Prize in physics.
Understanding Hopfield Networks
[GEOFFREY HINTON:] So today I’m going to do something very foolish. I’m going to try and describe a complicated technical idea for a general audience without using any equations.
First I have to explain Hopfield nets, and I’m going to explain the version with binary neurons that have states of one or zero. On the right there you’ll see a little Hopfield net, and the most important thing is the neurons have symmetrically weighted connections between them.
The global state of a whole network is called a configuration, just so we seem a bit like physics, and each configuration has a goodness.
These networks will settle to energy minima. The whole point of a Hopfield net is that each neuron can locally compute what it needs to do in order to reduce the energy, where energy is badness. So if the total weighted input coming from other active neurons is positive, the neuron should turn on. If the total weighted input coming from other active neurons is negative, it should turn off.
If each neuron just keeps using that rule, and we pick them at random and keep applying that rule, we will eventually settle to an energy minimum. So the configuration on the right there is actually an energy minimum. It has an energy of minus four, and if you take any neuron there, the ones that are on want to stay on. They get total positive input. The ones that are off want to stay off. They get total negative input.
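As an editorial illustration, the local decision rule described here can be sketched in a few lines of Python; the three-neuron weight matrix is a made-up example, not the net on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric weights for a hypothetical three-neuron Hopfield net
# (zero diagonal: no self-connections).
W = np.array([[0., 2., -1.],
              [2., 0., 1.],
              [-1., 1., 0.]])

def energy(W, s):
    # E = -1/2 * sum_ij w_ij s_i s_j (no bias terms in this sketch)
    return -0.5 * s @ W @ s

s = rng.integers(0, 2, size=3)  # random binary starting configuration
e_start = energy(W, s)

# Repeatedly pick a neuron at random and apply the deterministic rule:
# turn on if the total input from active neurons is positive, else off.
for _ in range(100):
    i = rng.integers(0, 3)
    s[i] = 1 if W[i] @ s > 0 else 0

# The rule never increases the energy, so s has settled to a minimum:
# every "on" neuron gets positive input and every "off" neuron does not.
```

Because the weights are symmetric and each update can only lower (or leave unchanged) the energy, the loop is guaranteed to reach a fixed point.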
Multiple Energy Minima
But it’s not the only energy minimum. A Hopfield net can have many energy minima, and where it ends up depends on where you started, and also on the sequence of random decisions about which neuron to update.
That’s a better energy minimum. Now we’ve turned on the triangle of units on the right, and that’s got a goodness of three plus three minus one is five, and so an energy minus five, that’s a better minimum.
Now Hopfield proposed that a good way to use such networks is to make the energy minima correspond to memories, and then using that binary decision rule about whether you should turn a neuron on or off, that can clean up incomplete memories. So you start with a partial memory, and then you just keep applying this decision rule, and it will clean it up. So settling to energy minima, when they represent memories, is a way of having a content addressable memory.
You can access an item in the memory by just turning on some of the item, and then using this rule, and it’ll fill it out. Terry Sejnowski and I, Terry was a student of Hopfield’s, proposed a different use for these kinds of nets. Instead of using them to store memories, we could use them to construct interpretations of sensory input.
Constructing Interpretations
The idea is you have a net, it has both visible neurons and hidden neurons. The visible neurons are where you show it a sensory input, maybe a binary image. The hidden neurons are where it constructs the interpretation of that sensory input, and the energy of a configuration of the network represents the badness of the interpretation, so we want low energy interpretations.
I’m going to give you a concrete example. Consider that ambiguous line drawing at the top. People have two ways of seeing that. There’s interpretation one, which is normally what you see first. There’s another interpretation, and when you see it as a convex object, that’s clearly a different 3D interpretation of the same 2D line drawing.
So could we make one of these networks come up with two different interpretations of the same line drawing? Well, we need to start by thinking what a line in an image tells you about 3D edges.
From 2D Lines to 3D Interpretations
That green line is the image plane. Imagine you’re looking through a window and you’re drawing the edges in the scene out there in the world on the window. So that little black line is a line in the image, and the two red lines are the lines of sight that come from your eye through the ends of that line.
And if you ask, well, what edge in the world could have caused that? Well, there’s many edges that could have caused it. There’s one edge that could have caused that 2D line, but there’s another one, and there’s another one, and there’s another one. All of these edges will cause the same line in the image.
So the problem of vision is to go backwards from the single line in the image to figure out which of these edges is really out there. You can only see one of them at a time if objects are opaque because they all get in each other’s way. So you know that that line in the image has to depict one of these edges, but you don’t know which one.
Building a Network for Visual Interpretation
We could build a network where we started off by turning the lines into activations of line neurons. So let’s suppose we already have that. We have a large number of neurons to represent lines in the image, and we turn on just a few of them to represent the lines in this particular image.
Now, each of those lines could depict a number of different 3D edges. So what we do is we connect that line neuron to a whole bunch of 3D edge neurons with excitatory connections. Those are the green ones. But we know we can only see one of those at a time, so we make those edge neurons inhibit each other.
So now we’ve captured a lot about the sort of optics of perception. We do that for all of our line neurons. And now the question is, which of those edge neurons should we turn on? For that, we need more information.
Visual Interpretation Principles
And there’s certain principles we use in interpreting images. If you see two lines in an image, you assume that if they join in the image, they join in depth where they join. That is, they’re at the same depth where the two lines join in the image.
So we can put in extra connections for that. We could put in a connection between every pair of 3D edge neurons that join in depth at the point where they have the same end. We could put in a stronger connection if they join at right angles. We really like to see images in which things join at right angles.
So we put in a whole bunch of connections like that. And now what we hope is, if we set the connection strengths right, that we’ve got a network which has two alternative states it can settle to, corresponding to those two alternative interpretations of the Necker cube.
Two Main Problems
This gives rise to two main problems. The first problem, if we’re going to use hidden neurons to come up with interpretations of images represented in the states of the visible neurons, is the search issue. How do we avoid getting trapped in local optima? We might settle to a rather poor interpretation and not be able to jump out of it to a better interpretation.
And the second problem is learning. I sort of implied I’d put in all those connections by hand, but we’d like a neural network to put in all those connections.
The Search Problem and Noisy Neurons
The search problem we solve, more or less, by making the neurons noisy. So if you have deterministic neurons, like in a standard Hopfield net, if the system settled into one energy minimum, like A, so the ball there is the configuration of the whole system, it can’t get from A to B because the decision rule for the neurons only allows things to go downhill in energy.
And the graph on the right is the decision rule. If the input’s positive, turn on. If the input’s negative, turn off. We would like to be able to get from A to B, but that means we have to go uphill in energy.
The solution to that is to have noisy neurons, stochastic binary neurons. They still only have binary states. Their states are either 1 or 0, but they’re probabilistic. If they get a big positive input, they almost always turn on. With a big negative input, they almost always turn off.
But if the input is soft, if it’s somewhere near 0, then they behave probabilistically. If it’s positive, they usually turn on but occasionally turn off. And if it’s a small negative input, they usually turn off but occasionally turn on. But they don’t have real values. They’re always binary, but they make just these probabilistic decisions.
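The stochastic binary neuron being described turns on with a probability given by the logistic function of its total input; here is a minimal sketch, in which the temperature parameter and the example input are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_unit(total_input, temperature=1.0):
    """Stochastic binary neuron: on with probability sigmoid(input / T).

    A big positive input means "almost always on", a big negative input
    "almost always off", and an input near zero a near coin flip.
    """
    p_on = 1.0 / (1.0 + np.exp(-total_input / temperature))
    return int(rng.random() < p_on)

# With a big positive input the neuron turns on nearly every time;
# its state is still strictly binary, never a real value in between.
samples = [stochastic_unit(6.0) for _ in range(1000)]
```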
Interpreting Images with Stochastic Networks
And so now, if we want to interpret a binary image using these hidden neurons, what we do is we clamp the binary image on the visible units. That specifies what the input is. We start the hidden neurons off in random states, and then we pick a hidden neuron at random and look at the total input it’s getting from the other active neurons.
And if it gets a big positive total input, we almost certainly turn it on, but if it’s only a small positive input, we might just turn it off. So we keep applying this rule: turn a neuron on if it gets a big positive input, turn it off if it gets a big negative input, and if the input is soft, make a probabilistic decision.
And if we go around and we keep picking hidden neurons and doing that, the system will eventually approach what’s called thermal equilibrium. That’s a difficult concept for non-physicists and I’ll explain it later.
Once it’s reached thermal equilibrium, the states of the hidden neurons are then an interpretation of that input. So in the case of that line drawing, you’d hopefully have one 3D edge neuron turned on for each line neuron, and you’d get one of those two interpretations of the Necker cube. And what we hope is that the low energy interpretations will be good interpretations of the data.
Understanding Thermal Equilibrium
So for this line drawing, if we could learn the right weights between the 2D line neurons and the 3D edge neurons and learn the right weights between the 3D edge neurons, then hopefully the low energy states of the network would correspond to good interpretations, namely seeing 3D rectangular objects.
Thermal equilibrium. It’s not what you first expect, which is that the system is settled to a stable state. What’s stabilized is not the state of the system. What’s stabilized is a far more abstract thing that’s hard to think about. It’s the probability distribution over configurations of the system.
That’s very hard for a normal person to think about. It settles to a particular distribution called the Boltzmann distribution. And in the Boltzmann distribution, the probability, once it’s settled to thermal equilibrium, of finding the system in a particular configuration is determined solely by the energy of that configuration, and you have more probability of finding it in lower energy configurations.
So thermal equilibrium, the good states, the low energy states, are more probable than the bad states.
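Concretely, the Boltzmann distribution assigns each configuration a probability proportional to exp(−E/T); a small sketch with hypothetical energies:

```python
import numpy as np

# Boltzmann distribution over a hypothetical set of configuration
# energies: p(c) is proportional to exp(-E(c) / T), so lower-energy
# configurations are exponentially more probable.
energies = np.array([-5.0, -4.0, -1.0, 0.0])
T = 1.0

unnormalized = np.exp(-energies / T)
p = unnormalized / unnormalized.sum()

# The E = -5 configuration is the most probable, and each unit of extra
# energy (at T = 1) costs a factor of e in probability.
```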
The Ensemble Approach to Understanding
Now to think about thermal equilibrium, there’s a trick physicists use, and it allows ordinary people to understand this concept, hopefully. You just imagine a very large ensemble of identical networks, gazillions of them.
You have these gazillion Hopfield networks. They all have exactly the same weights, so they’re the same system essentially, but you start them all off in different random states, and they all make their own independent random decisions, and there’ll be a certain fraction of the systems that have each configuration.
And to begin with that fraction will just depend on how you started them off. Maybe you start them off randomly, so all configurations are equally likely. And in this huge ensemble, you’ll get equal numbers of systems in every possible configuration.
But then you start running this algorithm of updating neurons so that they tend to lower the energy but occasionally go uphill, and gradually what will happen is the fraction of the systems in any one configuration will stabilize.
So any one system will be jumping between configurations, but the fraction of all the systems in a particular configuration will be stable. So one system may leave a configuration, but other systems will go into that configuration. This is called detailed balance, and the fraction of systems will stay stable.
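The ensemble picture can be simulated directly: run many identical two-neuron stochastic networks (the single weight, temperature, and sweep count here are assumptions of this sketch) and watch the fraction of networks in each configuration stabilize near the Boltzmann values.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two stochastic binary neurons joined by one positive weight w.
# Energy: E(s1, s2) = -w * s1 * s2, so (1, 1) has energy -1, the rest 0.
w, T = 1.0, 1.0
n_nets = 10000          # the "gazillion" identical networks
s = rng.integers(0, 2, size=(n_nets, 2))  # random starting configurations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Run every network for 20 sweeps of stochastic updates.
for _ in range(20):
    for i in (0, 1):
        p_on = sigmoid(w * s[:, 1 - i] / T)  # input from the other neuron
        s[:, i] = rng.random(n_nets) < p_on

# At equilibrium the fraction of networks in each configuration follows
# the Boltzmann distribution: p(1,1) = e / (3 + e), about 0.475.
frac_11 = np.mean((s[:, 0] == 1) & (s[:, 1] == 1))
```

Each network keeps jumping between configurations, but the fraction of the ensemble in any one configuration stays roughly constant, which is exactly the detailed-balance picture described above.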
Generating Images
That’s it for the physics. So let’s imagine generating an image now, not interpreting an image, but generating an image. To generate an image, you start by picking random states for all of the hidden neurons and the visible neurons.
Then you pick a hidden or visible neuron, and you update its state using the usual stochastic rule. If it’s got lots of positive input, probably turn it on. Lots of negative input, probably turn it off. If it’s soft, it behaves a bit stochastically.
And you keep doing that. And if you keep doing that repeatedly until the system approaches thermal equilibrium, then you look at the states of the visible units, and that’s now an image generated by this network from the distribution it believes in.
The Boltzmann Distribution and Machine Learning
The Boltzmann distribution operates on the principle that low energy configurations are much more likely than high energy configurations. It encompasses many possible alternative images, allowing you to select one of its beliefs by running this process. The aim of learning in a Boltzmann machine is to ensure that when the network generates images—essentially dreaming or randomly imagining things—those images resemble the ones it perceives when processing real images.
If we can achieve this alignment, the states of the hidden neurons will effectively capture the underlying causes of the image. In other words, learning the weights in the network is equivalent to figuring out how to use those hidden neurons so that the network will generate images that look like real images. While this initially seemed like an extremely difficult problem, Terry Sejnowski and I took an outrageously optimistic approach.
The question was whether we could start with a neural net—a stochastic Hopfield net with many hidden neurons and random weights—and simply show it lots of images. Our hope, which seemed ridiculous at the time, was that upon perceiving lots of real images, the network would create all the necessary connections between hidden units and visible units, weighting those connections correctly to develop sensible interpretations of images in terms of causes like 3D edges that join at right angles.
The Simple Learning Algorithm
The amazing thing about Boltzmann machines is that there’s a very simple learning algorithm that accomplishes this complex task. This was discovered by Terry Sejnowski and me in 1983, and the algorithm consists of two phases:
1. The wake phase: When the network is presented with images, you clamp an image on the visible units and let the hidden units settle to thermal equilibrium. Once equilibrium is reached, for every pair of connected neurons (either two hidden units or a visible and hidden unit), if they’re both on, you add a small amount to the weight between them.
2. The sleep phase: The network essentially dreams by settling to thermal equilibrium by updating all neurons (hidden and visible). Once thermal equilibrium is reached, for every pair of connected neurons that are both on, you subtract a small amount from the weight between them.
This simple learning algorithm changes the weights to increase the probability that the images generated during dreaming will resemble the images seen during perception. For statisticians and machine learning experts, what this algorithm does is follow the gradient of the log-likelihood in expectation, making it more likely that the network will generate the kinds of images it sees when awake.
What the learning is doing is lowering the energy of configurations the network derives from real data during the wake phase, and raising the energy of configurations during the sleep phase. You’re essentially teaching it to believe in what you see when awake and disbelieve what you dream when asleep.
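The wake-minus-sleep weight update described above can be sketched as follows; the sampled configurations are assumed to come from some equilibrium sampler not shown here.

```python
import numpy as np

def boltzmann_update(W, wake_states, sleep_states, lr=0.01):
    """One step of the two-phase rule: rows of `wake_states` are binary
    configurations sampled at equilibrium with data clamped; rows of
    `sleep_states` are configurations sampled while free-running."""
    # <s_i s_j> correlation in each phase, averaged over the samples.
    wake_corr = wake_states.T @ wake_states / len(wake_states)
    sleep_corr = sleep_states.T @ sleep_states / len(sleep_states)
    # Strengthen weights between units that co-fire when awake, weaken
    # them between units that co-fire while dreaming; no self-connections.
    dW = lr * (wake_corr - sleep_corr)
    np.fill_diagonal(dW, 0.0)
    return W + dW
```

Note that the update for one weight needs only the two co-firing statistics for its own pair of neurons, which is the locality property emphasized in the next section.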
The Correlation Difference
The process of settling to thermal equilibrium achieves something remarkable: everything one weight in the network needs to know about all other weights shows up in the difference between two correlations. It appears in the difference between how often two neurons are on together when the network observes data versus how often they’re on together when the network is dreaming.
Somehow these correlations measured in these two situations tell a weight everything it needs to know about all other weights. This is surprising because in an algorithm like backpropagation, which all neural nets now use, you require a backward pass to convey information about other weights, and that backward pass behaves very differently from the forward pass.
In the forward pass, you’re communicating activities of neurons to later layers of neurons. In the backward pass, you’re conveying sensitivities—a different kind of quantity altogether.
This makes backpropagation rather implausible as a theory of how the brain works. When Terry and I developed this learning procedure for Boltzmann machines, we were completely convinced it must be how the brain works. We decided we were going to get the Nobel Prize in physiology or medicine. It never occurred to us that if it wasn’t how the brain works, we could get the Nobel Prize in physics.
The Speed Problem
There was only one problem: settling to thermal equilibrium is a very slow process for very big networks with large weights. If the weights are small, you can do it quickly, but when the weights grow after learning, it becomes very slow.
Boltzmann machines are a wonderful romantic idea with a beautifully simple learning algorithm doing something very complicated. They construct networks of hidden units that interpret data using a very simple algorithm. Unfortunately, they’re just much too slow.
Restricted Boltzmann Machines
Seventeen years later, I realized that if you restrict Boltzmann machines by eliminating connections between hidden units, you can get a much faster learning algorithm. With no connections between hidden neurons, the wake phase becomes very simple: you clamp an input on the visible units, update all hidden neurons in parallel once, and you’ve reached thermal equilibrium in one step.
However, the sleep phase still presented a problem, requiring many iterations to reach thermal equilibrium. Fortunately, there’s a shortcut called “contrastive divergence” that works well in practice:
1. Put data on the visible units
2. Update all hidden neurons in parallel (reaching equilibrium with the data)
3. Update all visible units to get a “reconstruction”
4. Update all hidden units again
5. Stop
The learning algorithm measures how often neurons are on together when showing data versus when showing reconstructions, and changes weights proportionally to that difference. This approach is much faster and made Boltzmann machines finally practical.
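A minimal sketch of one contrastive-divergence (CD-1) step for a restricted Boltzmann machine, following the five steps above; biases are omitted and the learning rate is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a small RBM.

    W: (n_visible, n_hidden) weights; v0: batch of binary data rows.
    """
    # Wake: with the data clamped, one parallel update of the hidden
    # units reaches equilibrium, since they are not connected to each other.
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Reconstruction: visible units from hidden, then hidden once more.
    v1 = (rng.random(v0.shape) < sigmoid(h0 @ W.T)).astype(float)
    h1_prob = sigmoid(v1 @ W)
    # Weight change: data correlations minus reconstruction correlations.
    return W + lr * (v0.T @ h0_prob - v1.T @ h1_prob) / len(v0)
```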
Real-World Applications
Netflix actually used restricted Boltzmann machines (RBMs) combined with other methods to recommend movies based on user preferences. This combination won the Netflix competition for predicting user preferences.
However, with just hidden neurons that aren’t connected to each other, you can’t build layers of feature detectors needed for recognizing objects in images or words in speech. But there’s a workaround: stacking these restricted Boltzmann machines.
Stacking RBMs
You can stack RBMs by:

1. Training an RBM on data
2. Treating the hidden activity patterns as data for another RBM
3. Continuing this process to capture increasingly complex correlations
After stacking these Boltzmann machines, you can treat them as a feedforward network, ignoring the symmetrical connections and using connections in one direction. This creates a hierarchy of features:
- First hidden layer: features capturing correlations in raw data
- Second hidden layer: features capturing correlations among first-layer features
- And so on, creating increasingly abstract representations
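The greedy layer-wise stacking can be sketched as follows; the `train_rbm` helper is a stand-in CD-1 trainer invented for this illustration, and the data and layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Stand-in single-RBM trainer (CD-1, biases omitted) for this sketch."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        h_prob = sigmoid(data @ W)
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        v1 = (rng.random(data.shape) < sigmoid(h @ W.T)).astype(float)
        W += lr * (data.T @ h_prob - v1.T @ sigmoid(v1 @ W)) / len(data)
    return W

# Greedy layer-wise stacking: each RBM's hidden activities become
# the "data" for the next RBM, as in the steps above.
data = (rng.random((50, 8)) < 0.5).astype(float)
weights = []
layer_input = data
for n_hidden in [6, 4]:
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)  # features for the next layer

# `weights` can now seed a feedforward net, read bottom-up only.
```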
Once stacked, you can add a final layer for supervised learning (like classifying cats and dogs). Two beautiful things happen:
1. The network learns much faster than with random initialization because it has already learned sensible features for modeling structure in the data.
2. The networks generalize much better because they’ve done most of their learning without using labels, extracting information from correlations in the data rather than from labels.
Impact on Speech Recognition
Between 2006 and 2011, researchers in my lab, Yoshua Bengio’s lab, and Yann LeCun’s lab used stacks of RBMs to pre-train feedforward neural networks, followed by backpropagation. In 2009, my students George Dahl and Abdel-rahman Mohamed showed this technique worked better than existing methods for recognizing phoneme fragments in speech.
This changed the speech recognition community. My graduate students joined various leading speech groups, and in 2012, systems based on stacked restricted Boltzmann machines went into production at Google, significantly improving speech recognition on Android devices.
Unfortunately for Boltzmann machines, once we demonstrated that deep neural networks worked well when pre-trained with stacks of RBMs, researchers found other ways to initialize weights and no longer used Boltzmann machines.
The Role of Restricted Boltzmann Machines
But if you’re a chemist, you know that enzymes are useful things. And even though RBMs are no longer used, they allowed us to make the transition from thinking that deep neural networks would never work to seeing that deep neural networks actually could be made to work rather easily if you initialize them this way. Once you’ve made the transition, you don’t need the enzyme anymore.
So think of them as historical enzymes. The idea of using unlearning during sleep, though, to get an algorithm that’s more biologically plausible and avoids the backward pass of backpropagation, I still think there’s a lot of mileage in that idea. And I’m still optimistic that when we do eventually understand how the brain learns, it will turn out to involve using sleep to do unlearning.
So I’m still optimistic. And I think I’m done.
[ELLEN MOONS:] Very nice. Thank you very much. So please join me now in welcoming both laureates on the stage to jointly receive our warmest applause.