🎥 About this video: This video is a Chinese-translated version of Artem Kirsanov's free English course, intended to help Chinese-speaking viewers better understand the content. It is for learning purposes only and non-commercial. The copyright of the original video belongs to Artem Kirsanov.
🌐 Original video link: A Universal Theory of Brain Function - YouTube
📋 Disclaimer: This translation strives for accuracy, but in case of any omissions, please refer to the original video.
🌟 Support the creator: Please visit Artem Kirsanov's YouTube channel for more great content and support the original author's work!
# Introduction
This video was brought to you by Squarespace.
Take a look at this mask. It looks like a convex face protruding outwards. Now, let's rotate it.
At this point, you know what you expect to see—a concave mask protruding inwards. However, somehow you get a strong sense that something is off, and the mask sure looks like it got warped to be convex again, even though part of you knows that it's not the case.
But why is your brain so stubbornly convinced that the mask is protruding outward? The answer actually reveals something remarkable about the nervous system.
What if I told you that everything you're seeing, hearing, and feeling at this moment isn't actually reality—that it is a controlled hallucination, your brain constructing and testing hypotheses about what's out there?
There is a powerful theory in neuroscience called the _free energy principle_, which proposes something mind-bending. According to this framework, your brain isn't passively receiving information about the world. It actively generates predictions about what should be out there and then uses sensory input merely to check whether those predictions are right.
Today, we're going to explore this fascinating theory. We'll discover why evolution turned our brains into prediction machines, how this helps us survive in an uncertain world, and why sometimes, like in this mask illusion, our brain's predictions can override what's actually in front of us.
---
# Role of World Models
But first, let's go back to the beginning.
To understand why our brains work this way, we need to look at the fundamental problem they evolved to solve. The main purpose of the brain, like any trait favored by evolution, is to increase the chances of survival and reproduction.
To achieve this, organisms need to react to stimuli appropriately. For instance, if you sense harmful chemicals, you need to swim away from them. But such simple reactions can be accomplished through basic biochemistry—no complex nervous system required. In fact, you don't even need to be multicellular for this. Just a bag of liquid with a few chemical reactions would work.
However, as organisms began to inhabit more complex environments, they faced a challenge: the outside world is noisy, ambiguous, and often provides only partial information.
For instance, let's say that over the course of your lifetime, you learned that tigers mean danger and should be avoided. To your brain, a tiger is essentially any pattern on the retina that looks similar to this.
Now, suppose one day your retina registers a pattern of activity that looks slightly different. If you had a primitive nervous system that determined whether something is a tiger or not by pure pattern matching, the similarity might be below the threshold, and you would get eaten.
This is where brains come in. They evolved not just as reaction machines but as sophisticated model builders that try to explain sensory inputs by inferring their hidden causes.
In this case, your brain might have an internal model of what a tiger is, how it looks, and what will happen if you get caught. Importantly, the brain also knows that in the real world, objects may be occluded by other objects.
Thus, the brain is capable of combining those two facts together and coming up with an explanation: what you're seeing isn't a totally novel object that looks like half a tiger, but rather an actual full tiger occluded by a tree—so it’s better to run.
This ability to fill in the gaps and come up with plausible explanations for sensory data is at the heart of the brain's evolutionary success.
---
# Free Energy as a Tradeoff Between Accuracy and Complexity
In essence, you can think of your brain like a judge, weighing evidence on a scale.
On one side, there is what your senses are telling you—the raw data coming in through your eyes, ears, and other modalities. On the other side, there is what you already know about how the world works—your prior beliefs built up through evolution and experience.
Your brain is constantly trying to find the perfect balance between these two forces. When they are out of balance, it creates a kind of tension or energy in the brain, which it wants to minimize. This tension is what neuroscientists call _variational free energy_, or just _free energy_ for short.
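For the mathematically inclined, this quantity has a standard form in the free energy literature, a tradeoff between complexity and accuracy, where $x$ is the sensory data, $z$ the hidden causes, $q(z)$ the brain's current explanation, and $p(z)$ and $p(x \mid z)$ its prior and generative model:

$$
F[q] \;=\; \underbrace{D_{\mathrm{KL}}\!\big[\,q(z)\,\|\,p(z)\,\big]}_{\text{complexity: departure from prior beliefs}} \;-\; \underbrace{\mathbb{E}_{q(z)}\!\big[\log p(x \mid z)\big]}_{\text{accuracy: fit to the sensory data}}
$$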
Let’s go back to our tiger example. When your senses show you half a tiger pattern, that creates a puzzle. One explanation might be that you're seeing a strange half-tiger creature, but that explanation would have very high free energy—it conflicts strongly with your prior knowledge that tigers are whole animals and symmetric.
The other explanation—that it is a complete tiger partially hidden behind something—has a much lower free energy. It fits both what you’re seeing and what you know about how the world works.
In short, by minimizing free energy, brains adapt to the specific niches of their environment.
---
# Generative Model
But how does the brain actually implement such sophisticated explanations?
The key challenge is that sensory data, like the pattern of light on your retina, consists of thousands of neurons firing in complex patterns. The brain needs some way to compress this vast amount of information into a manageable form.
Evolution found an elegant solution based on finding commonalities and hidden structure in the data. Alongside neurons that directly correspond to sensory inputs, the brain evolved _hidden_ or _latent_ neurons: neurons that do not directly connect to the outside world.
These latent neurons learn to represent meaningful features or causes at different levels of abstraction.
- At a high level, some neurons encode abstract causes like _tiger_ or _object occlusion_.
- These connect to intermediate neurons that represent features like _stripes_ or _fur texture_.
- In turn, these connect to neurons encoding more basic elements like _edges_ or _color_.
Because latent neurons do not directly interface with the outside world, there is no absolute _ground truth_ for what their activity should be. In fact, the brain is free to choose whatever latent representations it wants.
So, how can we determine which world model is the best?
While we can’t directly verify the latents, we can verify their consequences. A good set of latent causes should be able to explain the patterns we observe in our sensory neurons. _Explain_ here means that they should contain enough information to _reconstruct_ the original sensory data from this compressed representation.
Here’s an intuitive way to think about this:
Imagine a 3D scene in a computer graphics program like Blender. The scene might have just a few adjustable parameters—sliders controlling the rotation of an object, the position of the light source, and the object's color.
When you render the scene, you get a high-resolution image—perhaps 1000 by 1000 pixels. That’s a million variables, each with its own color value. Yet, all possible images you could render from the scene are controlled by just those three slider positions. These few parameters contain all the information needed to reconstruct the scene fully.
For example, if you wanted to share one of these images with a friend, you wouldn’t need to send them all the million pixels. Instead, you could just send three numbers—the positions of each slider—and if they have the same scene setup, they could generate the identical image.
This is similar to how latent neurons encode abstract high-level features of the observed data. The _rendering process_—the complex computation that transforms slider positions into rendered pixels—corresponds to a _generative network_ or _generative model_ inside the brain.
You can think of this _generative model_ as the connection weights between latent and sensory neurons, along with additional neural circuits that reconstruct and _uncompress_ the latent representation.
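To make this concrete, here is a minimal Python sketch of the slider analogy (a hypothetical toy renderer invented for illustration, not real Blender code): three numbers deterministically produce a million-pixel image, so sending just the slider values is enough for someone with the same setup to reproduce every pixel.

```python
import numpy as np

def render(rotation, light_height, hue, size=1000):
    """Toy 'renderer': deterministically maps 3 slider values to a size x size image."""
    rows = np.linspace(0, 1, size)[:, None]   # vertical coordinate of each pixel
    cols = np.linspace(0, 1, size)[None, :]   # horizontal coordinate of each pixel
    # An arbitrary but fixed function of the three parameters:
    shading = np.sin(rotation * cols * 2 * np.pi) * np.exp(-(rows - light_height) ** 2)
    return hue * shading                       # one million pixel values from 3 numbers

sliders = (0.3, 0.8, 0.5)                      # the entire "message" we need to share
image_sender = render(*sliders)                # 1,000,000 pixels on the sender's machine
image_receiver = render(*sliders)              # receiver re-renders from just 3 numbers
print(image_sender.shape, np.allclose(image_sender, image_receiver))  # (1000, 1000) True
```

In this analogy, `render` plays the role of the generative model, and the three slider values play the role of the latent neurons' activity.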
---
# Priors
Imagine you are setting up a scene to match photographs of real objects. You would quickly discover that some slider combinations occur much more frequently than others in the real world. Light sources are usually above objects, not below them, and objects tend to rest on surfaces in stable positions.
Through experience, you will develop an intuitive sense of which parameter combinations are more likely to occur. This is exactly what your brain does.
It learns which patterns of latent neuron activity correspond to real-world situations and are thus more common than others.
These learned probabilities of different causes are what we call _priors_, because they represent your prior beliefs before you take the observed sensory data into account. Priors are crucial for making sense of ambiguous situations.
For example, if you are walking through a city park and catch a glimpse of something orange and striped in your peripheral vision, your brain will favor explanations that are common in this context—perhaps a child's stuffed toy or someone wearing a striped shirt. Even though the sensory data might be consistent with a tiger, your prior belief about how unlikely it is to encounter a tiger in a city park helps you arrive at a more reasonable interpretation.
However, if you were on a safari, the same orange-striped glimpse would likely trigger a very different interpretation because your priors about what is likely in that environment are quite different.
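Here is a hedged toy calculation of that context effect (all probabilities are invented for illustration): the same sensory likelihood, combined with different context-dependent priors, yields very different conclusions.

```python
# Toy Bayesian inference: same evidence, different priors (all numbers are invented).
def posterior_tiger(prior_tiger, likelihood_given_tiger=0.7, likelihood_given_other=0.2):
    """P(tiger | orange-striped glimpse) via Bayes' rule with a single binary alternative."""
    evidence = likelihood_given_tiger * prior_tiger + likelihood_given_other * (1 - prior_tiger)
    return likelihood_given_tiger * prior_tiger / evidence

print(f"city park (prior 0.0001): {posterior_tiger(0.0001):.4f}")  # ~0.0003 -> almost surely not a tiger
print(f"safari    (prior 0.2):    {posterior_tiger(0.2):.4f}")     # ~0.47   -> take it seriously
```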
Hence, the _generative model_ the brain uses to make sense of the outside world has two components:
1. **The prior**, which tells us how likely various causes are.
2. **The generator network**, which can synthesize sensory data for a given cause.
---
# Approximate Inference via Recognition Model
In real life, however, we constantly face the opposite problem—we receive sensory input and need to figure out what caused it. This process is called _inference_, or inferring causes from observations, and it presents significant computational challenges.
Let's return to our _blender analogy_. Imagine you are given just the final image and asked to determine the slider positions that created it. This reverse problem is computationally demanding.
To find the right causes, in general, you would need to try every possible combination of slider positions, render an image from each one, and compare it with your target image. Even with just three sliders, each with 100 possible positions, that results in a million combinations to check.
Your brain faces a similar but far more complex problem. It has millions of latent neurons, each with many possible activity levels—checking every combination would take longer than the age of the universe. And yet, the brain solves this problem nearly instantaneously.
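A rough back-of-the-envelope sketch of why exhaustive search is hopeless (the counts below are illustrative assumptions, not measurements):

```python
import math

# Rough cost of inverting a generative model by brute force (illustrative numbers only).
sliders, positions = 3, 100
print(f"Blender toy scene: {positions ** sliders:,} combinations")      # 1,000,000 renders to check

latent_neurons, activity_levels = 1_000_000, 10                          # a made-up brain-scale guess
exponent = latent_neurons * math.log10(activity_levels)                  # log10 of the search space
print(f"Brain-scale model: ~10^{exponent:,.0f} combinations")           # ~10^1,000,000
# For comparison, the universe is only on the order of 10^17 seconds old.
```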
When you catch a glimpse of an orange-striped pattern, you don't have time to test billions of possible causes. If that pattern really is a tiger, you need to figure it out fast.
So how does the brain manage this seemingly impossible task?
The key idea is that while we can't directly invert the _generative model_ and compute the exact probability of each possible cause given the sensory observation, we can try to find an _approximation_.
The brain has a separate network, called the _recognition model_, which works in the opposite direction—it maps sensory observations to the distribution of possible causes. However, this result is only an approximation, a rough _first guess_ of what causes might explain the sensory observation.
To improve this guess, the brain engages in multiple rounds of interaction between the recognition and generative networks, refining the estimate in a loop.
Crucially, for this system to work, the _recognition_ and _generative_ networks must be **aligned**. They need to "speak the same language of causes." When the recognition network suggests a particular pattern of latent neuron activity as an explanation, the generative network should produce sensory patterns that match what the recognition network has learned to associate with those causes.
This alignment isn't automatic—it must be _learned_ through experience.
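As a hedged sketch of what "speaking the same language of causes" means (linear toy models with invented dimensions, nothing like the brain's real circuitry): an aligned recognition map approximately inverts the generative map, so causes proposed by recognition reproduce the observation when pushed back through generation.

```python
import numpy as np

rng = np.random.default_rng(0)
W_gen = rng.normal(size=(1000, 3))     # generative weights: 3 latent causes -> 1000 "sensory neurons"
W_rec = np.linalg.pinv(W_gen)          # an "aligned" recognition model: it (approximately) inverts generation

true_cause = np.array([0.3, 0.8, 0.5])             # hidden state of the world
observation = W_gen @ true_cause                    # what the senses register (noiseless for simplicity)

first_guess = W_rec @ observation                   # recognition model's fast first guess
reconstruction = W_gen @ first_guess                # generative model's check of that guess
print("recovered causes:", np.round(first_guess, 3))                        # ~ [0.3, 0.8, 0.5]
print("aligned reconstruction error:", np.linalg.norm(reconstruction - observation))

W_bad = rng.normal(size=(3, 1000)) * 0.01           # a recognition model that never learned alignment
bad_reconstruction = W_gen @ (W_bad @ observation)
print("misaligned reconstruction error:", np.linalg.norm(bad_reconstruction - observation))
```

In this linear toy, alignment comes for free from a pseudo-inverse; the point of the principle is that the brain has to learn an analogous correspondence from experience.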
---
# The Dance Between Recognition and Generation
Now that we have seen how the brain uses _recognition_ and _generative_ models to make sense of the world, let's examine how they work together to minimize _free energy_.
When your brain encounters new sensory input, these two models engage in a rapid "dance."
1. The **recognition model** proposes possible explanations.
2. The **generative model** checks how well those explanations match the actual sensory input.
3. If there is a mismatch, that is, if the generative model predicts sensory patterns that don't match the actual input, the brain adjusts the explanation and tries again.
This back-and-forth process continues until the brain finds an explanation that _minimizes free energy_, meaning it satisfies both the incoming sensory data and the brain’s prior beliefs.
This is the essence of **perception**—the brain rapidly adjusts the activity of latent neurons, tweaking its explanations until it finds one that best explains the sensory data. This happens within fractions of a second.
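Here is a hedged numerical sketch of this settling process (a linear generative model and plain gradient descent are simplifying assumptions; real neural dynamics are much richer): starting from a crude first guess, the latent estimate is nudged until prediction error plus deviation from the prior, a stand-in for free energy, stops decreasing.

```python
import numpy as np

rng = np.random.default_rng(1)
W_gen = rng.normal(size=(50, 2))                     # toy generative weights: 2 causes -> 50 sensory channels
prior_mean = np.zeros(2)                              # prior belief: causes are usually near zero
true_cause = np.array([1.5, -0.7])
x = W_gen @ true_cause + 0.1 * rng.normal(size=50)    # noisy sensory input

def free_energy(z):
    prediction_error = np.sum((x - W_gen @ z) ** 2)   # accuracy term: mismatch with the senses
    prior_error = np.sum((z - prior_mean) ** 2)       # complexity term: mismatch with the prior
    return 0.5 * prediction_error + 0.5 * prior_error

z = np.zeros(2)                                       # crude first guess (a recognition model would do better)
lr = 0.01
for step in range(200):                               # perception: rapid settling of latent activity
    grad = -W_gen.T @ (x - W_gen @ z) + (z - prior_mean)
    z -= lr * grad
    if step % 50 == 0:
        print(f"step {step:3d}  free energy {free_energy(z):8.2f}  estimate {np.round(z, 2)}")
print("final estimate:", np.round(z, 2), " true cause:", true_cause)
```

Note that this loop only updates the latent activity `z`; the slower process described next, learning, would also update the generative weights `W_gen` and whatever circuit produces the first guess.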
But there is also a longer-term process at play—**learning**. Over time, the brain refines both models by adjusting the connection weights between neurons:
- The **recognition model** becomes better at making initial guesses.
- The **generative model** improves its ability to predict sensory consequences and builds up better prior expectations of causes.
Even though **perception** and **learning** operate on different timescales, they both serve the same overarching goal—**reducing uncertainty in the world by building optimal models of the environment and finding explanations for sensory data within those models**.
---
# Explanation for Optical Illusion
Now we understand exactly why the brain refuses to see the mask as concave, even when we know that's what it is.
Your brain is facing a dilemma: the pattern of light and shadows on your retina suggests a hollow, inward-protruding shape. But this interpretation would violate one of your brain's strongest prior beliefs—that faces protrude outward.
From the _free energy_ perspective, your brain has two possible explanations:
1. There is a concave face in front of you.
2. There is a normal convex face with somewhat unusual lighting.
The brain chooses the explanation which minimizes total _free energy_. The prior belief about faces being convex is incredibly strong, built from a lifetime of experience. As a result, the brain would rather assume there is something unusual about the lighting than accept the existence of an inwardly protruding face.
The _free energy_ of an explanation that slightly mismatches the sensory evidence is lower than that of one violating the fundamental expectation about faces. And knowing the truth doesn't break the illusion, because it is rooted in evolutionarily older, more conserved circuitry for visual perception. Even the analytical part of the brain cannot override it.
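As a hedged toy comparison (every number below is invented purely to illustrate the logic): even if the sensory data fit the concave interpretation better, an overwhelmingly strong prior makes the convex-face-with-odd-lighting interpretation the lower-free-energy explanation overall.

```python
import math

# Invented probabilities, just to show how a strong prior can dominate the comparison.
hypotheses = {
    #                          P(data | hypothesis), P(hypothesis)  -- both made up
    "concave (hollow) face":  (0.70, 1e-6),   # fits the shading well, but faces are "never" concave
    "convex face, odd light": (0.30, 0.10),   # fits the shading less well, but is highly plausible
}
for name, (likelihood, prior) in hypotheses.items():
    # A crude stand-in for free energy: negative log-joint = -log P(data|h) - log P(h)
    F = -math.log(likelihood) - math.log(prior)
    print(f"{name:24s}  free energy ~ {F:5.2f}")
# The convex interpretation wins despite the worse sensory fit.
```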
---
# Conclusion
Let's put all the pieces together.
At its core, the _free energy principle_ suggests that our brains are essentially _prediction machines_, constantly trying to explain the chaos of incoming sensory information.
They achieve this by having:
- A **generative model** that can synthesize sensory patterns from hypothesized causes.
- A **recognition model** that works in tandem with it to arrive at the best explanation: a compressed representation of sensory patterns that balances observed evidence with prior beliefs.
This balance is quantified by the value of _free energy_, with lower values corresponding to favorable explanations that fit both the incoming sensory data and existing beliefs about what is likely to be observed according to the current world model.
Of course, we've only scratched the surface of this fascinating theory. While I chose to keep this explanation conceptual—focusing on intuitive understanding rather than mathematical formalism—there is another layer of beauty to discover.
The mathematics behind the _free energy principle_, although initially daunting, actually reveals an elegant framework that ties these ideas together.
In future videos, we'll explore this deeper mathematical foundation and see how it connects to modern machine learning, allowing us to build artificial systems that, like our own brains, can **perceive, predict, and develop their own models of the world**.