A Unified Theory of Prediction and Perception in the Brain


Disclaimer

🎥 About this video: This is a Chinese translation of Artem Kirsanov's free English video, made to help Chinese-speaking viewers follow the content. It is for learning purposes only and is non-commercial. Copyright of the original video belongs to Artem Kirsanov.

🌐 Original video: A Universal Theory of Brain Function - YouTube

📋 Disclaimer: The translation strives for accuracy; if anything is unclear or mistaken, please defer to the original video.

🌟 Support the creator: Visit Artem Kirsanov's YouTube channel for more great content and to support the original author's work!


1 Introduction

Video sponsor

This video was brought to you by Squarespace.

The mask illusion

Take a look at this mask. It looks like a face protruding outward. Now let's rotate it. At this point you know what you should expect to see: a mask curving inward. Yet somehow you get a strong sense that something is off, and the mask looks as if it has been "warped" back into a convex shape, even though part of you knows that is not the case.

Question: why the illusion?

But why is your brain so stubbornly convinced that the mask protrudes outward? The answer reveals something remarkable about the nervous system.

What if I told you that everything you are seeing, hearing, and feeling at this moment is not reality, but a controlled hallucination: your brain constantly constructing and testing hypotheses about what is out there?

The free energy principle

There is a highly influential theory in neuroscience called the free energy principle, and what it proposes is rather mind-bending. According to this framework, your brain does not passively receive information about the world. It actively generates predictions about what should be out there, and then uses sensory input merely to check whether those predictions are right.

Exploring the theory

Today we are going to explore this fascinating theory. We will see why evolution turned our brains into "prediction machines", how this helps us survive in an uncertain world, and why sometimes, as with this mask illusion, the brain's predictions can even override the reality right in front of us.

2 The Role of World Models

Back to the beginning

But first, let's go back to the beginning. To understand why our brains work this way, we need to look at the fundamental problem they evolved to solve.

The evolutionary goal

Like any trait favored by evolution, the brain's main purpose is to increase the chances of survival and reproduction.

Reacting to stimuli

To achieve this, organisms must react to stimuli appropriately.

Example: harmful chemicals

For instance, if you sense harmful chemicals, you need to move away from them quickly.

Simple biochemistry suffices

But such simple reactions can be accomplished through basic biochemistry; no complex nervous system is required.

Even single cells can manage it

In fact, you don't even need to be multicellular. A single-celled "bag of liquid" with a few chemical reactions would do.

The perception problem

However, as organisms moved into more complex environments, they faced a challenge: the outside world is noisy, ambiguous, and often provides only partial information.

Example: avoiding tigers

Suppose that over your lifetime you have learned that tigers mean danger and must be avoided. To your brain, a "tiger" is essentially any pattern on the retina that looks sufficiently similar to one.

Now suppose that one day your retina registers a slightly different pattern. If you had only a primitive nervous system that decided "tiger or not" by pure pattern matching, this new pattern might fall below the similarity threshold, the call would be wrong, and you would get eaten.

Brains as model builders

This is where brains come in. The brain is not merely a passive "reaction machine" but a sophisticated model builder that explains sensory inputs by inferring their hidden causes.

The occluded tiger

In this case, your brain might hold an internal model of what a tiger is: what it looks like, and what happens if you get caught. Crucially, the brain also knows that in the real world, objects are sometimes partially occluded by other objects.

The brain can therefore combine these two facts and come up with a sensible explanation: what you are seeing is not some novel creature that looks like half a tiger, but a complete tiger partially hidden behind a tree, so you had better run.

The key to success

This ability to piece together fragmentary information and produce plausible explanations for sensory data is at the heart of the brain's evolutionary success.

3 Free Energy: A Tradeoff Between Accuracy and Complexity

Free energy

In essence, you can think of your brain as a judge, constantly weighing evidence on a scale.

On one side is what your senses are telling you: the raw data coming in through vision, hearing, and your other modalities. On the other side is what you already know about how the world works: prior knowledge built up through evolution and experience.

Your brain is constantly trying to find the ideal balance between these two forces. When they fall out of balance, a kind of tension or energy arises, and the brain works to drive it as low as possible.

Neuroscientists call this tension variational free energy, or simply free energy.

The tiger example revisited

Let's return to the tiger example. When your senses pick up only a half-tiger pattern, that creates a puzzle.

One explanation is that you are seeing a strange "half-tiger" creature, but this explanation has very high free energy: it conflicts sharply with your prior knowledge that tigers are whole, symmetric animals.

The other explanation, that it is a complete tiger which happens to be partially occluded, has much lower free energy. It fits both what you are currently seeing and what you know about how the world works. By minimizing free energy, the brain helps you survive and adapt within your particular environment.
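To make the bookkeeping concrete, here is a minimal numerical sketch, not something from the original video: each hypothesis is scored by how surprising it is under the prior plus how poorly it predicts the half-tiger pattern, and all of the probabilities are invented purely for illustration.

```python
import math

def free_energy_score(prior_prob: float, likelihood: float) -> float:
    """Toy stand-in for free energy: surprise under the prior (complexity)
    plus surprise of the data given the hypothesis (inaccuracy). Lower is better."""
    complexity = -math.log(prior_prob)   # how much the explanation strains prior beliefs
    inaccuracy = -math.log(likelihood)   # how badly it predicts the sensory pattern
    return complexity + inaccuracy

# Hypothetical numbers for the half-tiger retinal pattern.
hypotheses = {
    "strange half-tiger creature": free_energy_score(prior_prob=1e-6, likelihood=0.90),
    "whole tiger behind a tree":   free_energy_score(prior_prob=1e-2, likelihood=0.85),
}

for name, f in sorted(hypotheses.items(), key=lambda kv: kv[1]):
    print(f"{name}: free energy = {f:.2f}")
# The occlusion explanation wins: it fits the data almost as well
# while being far less surprising under the prior.
```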

Sponsor segment (feel free to skip)

If you want to carve out your own corner of the internet, today's sponsor, Squarespace, is exactly what you need. Squarespace is an all-in-one platform that lets you build a professional website and get it online quickly. Creating a website can be intimidating, especially when you are staring at a blank HTML file with no idea where to start. That is where Squarespace's design intelligence comes in. Their suite of AI tools can create a fully customized website based on your type of business and brand preferences, complete with images and content that match your taste, giving you a great starting point. From there you have full creative control: customize colors and fonts, use the intuitive drag-and-drop interface, and even add animations to make the site stand out. But Squarespace is more than a website builder; it is a complete business solution. Whether you want to create online courses, launch email campaigns, or accept payments in multiple ways, it provides all the tools you need in one place. Visit squarespace.com for a free trial, and when you are ready to launch, head there to get 10% off your first purchase of a website or domain.

4 The Generative Model

Question: how is this implemented?

But how does the brain actually implement such sophisticated explanations?

Neural firing patterns

The key challenge is that sensory information, such as the pattern of light on your retina, drives thousands of neurons firing in complex patterns.

Compressing information

The brain needs some way to compress this vast amount of information into a manageable form. By finding commonalities and hidden structure in the data, evolution arrived at an elegant solution.

Hidden (latent) neurons

Alongside the neurons that correspond directly to sensory inputs, the brain evolved hidden neurons, also called latent neurons, which do not connect directly to the outside world.

These latent neurons learn to represent meaningful features or causes at different levels of abstraction.

  • At a high level, some neurons encode abstract causes such as tiger or object occlusion.

  • These connect to intermediate neurons representing features such as stripes or fur texture.

  • These, in turn, connect to neurons encoding more basic elements such as edges or color.

Because latent neurons do not directly face the outside world, there is no absolute "ground truth" for what their activity should be.

In other words, the brain is free to choose whatever latent representations it wants.

Question: which world model is best?

So how can we determine which "world model" is the best?

Reconstruction as the test

Although we cannot directly verify the latents themselves, we can check their consequences. A good set of latent causes should be able to explain the patterns we observe in our sensory neurons. "Explain" here means that the latents should contain enough information to reconstruct the original sensory data from this compressed representation.

The Blender analogy

Here is an intuitive way to think about it. Imagine creating a scene in a 3D graphics program like Blender. The scene might have just a few adjustable parameters: a slider controlling the object's rotation, the position of the light source, and the object's color. When you render the scene, you get a high-resolution image, perhaps 1000 by 1000 pixels. That is a million variables, each pixel with its own color value. Yet every image you could possibly render from this scene is determined by those few slider positions. Those few parameters contain all the information needed to reconstruct the scene completely.

For example, if you wanted to share one of the rendered images with a friend, you would not need to send the million pixels. You could just tell them the values of the three sliders, and as long as their scene is set up the same way as yours, they could generate exactly the same image.

This is analogous to how latent neurons encode abstract, high-level features of the observed data.

The generative model

The "rendering process", the complex computation that turns slider positions into rendered pixels, corresponds in the brain to what is called a generative network or generative model.

You can think of this generative model as the connection weights between latent and sensory neurons, together with the additional neural circuitry that reconstructs and "decompresses" the latent representation.
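As a sketch of the compression idea in the Blender analogy, and not a claim about actual neural circuitry: the toy render function below plays the role of the generative model, turning three made-up "slider" latents into a million-pixel image that can be reconstructed exactly from those three numbers.

```python
import numpy as np

def render(rotation: float, light_x: float, brightness: float, size: int = 1000) -> np.ndarray:
    """Toy 'generative model': map three latent sliders to a size x size image.
    The exact function is arbitrary; what matters is that a million pixel values
    are fully determined by three numbers."""
    ys, xs = np.mgrid[0:size, 0:size] / size
    pattern = np.sin(10 * (xs * np.cos(rotation) + ys * np.sin(rotation)))  # oriented stripes
    shading = np.exp(-((xs - light_x) ** 2) * 4)                            # light falling off from light_x
    return brightness * pattern * shading

latents = (0.7, 0.3, 0.9)            # the 'compressed' description of the scene
image = render(*latents)             # one million pixels, all determined by the latents
print(image.shape, image.size)       # (1000, 1000) 1000000

# Sending the scene to a friend: three floats instead of 1,000,000 pixel values.
reconstruction = render(*latents)    # anyone with the same generative model reproduces it exactly
print(np.allclose(image, reconstruction))  # True
```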

5 Priors

Setting up a scene

Imagine you are setting up a scene and trying to match photographs of real objects. You would quickly discover that in the real world, some combinations of slider settings occur far more often than others.

For instance, light sources are usually above objects rather than below them, and objects tend to rest stably on surfaces.

Through accumulated experience, you would develop an intuition for which parameter combinations are more likely. This is exactly what the brain does. It learns which patterns of latent neuron activity correspond to situations that are common in the real environment, and therefore which causes are more frequent.

Priors

These learned probabilities of different causes are what we call "priors", because they represent the beliefs you hold before taking the sensory data into account. Priors are crucial for making sense of ambiguous situations.

Context matters

If you are walking through a city park and catch a glimpse of something orange and striped out of the corner of your eye, your brain will favor an explanation that is common in that context: perhaps a child's stuffed toy, or someone wearing a striped shirt. Even though the sensory data might to some extent also be consistent with "tiger", your prior belief that encountering a tiger in a city park is extremely unlikely leads you to a more reasonable interpretation.

If, however, you are on a safari, the same orange-striped glimpse would likely trigger a very different judgment, because your prior beliefs about what is likely to appear in that environment are quite different.
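A minimal Bayesian sketch of the park-versus-safari flip, with all probabilities invented for illustration: the same likelihoods for the orange-striped glimpse are combined with two different priors, and the most plausible cause changes with the context.

```python
# Same ambiguous glimpse (something orange and striped), different priors per context.
likelihood = {"tiger": 0.8, "stuffed toy": 0.3, "striped shirt": 0.3}   # p(glimpse | cause)

priors = {
    "city park": {"tiger": 0.0001, "stuffed toy": 0.60, "striped shirt": 0.3999},
    "safari":    {"tiger": 0.30,   "stuffed toy": 0.10, "striped shirt": 0.60},
}

for context, prior in priors.items():
    unnormalized = {cause: prior[cause] * likelihood[cause] for cause in prior}
    z = sum(unnormalized.values())
    posterior = {cause: round(p / z, 3) for cause, p in unnormalized.items()}
    best = max(posterior, key=posterior.get)
    print(f"{context}: most plausible cause = {best}, posterior = {posterior}")
# In the park the glimpse reads as a toy; on safari the same glimpse reads as a tiger.
```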

Components of the generative model

The generative model the brain uses to make sense of the outside world therefore has two components:

  • the prior, which tells us how likely various causes are;

  • the generator network, which synthesizes sensory data for a given cause.

6 Approximate Inference via a Recognition Model

The inference problem

In real life, however, we constantly face the opposite problem: we receive sensory input and need to figure out what caused it. This process is called inference, inferring causes from observations, and doing it well is far from easy.

The rendering analogy, inverted

Let's return to the Blender analogy. Imagine you are handed only the final rendered image and must work out the exact position of every slider used to create it. This "inverse" problem is computationally expensive.

In general, to find the right causes you might have to try every possible combination of slider positions, render an image from each one, and compare it with the target. Even with just three sliders, each with 100 possible values, that is a million combinations to test.
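To see why naive inversion explodes, here is a brute-force sketch using a toy three-slider renderer of my own: it simply renders every grid combination and keeps the closest match, which is exactly the strategy the brain cannot afford.

```python
import itertools
import numpy as np

xs = np.linspace(0, 1, 10)

def render(rotation: float, light_x: float, brightness: float) -> np.ndarray:
    """Tiny stand-in for a renderer: three sliders -> a 10-dimensional 'image'."""
    return brightness * np.sin(5 * rotation * xs) * np.exp(-(xs - light_x) ** 2)

target = render(0.42, 0.73, 0.55)   # all we are given; the slider values are unknown

# The count from the text: 3 sliders x 100 positions each = 1,000,000 renders to compare.
print("renders needed at 100 positions per slider:", 100 ** 3)

# To keep the demo fast we search a coarser grid, but the procedure is the same:
# render every combination and keep the one closest to the target.
grid = np.linspace(0, 1, 50)
best_error, best_setting = float("inf"), None
for combo in itertools.product(grid, repeat=3):
    error = float(np.sum((render(*combo) - target) ** 2))
    if error < best_error:
        best_error, best_setting = error, combo

print("best grid match:", np.round(best_setting, 2))   # should land near (0.42, 0.73, 0.55)
```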

The brain's version of the problem

Your brain faces a similar but far more complex problem. It has millions of latent neurons, each with many possible activity levels; checking every combination would take longer than the age of the universe. Yet the brain solves the problem almost instantly.

No time to search

When you catch a glimpse of what might be orange stripes, you simply do not have time to test billions of possible causes. If the stripes really do belong to a tiger, you have to react immediately.

Question: how does the brain cope?

So how does the brain manage this seemingly impossible task?

Approximate inference

The key is that although we cannot directly invert the generative model, or compute the exact probabilities of different causes given the sensory observation, we can try to find an approximation.

The recognition model

The brain has a separate network, called the recognition model, which works in the opposite direction: it maps sensory observations to a distribution over possible causes.

A rough first guess

However, this result is only an approximation, a rough first guess at which causes might explain the sensory observation.
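One way to picture the recognition model, phrased in machine-learning terms rather than anything stated in the video: a cheap function fitted to past (pattern, cause) pairs that maps an observation straight back to approximate latent values in a single pass. The linear least-squares fit below is only a stand-in for whatever circuitry does this in the brain, and its answer is deliberately rough.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0, 1, 10)

def render(latents: np.ndarray) -> np.ndarray:
    """Toy generative direction: three latent 'sliders' -> a 10-dimensional sensory pattern."""
    rotation, light_x, brightness = latents
    return brightness * np.sin(5 * rotation * xs) * np.exp(-(xs - light_x) ** 2)

# 'Experience': pairs of (sensory pattern, latent cause) gathered over time.
train_latents = rng.uniform(0, 1, size=(5000, 3))
train_images = np.array([render(z) for z in train_latents])

# Recognition model: here just a linear map fit by least squares,
# standing in for whatever maps observations back to causes.
design = np.hstack([train_images, np.ones((len(train_images), 1))])
weights, *_ = np.linalg.lstsq(design, train_latents, rcond=None)

def recognize(image: np.ndarray) -> np.ndarray:
    """Amortized inference: one cheap forward pass instead of a million renders."""
    return np.append(image, 1.0) @ weights

true_latents = np.array([0.42, 0.73, 0.55])
guess = recognize(render(true_latents))
print("true:", true_latents, "first guess:", np.round(guess, 2))
# The guess is only approximate, which is the point: it then gets refined
# by the interaction between the two networks described next.
```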

Iterative refinement

To improve this guess, the brain engages in multiple rounds of interaction between the recognition and generative networks, looping, refining, and correcting the inferred result.

Aligning the two networks

Crucially, for this system to work, the recognition and generative networks must stay aligned. They need to "speak the same language of causes".

Matching causes to patterns

When the recognition network proposes a particular pattern of latent neuron activity as an explanation, the generative network should produce sensory patterns that match the cause the recognition network has learned to associate with that pattern.

Alignment must be learned

This alignment is not automatic; it has to be learned through experience.

7 The Dance Between Recognition and Generation

The two models working together

Now that we have seen how the brain uses a recognition model and a generative model to make sense of the world, let's look at how they work together to minimize free energy.

The interaction loop

When your brain receives new sensory input, the two models engage in a rapid "dance":

  • The recognition model proposes possible explanations.

  • The generative model checks how well those explanations match the actual sensory input.

  • If there is a mismatch, that is, if the generative model predicts sensory patterns that do not fit experience, the brain adjusts the explanation and tries again.

Minimizing free energy

This back-and-forth continues until the brain finds the explanation that minimizes free energy, the one that satisfies both the incoming sensory data and its prior beliefs (a toy numerical sketch of this loop follows below).
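Here is a toy version of that loop, continuing the three-slider setup from the earlier sketches and in no way the brain's literal algorithm: starting from a rough guess, the latents are nudged downhill on a free-energy-like score that adds prediction error to a penalty for straying from the prior.

```python
import numpy as np

xs = np.linspace(0, 1, 10)
prior_mean = np.array([0.5, 0.5, 0.5])      # assumed 'typical' latent values, standing in for the prior

def render(latents: np.ndarray) -> np.ndarray:
    """Toy generative model: three latents -> a 10-dimensional sensory prediction."""
    rotation, light_x, brightness = latents
    return brightness * np.sin(5 * rotation * xs) * np.exp(-(xs - light_x) ** 2)

def free_energy(latents: np.ndarray, observed: np.ndarray) -> float:
    """Toy score: prediction error (accuracy term) plus deviation from the prior (complexity term)."""
    accuracy_term = np.sum((render(latents) - observed) ** 2)
    complexity_term = 0.1 * np.sum((latents - prior_mean) ** 2)
    return float(accuracy_term + complexity_term)

observed = render(np.array([0.42, 0.73, 0.55]))   # incoming sensory data
latents = np.array([0.5, 0.5, 0.5])               # rough first guess, e.g. from the recognition model

# Perception as refinement: repeatedly nudge the latents downhill on the free-energy score.
for _ in range(400):
    base = free_energy(latents, observed)
    grad = np.zeros(3)
    for i in range(3):                            # crude finite-difference gradient
        bumped = latents.copy()
        bumped[i] += 1e-4
        grad[i] = (free_energy(bumped, observed) - base) / 1e-4
    latents = latents - 0.02 * grad               # small downhill step

print("refined latents:", np.round(latents, 2))   # pulled toward (0.42, 0.73, 0.55), tempered by the prior
print("final free energy:", round(free_energy(latents, observed), 4))
```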

Perception

This is the essence of perception: the brain rapidly adjusts the activity of its latent neurons, fine-tuning its explanation until it finds the most plausible account of the sensory data, usually within a fraction of a second.

Learning

There is also a longer-term process at work: learning.

Improving both models

Over time, the brain improves both models by adjusting the connection weights between neurons:

  • the recognition model becomes better and better at producing initial guesses;

  • the generative model becomes better at predicting sensory consequences and builds up better prior expectations about causes.

One shared goal

It is worth emphasizing that although perception and learning operate on different timescales, they serve the same overarching goal: reducing uncertainty about the world by building an optimal model of the environment and finding good explanations for sensory data within it.

8 Explaining the Visual Illusion

The mask illusion, explained

We can now understand why the brain refuses to perceive the mask as concave even when we know that is what it is.

The brain's dilemma is this: the pattern of light and shadow falling on the retina suggests a hollow shape curving inward. But that interpretation clashes with one of the brain's most deeply rooted priors: faces protrude outward.

From the free energy perspective, the brain has two possible explanations:

  • there really is a concave face in front of you;

  • it is a normal convex face under somewhat unusual lighting.

The brain chooses whichever explanation minimizes total free energy. The prior belief that faces are convex is extraordinarily strong, built up over a lifetime of experience. So the brain would rather assume something odd about the lighting than accept the existence of an inward-curving face.

Slightly discounting the conflicting sensory evidence costs less free energy than violating the fundamental expectation that faces are convex. And the illusion persists even when we know the truth, because it is rooted in evolutionarily old neural circuitry for visual perception; even the analytical parts of the brain cannot override it.
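Putting illustrative numbers on this (they are invented, as before): even if the shading evidence fits the hollow reading better, a near-zero prior on concave faces leaves the convex-face-plus-odd-lighting explanation with almost all of the posterior.

```python
# Hypothetical numbers: how strongly the retinal shading supports each reading,
# and how probable each reading is a priori after a lifetime of convex faces.
likelihood = {"concave (hollow) face": 0.7,  "convex face, unusual lighting": 0.3}
prior      = {"concave (hollow) face": 1e-6, "convex face, unusual lighting": 0.05}

unnormalized = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnormalized.values())
posterior = {h: round(p / z, 6) for h, p in unnormalized.items()}
print(posterior)
# roughly {'concave (hollow) face': 5e-05, 'convex face, unusual lighting': 0.99995}:
# the convex reading dominates even though the data fit it less well,
# because the face-convexity prior swamps the sensory evidence.
```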

9 Conclusion

Putting it all together

Let's put all the pieces together.

The brain as a prediction machine

The core claim of the free energy principle is that the brain is essentially a "prediction machine", constantly trying to explain the endless, tangled stream of incoming sensory information.

The two components

To do this, the brain relies on two things: a generative model, which can produce new sensory patterns, and a recognition model, which works in tandem with it to arrive at the best explanation: a compressed representation that balances the observed evidence against prior beliefs.

Free energy as the measure

This balance is quantified by the value of free energy: the lower it is, the better the explanation fits both what the current world model expects the sensory data to look like and what is already believed.

Closing remarks

Of course, we have only scratched the surface of this fascinating theory. Although this explanation has stayed at the level of intuition rather than mathematical formalism, there is another layer of elegance underneath.

The mathematics behind the free energy principle can look daunting at first, but it actually provides a precise and unified framework for these ideas.

What's next

In future videos we will dig into the deeper mathematical foundations of this theory and explore its connections to modern machine learning, so that we can build artificial systems which, like our own brains, perceive, predict, and develop their own models of the world.

Original English script
# Introduction

This video was brought to you by Squarespace.

Take a look at this mask. It looks like a convex face protruding outwards. Now, let's rotate it.

At this point, you know what you expect to see—a concave mask protruding inwards. However, somehow you get a strong sense that something is off, and the mask sure looks like it got warped to be convex again, even though part of you knows that it's not the case.

But why is your brain so stubbornly convinced that the mask is protruding outward? The answer actually reveals something remarkable about the nervous system.

What if I told you that everything you're seeing, hearing, and feeling at this moment isn't actually reality—that it is a controlled hallucination, your brain constructing and testing hypotheses about what's out there?

There is a powerful theory in neuroscience called the _free energy principle_, which proposes something mind-bending. According to this framework, your brain isn't passively receiving information about the world. It is actively generating predictions about what should be out there and then uses the sensory input to merely check if those predictions are right.

Today, we're going to explore this fascinating theory. We'll discover why evolution turned our brains into prediction machines, how this helps us survive in an uncertain world, and why sometimes, like in this mask illusion, our brain's predictions can override what's actually in front of us.

---

# Role of World Models

But first, let's go back to the beginning.

To understand why our brains work this way, we need to look at the fundamental problem they evolved to solve. The main purpose of the brain, like any trait favored by evolution, is to increase the chances of survival and reproduction.

To achieve this, organisms need to react to stimuli appropriately. For instance, if you sense harmful chemicals, you need to swim away from them. But such simple reactions can be accomplished through basic biochemistry—no complex nervous system required. In fact, you don't even need to be multicellular for this. Just a bag of liquid with a few chemical reactions would work.

However, as organisms began to inhabit more complex environments, they faced a challenge: the outside world is noisy, ambiguous, and often provides only partial information.

For instance, let's say that over the course of your lifetime, you learned that tigers mean danger and should be avoided. To your brain, a tiger is essentially any pattern on the retina that looks similar to this.

Now, suppose one day your retina registers a pattern of activity that looks slightly different. If you had a primitive nervous system that determined whether something is a tiger or not by pure pattern matching, the similarity might be below the threshold, and you would get eaten.

This is where brains come in. They evolved not just as reaction machines but as sophisticated model builders that try to explain sensory inputs by inferring their hidden causes.

In this case, your brain might have an internal model of what a tiger is, how it looks, and what will happen if you get caught. Importantly, the brain also knows that in the real world, objects may be occluded by other objects.

Thus, the brain is capable of combining those two facts together and coming up with an explanation: what you're seeing isn't a totally novel object that looks like half a tiger, but rather an actual full tiger occluded by a tree—so it’s better to run.

This ability to fill in the gaps and come up with plausible explanations for sensory data is at the heart of the brain's evolutionary success.

---

# Free Energy as a Tradeoff Between Accuracy and Complexity

In essence, you can think of your brain like a judge, weighing evidence on a scale.

On one side, there is what your senses are telling you—the raw data coming in through your eyes, ears, and other modalities. On the other side, there is what you already know about how the world works—your prior beliefs built up through evolution and experience.

Your brain is constantly trying to find the perfect balance between these two forces. When they are out of balance, it creates a kind of tension or energy in the brain, which it wants to minimize. This tension is what neuroscientists call _variational free energy_, or just _free energy_ for short.

Let’s go back to our tiger example. When your senses show you half a tiger pattern, that creates a puzzle. One explanation might be that you're seeing a strange half-tiger creature, but that explanation would have very high free energy—it conflicts strongly with your prior knowledge that tigers are whole animals and symmetric.

The other explanation—that it is a complete tiger partially hidden behind something—has a much lower free energy. It fits both what you’re seeing and what you know about how the world works.

Brains minimize free energy to adapt to specific niches in the environment.

---

# Generative Model

But how does the brain actually implement such sophisticated explanations?

The key challenge is that sensory data, like the pattern of light on your retina, consists of thousands of neurons firing in complex patterns. The brain needs some way to compress this vast amount of information into a manageable form.

By finding commonalities and hidden structures in the data, evolution found an elegant solution. Alongside neurons that directly correspond to sensory inputs, the brain evolved _hidden_ or _latent neurons_—neurons that do not directly connect to the outside world.

These latent neurons learn to represent meaningful features or causes at different levels of abstraction.

- At a high level, some neurons encode abstract causes like _tiger_ or _object occlusion_.
- These connect to intermediate neurons that represent features like _stripes_ or _fur texture_.
- In turn, these connect to neurons encoding more basic elements like _edges_ or _color_.

Because latent neurons do not directly interface with the outside world, there is no absolute _ground truth_ for what their activity should be. In fact, the brain is free to choose whatever latent representations it wants.

So, how can we determine which world model is the best?

While we can’t directly verify the latents, we can verify their consequences. A good set of latent causes should be able to explain the patterns we observe in our sensory neurons. _Explain_ here means that they should contain enough information to _reconstruct_ the original sensory data from this compressed representation.

Here’s an intuitive way to think about this:

Imagine a 3D scene in a computer graphics program like Blender. The scene might have just a few adjustable parameters—sliders controlling the rotation of an object, the position of the light source, and the object's color.

When you render the scene, you get a high-resolution image—perhaps 1000 by 1000 pixels. That’s a million variables, each with its own color value. Yet, all possible images you could render from the scene are controlled by just those three slider positions. These few parameters contain all the information needed to reconstruct the scene fully.

For example, if you wanted to share one of these images with a friend, you wouldn’t need to send them all the million pixels. Instead, you could just send three numbers—the positions of each slider—and if they have the same scene setup, they could generate the identical image.

This is similar to how latent neurons encode abstract high-level features of the observed data. The _rendering process_—the complex computation that transforms slider positions into rendered pixels—corresponds to a _generative network_ or _generative model_ inside the brain.

You can think of this _generative model_ as the connection weights between latent and sensory neurons, along with additional neural circuits that reconstruct and _uncompress_ the latent representation.

---

# Priors

Imagine you are setting up a scene to match photographs of real objects. You would quickly discover that some slider combinations occur much more frequently than others in the real world. Light sources are usually above objects, not below them, and objects tend to rest on surfaces in stable positions.

Through experience, you will develop an intuitive sense of which parameter combinations are more likely to occur. This is exactly what your brain does. 

It learns which patterns of latent neuron activity correspond to real-world situations and are thus more common than others.

These learned probabilities of different causes are what we call _priors_, because they represent your prior beliefs before you take the observed sensory data into account. Priors are crucial for making sense of ambiguous situations.

For example, if you are walking through a city park and catch a glimpse of something orange and striped in your peripheral vision, your brain will favor explanations that are common in this context—perhaps a child's stuffed toy or someone wearing a striped shirt. Even though the sensory data might be consistent with a tiger, your prior belief about how unlikely it is to encounter a tiger in a city park helps you arrive at a more reasonable interpretation.

However, if you were on a safari, the same orange-striped glimpse would likely trigger a very different interpretation because your priors about what is likely in that environment are quite different.

Hence, the _generative model_ the brain uses to make sense of the outside world has two components:

1. **The prior**, which tells us how likely various causes are.
2. **The generator network**, which can synthesize sensory data for a given cause.

---

# Approximate Inference via Recognition Model

In real life, however, we constantly face the opposite problem—we receive sensory input and need to figure out what caused it. This process is called _inference_, or inferring causes from observations, and it presents significant computational challenges.

Let's return to our _Blender analogy_. Imagine you are given just the final image and asked to determine the slider positions that created it. This reverse problem is computationally demanding.

To find the right causes, in general, you would need to try every possible combination of slider positions, render an image from each one, and compare it with your target image. Even with just three sliders, each with 100 possible positions, that results in a million combinations to check.

Your brain faces a similar but far more complex problem. It has millions of latent neurons, each with many possible activity levels—checking every combination would take longer than the age of the universe. And yet, the brain solves this problem nearly instantaneously.

When you catch a glimpse of an orange-striped pattern, you don't have time to test billions of possible causes. If that pattern really is a tiger, you need to figure it out fast.

So how does the brain manage this seemingly impossible task?

The key idea is that while we can't directly invert the _generative model_ and compute exact probabilities of how different causes are likely given the sensory observation, we can try to find an _approximation_.

The brain has a separate network, called the _recognition model_, which works in the opposite direction—it maps sensory observations to the distribution of possible causes. However, this result is only an approximation, a rough _first guess_ of what causes might explain the sensory observation.

To improve this guess, the brain engages in multiple rounds of interaction between the recognition and generation networks, refining the estimate in a loop.

Crucially, for this system to work, the _recognition_ and _generative_ networks must be **aligned**. They need to "speak the same language of causes." When the recognition network suggests a particular pattern of latent neuron activity as an explanation, the generative network should produce sensory patterns that match what the recognition network has learned to associate with those causes.

This alignment isn't automatic—it must be _learned_ through experience.

---

# The Dance Between Recognition and Generation

Now that we have seen how the brain uses _recognition_ and _generative_ models to make sense of the world, let's examine how they work together to minimize _free energy_.

When your brain encounters new sensory input, these two models engage in a rapid "dance."

1. The **recognition model** proposes possible explanations.
2. The **generative model** checks how well those explanations match the actual sensory input.
3. If there is a mismatch—if the generative model predicts sensory patterns that don't align with experience—the brain adjusts the explanation and tries again.

This back-and-forth process continues until the brain finds an explanation that _minimizes free energy_, meaning it satisfies both the incoming sensory data and the brain’s prior beliefs.

This is the essence of **perception**—the brain rapidly adjusts the activity of latent neurons, tweaking its explanations until it finds one that best explains the sensory data. This happens within fractions of a second.

But there is also a longer-term process at play—**learning**. Over time, the brain refines both models by adjusting the connection weights between neurons:

- The **recognition model** becomes better at making initial guesses.
- The **generative model** improves its ability to predict sensory consequences and builds up better prior expectations of causes.

Even though **perception** and **learning** operate on different timescales, they both serve the same overarching goal—**reducing uncertainty in the world by building optimal models of the environment and finding explanations for sensory data within those models**.

# Explanation for Optical Illusion

Now we understand exactly why the brain refuses to see the mask as concave, even when we know that's what it is.

Your brain is facing a dilemma: the pattern of light and shadows on your retina suggests a hollow, inward-protruding shape. But this interpretation would violate one of your brain's strongest prior beliefs—that faces protrude outward.

From the _free energy_ perspective, your brain has two possible explanations:

1. There is a concave face in front of you.
2. There is a normal convex face with somewhat unusual lighting.

The brain chooses the explanation which minimizes total _free energy_. The prior belief about faces being convex is incredibly strong, built from a lifetime of experience. As a result, the brain would rather assume there is something unusual about the lighting than accept the existence of an inwardly protruding face.

The _free energy_ is lower when slightly mismatching sensory evidence than when violating the fundamental expectation about faces. And knowing the truth doesn't break the illusion, as it is rooted in a more evolutionarily conserved circuitry for visual perception. Even the analytical part of the brain cannot override this.

---

# Conclusion

Let's put all the pieces together.

At its core, the _free energy principle_ suggests that our brains are essentially _prediction machines_, constantly trying to explain the chaos of incoming sensory information.

They achieve this by having:

- A **generative model** that can come up with new sensory patterns.
- A **recognition model** that works in tandem to arrive at the best explanation—  
    a compressed representation of sensory patterns that balances observed evidence with prior beliefs.

This balance is quantified by the value of _free energy_, with lower values corresponding to favorable explanations that fit both the incoming sensory data and existing beliefs about what is likely to be observed according to the current world model.

Of course, we've only scratched the surface of this fascinating theory. While I chose to keep this explanation conceptual—focusing on intuitive understanding rather than mathematical formalism—there is another layer of beauty to discover.

The mathematics behind the _free energy principle_, although initially daunting, actually reveals an elegant framework that ties these ideas together.

In future videos, we'll explore this deeper mathematical foundation and see how it connects to modern machine learning, allowing us to build artificial systems that, like our own brains, can **perceive, predict, and develop their own models of the world**.