[ExI] Why do the language model and the vision model align?

Ben Zaiboc benzaiboc at proton.me
Mon Feb 16 18:55:25 UTC 2026


On 16/02/2026 16:34, John K Clark wrote:
> The point is it's easy to see how an AI that has been exposed to nothing but text could learn pure abstract mathematics, but it's much more difficult to figure out how it could also learn physics.  


I think BillK has probably answered this question in his post "[ExI] How AI understands the world".

They don't learn completely on their own; we are teaching them. It's tempting to call this 'cheating', but it's not really: it's just education.

I do think that the example of a human brain is relevant, though, in that a brain only receives abstract signals and has to correlate them together. The big difference is that human babies get data through a number of different channels, clearly differentiated (at first), representing the different sensory modalities. This information, together with feedback from things like picking objects up and cramming them into their mouths, then gets associated together in many different ways. A chatbot just gets text. A future multi-modal AI would be much closer to a human baby, though, and I wouldn't be surprised if it could learn physics etc. as easily as a human can. An embodied AI even more so.
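
To make that 'associated together' step a bit more concrete: paired image-and-text models such as CLIP are aligned with a contrastive objective, where each input and its partner from the other modality are pushed to the same point in a shared embedding space. Here is a minimal NumPy toy sketch of that kind of loss; the data and projections are random stand-ins, and all the names (img_feats, W_img, etc.) are just made up for illustration, not anyone's actual code:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup: 4 paired examples (an image and its caption), each
    # already encoded by a separate "vision" and "text" model.
    n, d_img, d_txt, d_shared = 4, 32, 48, 16
    img_feats = rng.normal(size=(n, d_img))
    txt_feats = rng.normal(size=(n, d_txt))

    # Learned linear projections into a shared space (random here).
    W_img = rng.normal(size=(d_img, d_shared))
    W_txt = rng.normal(size=(d_txt, d_shared))

    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    z_img = normalize(img_feats @ W_img)
    z_txt = normalize(txt_feats @ W_txt)

    # Cosine similarity between every image and every caption.
    logits = z_img @ z_txt.T

    # Contrastive objective: each image's matching caption should
    # score higher than all the other captions in the batch.
    def cross_entropy(logits, targets):
        shifted = logits - logits.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(targets)), targets].mean()

    targets = np.arange(n)  # the i-th image pairs with the i-th caption
    loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
    print(f"contrastive alignment loss: {loss:.3f}")

Minimising that loss pulls each image and its caption together in the shared space, which is one concrete answer to the question in the subject line: the two models align because they are trained on correlated channels, much as a baby's senses are.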

A multi-modal AI linked to a number of robotic bodies equipped with different sensory and motor capabilities (plus a few other things, like long-term and working memories and the ability to model other agents) would not need a 'Mechanical Turk' system; I suspect it would very soon reach human equivalence in many areas.

-- 
Ben
