[ExI] Why do the language model and the vision model align?
Jason Resch
jasonresch at gmail.com
Thu Feb 12 13:53:24 UTC 2026
On Thu, Feb 12, 2026, 6:59 AM John Clark <johnkclark at gmail.com> wrote:
> On Wed, Feb 11, 2026 at 10:29 AM Jason Resch via extropy-chat <
> extropy-chat at lists.extropy.org> wrote:
>
> * > consider that the brain only gets neural spikes, from different nerves
>> at different times. These are just symbols, bits.*
>>
>
> *Yes, and a neuron cell does not understand how the external physical
> world works; the entire individual does because he can directly interact
> with that world. *
>
>
>> *> And yet, from these mere patterns of firings, the brain is able to
>> construct your entire world. This feat is no less magical than what LLMs can
>> do from "just text".*
>>
>
> *I don't think either one is doing anything "magical", it's just that I
> don't have a deep understanding of how they are able to do what they do,
> and most of the top researchers at the AI companies admit that they don't
> either.*
>
There are different levels of explanation.
We can explain what happens at a high level: the universal approximation
theorem (UAT), function approximation, why good token-prediction functions
require real-world understanding/modeling, etc.
And we can explain what happens at a low level: neuron activations, weights
and biases, the feeding forward of signals from one layer to the next, etc.
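As a minimal sketch of that low level, here is what a single artificial
neuron and a tiny two-layer feed-forward pass look like in Python (the
weights and inputs are arbitrary numbers, just to show the mechanics):

    import math

    def neuron(inputs, weights, bias):
        # Weighted sum of inputs plus a bias, squashed by a sigmoid activation.
        z = sum(x * w for x, w in zip(inputs, weights)) + bias
        return 1.0 / (1.0 + math.exp(-z))

    # Feed a signal forward through two tiny layers.
    hidden = [neuron([0.5, -0.2], [0.8, 0.3], 0.1),
              neuron([0.5, -0.2], [-0.5, 0.9], 0.0)]
    output = neuron(hidden, [1.2, -0.7], 0.2)
    print(output)

Every layer of every LLM is built from units like this; the mystery is not
in the mechanics but in what the trained weights collectively do.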
Where we get lost is in the myriad details of the middle: the collective
weights of billions or trillions of parameters, the millions of neural
circuits tuned to recognize any of tens of thousands of distinct patterns,
the wiring diagram linking and relating this vast number of internal
mid-level functions and algorithms.
And the same is true for any complex system, whether it be computer
software or the human brain. We can understand Microsoft Word at a high
level, through its top-level interface. And we can understand what happens
at the bottom level, with transistors, wires, and logic gates. But no human
on earth can fully comprehend its 50,000,000 lines of code.
The brain is like that too. We understand high-level psychological
behaviors, motivations, and desires. And we understand how neurons work.
But no human has a chance of fully understanding the 700 trillion
connections and what they all mean, nor the algorithms behind every
function and subroutine the brain computes.
This is what I call "the complexity of the middle." What we know from our
quantification of its complexity is that a detailed, non-summarized, full
understanding of it will forever be beyond the grasp of our human-level
intelligence. We lack the capacity to grasp billion- or trillion-part
machines. And that is what our brains, and current LLMs, are.
>
> *>> OK, I can see how you might be able to determine that a book written
>>> by ET was about arithmetic, and that some of the symbols represented
>>> integers and not letters, and that a base 10 system was used. But I don't
>>> see how you could use the same procedure to determine that another book was
>>> about chemistry unless you already had a deep understanding of how real
>>> world chemistry worked, and I don't see how you could obtain such chemical
>>> knowledge without experimentation, just by looking at sequences of
>>> squiggles. But apparently there is a way. *
>>>
>>
>> *> A student can learn a great deal from reading a chemistry textbook,
>> without ever entering a lab and taking out a beaker.*
>>
>
> *Yes, but before the average student has even opened a chemistry book he has
> already been exposed to nearly two decades of real world experience and has
> an intuitive feel for things like mass, velocity, heat and position, so he
> can understand what the book is saying, *
>
But for the human brain to reach that point, it has to construct its full
understanding of the real world using nothing more than the *raw
statistics* it finds in the patterns and correlations of nerves firing at
different times.
This is all the brain ever sees of the outside world. The student didn't
receive images and sounds and smells; the brain had to invent those out of
the raw "dots and dashes" from the nerves.
If the brain can do such a thing, then is it any more surprising that
another neural network could bootstrap itself to a degree of understanding
from nothing more than data containing such statistical correlations?
> *but that would not be the case for an AI that has never observed anything
> except a sequence of squiggles and understood almost nothing except
> arithmetic and the laws of logic. It might have an intuitive understanding
> of time but not of space, and from such a humble beginning I don't
> understand how anything could deduce Newtonian Physics, much less Quantum
> Physics which would be required to understand modern chemistry. *
>
Think about the problem from the perspective of a human brain, alone inside
a dark, quiet, hollow bone, with only a fiber-optic cable connecting it to
the outside world. This cable carries only bits. The brain must figure out
how to make sense of it all, to understand the "real world" from this
pattern of information alone.
>
> *>> The existence of isotopes would greatly complicate things, for example
>>> we know that the element Tin has 10 stable isotopes and 32 radioactive
>>> ones, they all have identical chemical properties but, because of neutrons,
>>> they all have different masses. *
>>>
>>
>> *> The periodic table, and Wikipedia's article on each element, lists
>> atomic number (number of protons) in addition to atomic weights.*
>>
>
> *But how could the poor AI make sense out of that Wikipedia article if it
> had no understanding of what the sequence of squiggles "w-e-i-g-h-t-s" even
> means? I don't deny that it can understand what it means, I just don't
> know how. *
>
I gave you an example regarding Pi and its surrounding words. Think of an
alien civilization trying to decode our Wikipedia. Think about which words
always follow "the." Think about how a hierarchy of words can be
constructed by seeing which words are found in the pattern "X is a Y". At
the top of this hierarchy you'll find the most general words, like "object"
and "thing", and below them a whole tree structure revealing a hierarchical
taxonomy of real-world things.
For example:
      Being
        |
      Human
       / \
    Man   Woman
The entire structure of all real-world objects can thus be extracted, even
if the meaning of these words isn't yet known, just by looking at the
patterns around the word "is".
Consider the linguistic pattern "X is made of Y". Following this pattern
reveals a hierarchy whose topmost level holds things like atoms,
fundamental particles, and quantum fields; it reveals the fundamental
nature of our reality: things whose dictionary entries and encyclopedia
articles are explained in mathematical equations, rather than in words
referring to other things.
Now consider: what more might one learn about our world from examining any
of the millions of other words, and the linguistic patterns that surround
each one?
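As a crude illustration of how much structure those two patterns alone
expose, here is a Python sketch run over a toy corpus (the sentences are
invented stand-ins for encyclopedia text; real extraction would need far
more than a regex):

    import re
    from collections import defaultdict

    # A toy corpus standing in for encyclopedia text (invented sentences).
    corpus = """A man is a human. A woman is a human. A human is a being.
    Water is made of molecules. A molecule is made of atoms.
    An atom is made of particles."""

    kind_of = defaultdict(set)  # taxonomy from "X is a Y"
    made_of = defaultdict(set)  # composition from "X is made of Y"

    text = corpus.lower()
    for x, y in re.findall(r"(\w+) is an? (\w+)", text):
        kind_of[y].add(x)
    for x, y in re.findall(r"(\w+) is made of (\w+)", text):
        made_of[y].add(x)

    print(dict(kind_of))  # {'human': {'man', 'woman'}, 'being': {'human'}}
    print(dict(made_of))  # {'molecules': {'water'}, 'atoms': {'molecule'}, ...}

Nothing in the program knows what "human" or "atom" means, yet the
hierarchies fall out of the word patterns alone.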
> *>> Perhaps Mr. Jupiter Brain could deduce the existence of the physical
>>> world starting from nothing but arithmetic (but I doubt it) however it is
>>> certainly far far beyond the capability of any existing AI, so they must be
>>> using some other method. I just wish I knew what it was. *
>>>
>>
>>
>> *> Are you familiar with the universal approximation theorem?*
>>
>
> *Yes, a neural network can model any continuous function with arbitrary
> precision, but the vast majority of continuous functions do not model
> anything fundamental in either Newtonian or Quantum Physics, so how
> does an AI differentiate between those that do and those that don't? *
>
Are you asking how neural networks learn functions from samples of inputs
and outputs? If so, then I would refer you to backpropagation and gradient
descent.
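To make that concrete, here is a toy Python sketch of gradient descent
recovering a function (y = 2x + 1) from nothing but input/output samples;
the learning rate and step count are arbitrary choices:

    # Samples drawn from the unknown target function y = 2x + 1.
    samples = [(x, 2 * x + 1) for x in range(-5, 6)]
    w, b, lr = 0.0, 0.0, 0.01  # initial guess and learning rate

    for step in range(1000):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / len(samples)
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / len(samples)
        w -= lr * grad_w  # step downhill on the error surface
        b -= lr * grad_b

    print(w, b)  # converges toward w = 2.0, b = 1.0

Backpropagation is this same idea applied layer by layer through a deep
network.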
If you are asking how a neural network can approximate any computable logic
circuit, then I would refer you to Hillis's "The Pattern on the Stone":
"The first thing to notice about artificial neurons is that they can be
used to carry out the And, Or, and Invert operations. [...] Since any
logical function can be constructed by combining the And, Or, and Invert
functions, a network of neurons can implement any Boolean function.
Artificial neurons are universal building blocks."
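Hillis's point is easy to verify. A sketch in Python, with hand-picked
weights and thresholds for each gate:

    def neuron(inputs, weights, threshold):
        # A simple threshold unit: fires (1) when the weighted
        # sum of its inputs reaches the threshold.
        return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

    AND    = lambda a, b: neuron([a, b], [1, 1], 2)
    OR     = lambda a, b: neuron([a, b], [1, 1], 1)
    INVERT = lambda a:    neuron([a],    [-1],   0)

    # XOR composed from the universal building blocks:
    XOR = lambda a, b: AND(OR(a, b), INVERT(AND(a, b)))
    print([XOR(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]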
If you are asking how it's possible in principle to bootstrap meaning from
the raw patterns and relationships of words found in text, I would refer
you to what I wrote above.
If you are asking something else not covered here, you will need to be more
specific.
It may be that what you are asking falls within "the complexity of the
middle", in which case I hope you can appreciate why no comprehensible
answer can be provided.
> > "What does it mean to predict the next token well enough? [...] It's a
>> deeper question than it seems. Predicting the next token well means that
>> you understand the underlying reality that led to the creation of that
>> token."
>> -- Ilya Sutskever (of OpenAI)
>> https://www.youtube.com/watch?v=YEUclZdj_Sc
>>
>
> *Ilya Sutskever certainly knows a hell of a lot more about this than I do
> so maybe he's right.*
>
I don't see how it could be wrong. Short of this explanation, how can we
explain that LLMs can play chess? To play chess well requires a
model/function that understands chess: how the pieces move, relate, and
attack, and what the goal of the game is. That is far beyond a mere
"stochastic parrot," as some have attempted to describe LLMs.
Jason
>
> *John K Clark*