[ExI] Why do the language model and the vision model align?
Jason Resch
jasonresch at gmail.com
Sat Feb 14 14:28:04 UTC 2026
On Sat, Feb 14, 2026, 8:02 AM John Clark <johnkclark at gmail.com> wrote:
> On Fri, Feb 13, 2026 at 10:09 AM Jason Resch via extropy-chat <
> extropy-chat at lists.extropy.org> wrote:
>
> *>> I think it is more surprising. For humans most of those dots and
>>> dashes came directly from the outside physical world, but for an AI that
>>> was trained on nothing but text none of them did; all those dots and dashes
>>> came from another brain and not directly from the physical world. *
>>>
>>
>> *> Both reflect the physical world. Directness or indirectness I don't
>> see as relevant. Throughout your brain there are many levels of
>> transformation of inputs.*
>>
>
> *But most of those transformations make sense only in light of other
> things that the brain knows, chief among them being an intuitive
> understanding of everyday physics. Nevertheless, as it turns out, that fact
> doesn't matter. Perhaps I shouldn't have been, but that surprised me. *
>
I don't think much knowledge of physics is pre-wired. It takes children
months or even years to learn that objects continue to exist when no longer
seen (object permanence), for example. Inertia was also highly
counterintuitive, and it took until Descartes and Newton before this basic
concept was understood by humans. So I believe nearly everything we know
about physics comes from training, and very little is pre-programmed by our
genes.
I also believe that the commonalities we see in brain regions between
people are an emergent property, resulting from similar physiological
organization and similar kinds of inputs. But note that despite general
similarities (e.g. vision is processed in the back), there are differences.
Some people's speech generation function is in the right rather than the
left hemisphere. There are also experiments where researchers rewired the
optic nerve of animals to go to their auditory cortex rather than the
visual cortex. These animals developed normal vision.
The informational complexity of an adult human brain is roughly a million
times that of the genome. So whatever pre-set intuitions came from your
genes can, at best, represent only a millionth of what you now know.
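To make that back-of-envelope claim concrete, here is a rough calculation
(the figures are only assumptions: roughly 3.2 billion base pairs at 2 bits
each for the genome, and roughly 10^15 synapses storing a few bits each for
the brain; synapse-count estimates vary by an order of magnitude):

# Rough, assumption-laden comparison of genome vs. brain information content.
genome_base_pairs = 3.2e9                  # assumed human genome size
genome_bits = genome_base_pairs * 2        # 2 bits per base pair, ~6.4e9 bits

synapses = 1e15                            # assumed; estimates span ~1e14 to 1e15
bits_per_synapse = 4                       # assumed storage per synapse
brain_bits = synapses * bits_per_synapse   # ~4e15 bits

print(f"ratio: {brain_bits / genome_bits:.1e}")   # ~6e5, on the order of a million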
> *> Consider: I could train a neural network to monitor the health of a
>> retail store by training it on every transaction between every customer and
>> vendor, or I could train a neural network on the quarterly reports issued
>> by the retail store's accountant to perform the same analysis. That one set
>> of data has gone through an accountant's brain doesn't make the data
>> meaningless or inscrutable.*
>>
>
> *Not a good example; it's no better than Chess, because neither would
> require even an elementary understanding of how physical objects interact
> with each other, but most things of real importance do. *
>
What's important to take away from this example is that developing an
understanding of how things interact *need not come from direct*
measurements of the fundamental entity in question; the measurements can be
indirect.
And what's important to take away from the Chess example is that an
understanding of how things interact can be extracted *merely from textual
examples and descriptions* of those things interacting.
Put these two together and you can see that an LLM, given only *indirect*
and only *textual* examples describing physical things interacting, can
come to have an understanding of physical things and how they interact.
>
>> *>> I can see how an AI that was trained on nothing but text could
>>> understand that Pi is the sum of a particular infinite sequence, but I
>>> don't see how it could understand the use of Pi in geometry because it's
>>> not at all clear to me how it could even have an understanding of the
>>> concept of "space"; and even if it could, the formulas that we learned in
>>> grade school about how Pi can be used to calculate the circumference and
>>> area of a circle from just its radius would be incorrect except for the
>>> special case where space is flat. *
>>>
>>
>> *> Even if the LLM lacks a direct sensory experience of 3-dimensional
>> world, it can still develop an intellectual and even an intuitive
>> understanding of it, in the same way that a human geometer with no direct
>> experience of 5 dimensional spaces, can still tell you all about the
>> various properties of shapes in such a space, and reason about their
>> relationships and interactions.*
>>
>
> *But the human scientist is not starting from nothing, he already has an
> intuitive understanding of how 3 dimensions work so he can make an
> extrapolation to 5, but an AI that was trained on nothing but text wouldn't
> have an intuitive understanding about how ANY spatial dimension works. *
>
Even the human scientist had to gain that intuitive understanding of
physics from scratch (unless you presume our genes hardwire the brain for
Euclidean geometry). And if you do presume that, then pick some other
abstract mathematical object that you think humans have no innate
programming for, and revise my example above accordingly.
> *Humans have found lots of text written in "Linear A" that was used by the
> inhabitants of Crete about 4000 years ago, and the even older writing
> system used by the Indus Valley Civilization, but modern scholars have been
> unable to decipher either of them even though, unlike the AI, they were
> written by members of their own species. And the last person who could read
> ancient Etruscan was the Roman emperor Claudius. The trouble is those
> civilizations are a complete blank, we have nothing to go on, today we
> don't even know what spoken language family those civilizations used. *
>
> *Egyptian hieroglyphics would have also remained undeciphered except that
> we got a lucky break, we found the Rosetta Stone which contained the same
> speech written in both hieroglyphics and an early form of Greek which
> scholars could already read. Somehow AI has found its own "Rosetta
> Stone"; I just wish I knew what it was. *
>
If the set of language samples were large enough, we could trivially decode
these languages today.
Simply train an LLM on the samples we have.
Then we can perform a similarity analysis (of the same kind used in the
"Platonic Representation Hypothesis" article you shared earlier). This will
reveal a map between related concepts in our language and theirs.
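To sketch the kind of similarity analysis I have in mind (this is only a
toy illustration under simplifying assumptions, not the method from that
paper; random vectors stand in for real LLM embeddings, and a brute-force
search stands in for a real matcher):

# Toy illustration: recover a concept mapping between two embedding spaces
# purely from their internal similarity structure.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim_matrix(X):
    # Pairwise cosine similarities between the rows of X.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

# Pretend these are embeddings of 5 concepts from a model trained on our texts.
ours = rng.normal(size=(5, 16))

# Simulate the other language's model: same concepts, but in an unrelated
# coordinate system (random rotation) and in an unknown order (random shuffle).
true_perm = rng.permutation(5)
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
theirs = (ours @ rotation)[true_perm]

S_ours = cosine_sim_matrix(ours)
S_theirs = cosine_sim_matrix(theirs)

# Find the mapping of their concepts onto ours that best preserves the
# pairwise similarity structure.
best_perm, best_score = None, float("inf")
for perm in itertools.permutations(range(5)):
    p = list(perm)
    score = np.sum((S_ours - S_theirs[np.ix_(p, p)]) ** 2)
    if score < best_score:
        best_perm, best_score = p, score

print("recovered mapping:", best_perm)
print("true mapping:     ", list(np.argsort(true_perm)))

With more than a handful of concepts the brute-force search becomes
infeasible, so in practice you would need something smarter (for example
the mutual-nearest-neighbor alignment used in that paper, or an
optimal-transport style matcher), but the principle is the same: the
internal similarity structure is enough to line the two vocabularies up.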
Jason
>