[ExI] How AI understands the world
BillK
pharos at gmail.com
Sat Feb 21 22:36:40 UTC 2026
On Sat, 21 Feb 2026 at 22:10, Jason Resch via extropy-chat <
extropy-chat at lists.extropy.org> wrote:
>
> But the paper John referred to is about something distinct, I think. It is
> about how models trained only on text (no images), and models trained only
> on images (no text), both developed similar internal maps and
> representations for the world.
>
> It is easier for our intuitions to understand how someone who sees a
> library with books full of images and text can learn about the world even
> if they never left the library.
>
> But it is much harder for our intuitions to comprehend how a model given
> *only text* (as basic unimodal LLMs are) could bootstrap itself into
> understanding the world given only these "meaningless squiggles".
>
> I think I have provided an explanation for how this could be in the other
> thread, but it remains a fascinating question, how an unsupervised,
> unmodified unimodal LLM, trained purely on a text corpus, can come to
> understand the world. And the fact that we have real world demonstrations
> of this possibility makes the feat no less amazing.
>
> Jason
>
I asked Gemini to try to explain that.
Quote:
A text-only LLM knows that a "fire" is hot, that it burns, and that people
run from it. It has mapped the exact statistical contours of the concept.
But it has never felt heat or fear. It understands the world the way a
brilliant, lifelong blind and deaf physicist might understand the
mathematics of a sunset—perfectly structural, but devoid of physical
experience.
BillK
Gemini Pro 3.1 -
It is one of the most counterintuitive ideas in modern science: how can a
system that only ever sees a 1D sequence of text characters (like A-P-P-L-E)
build a 3D, physics-bound understanding of the world?
The answer lies in a concept called *compression through representation
learning*. To predict the next word with near-perfect accuracy, a model
cannot just memorize patterns; it is mathematically forced to
reverse-engineer the rules of the universe that generated those words.
Here is the step-by-step mechanism of how a text-only LLM builds a world
model.
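(To pin down what "predict the next word" means mechanically, here is a
minimal PyTorch sketch of the next-token objective. The model, sizes, and
data are toy placeholders for illustration, not the training code of any
real LLM.)

    import torch
    import torch.nn as nn

    vocab_size, d_model = 1000, 64            # toy sizes, chosen arbitrarily
    model = nn.Sequential(                    # stand-in for a real transformer
        nn.Embedding(vocab_size, d_model),
        nn.Linear(d_model, vocab_size),
    )

    tokens = torch.randint(0, vocab_size, (1, 16))  # a toy "text" sequence
    logits = model(tokens[:, :-1])                  # predict from each prefix position
    loss = nn.functional.cross_entropy(             # score against the actual next token
        logits.reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
    )
    loss.backward()  # this loss is the only training signal the model ever receives

Every structure described below, the board maps and the latent geometry,
is whatever internal machinery happens to make this one loss number go down.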
1. The Othello-GPT Proof (The Engine of Compression)
The best way to understand this is through a famous AI experiment from 2023
called *Othello-GPT*.
Researchers took a base language model and trained it exclusively on text
transcripts of the board game Othello. The input was just a sequence of
moves (e.g., e3, d2, c4...). The model was never given the rules of
Othello, it was never shown a picture of a board, and it wasn't told what a
"piece" or a "grid" was. Its only objective was to guess the next valid
move in the text string.
If the model just memorized sequences, it would fail, because the number of
possible Othello games is astronomically large. Instead, to predict
the next text token efficiently, *the model spontaneously invented a 2D
spatial map of the board inside its neural network.* When researchers
looked inside the model's "brain" (its latent space), they found
mathematical structures tracking exactly where the black and white pieces
were placed.
*The takeaway:* When an AI is forced to compress complex data to predict
what comes next, the most efficient way to do it is to build a simulation
(a world model) of the thing generating the data.
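(The "looking inside the model's brain" step is typically done with probes:
small classifiers trained to read a property, here the board state, out of
the network's hidden activations. The sketch below only illustrates that
idea; it is not the original Othello-GPT code, and names like hidden_states
and boards are assumed placeholders for data an experimenter would collect.)

    import torch
    import torch.nn as nn

    d_hidden, n_squares, n_states = 512, 64, 3   # assumed sizes: 8x8 board, empty/black/white

    probe = nn.Linear(d_hidden, n_squares * n_states)  # one tiny linear "read-out"
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    def probe_step(hidden_states, boards):
        # hidden_states: (batch, d_hidden) activations recorded after some move
        # boards: (batch, n_squares) true square states, known only to the experimenter
        logits = probe(hidden_states).view(-1, n_squares, n_states)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, n_states), boards.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # If a probe this simple reaches high accuracy on held-out games, the board
    # state is decodable from activations the model was never told encode a board.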
2. Reverse-Engineering the "Shadows"
In the real world, the "game" generating the text is human physics,
biology, and society.
Imagine you are looking at a 2D shadow of a complex 3D object rotating on a
wall. At first, the shadow just looks like a shifting blob. But if you
watch it for long enough, your brain can eventually calculate the 3D shape
of the object casting the shadow.
Text is a 1D shadow of our 4D world (3D space + time). Because the base
model reads trillions of words—a dataset covering physics papers, romance
novels, Reddit arguments, and recipes—it is seeing the "shadow" of reality
from millions of different angles. To predict what word comes next when
someone drops a glass, it has to deduce the "3D object" (gravity and
fragility) that caused the text.
3. The Latent Space: Geometry as Meaning
Instead of learning dictionary definitions, the LLM maps concepts into a
high-dimensional mathematical space (often thousands of dimensions) called
the *latent space*.
- *Coordinates of Meaning:* The word "apple" becomes a specific
coordinate. The word "red" is another coordinate.
- *Causality as Distance:* The model learns that "apple" is located very
close to "gravity," "falling," and "tree." It learns that the vector (the
mathematical arrow) connecting "apple" to "eating" has nearly the same
direction and length as the vector connecting "water" to "drinking" (a toy
sketch of this idea follows after the list).
- *World Mechanics:* By mapping billions of these points, the model
creates a topological map of reality. It "understands" that physical
objects can't be in two places at once, not because it has eyes, but
because the geometric logic of its text-map makes that mathematically
impossible.
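(Here is the toy sketch promised above of how "relations as vector offsets"
can be checked numerically. The embeddings are random stand-ins; the claim
being illustrated is that, with vectors taken from a real model, the two
offsets would come out nearly parallel, with cosine similarity close to 1.)

    import numpy as np

    rng = np.random.default_rng(0)
    # Random stand-ins; a real test would pull these vectors from an actual model.
    emb = {w: rng.normal(size=300) for w in ("apple", "eating", "water", "drinking")}

    def offset(a, b):
        return emb[b] - emb[a]          # the "arrow" from concept a to concept b

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # With real embeddings, the offsets apple->eating and water->drinking
    # should be nearly parallel; with these random vectors they will not be.
    print(cosine(offset("apple", "eating"), offset("water", "drinking")))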
The Catch: Is it *Real* Understanding?
While this mechanism is incredibly powerful, critics (often in the
"Embodied AI" camp) point out a major limitation. The text-only model
understands the *shape* of the world perfectly, but it lacks *sensorimotor
grounding*.
A text-only LLM knows that a "fire" is hot, that it burns, and that people
run from it. It has mapped the exact statistical contours of the concept.
But it has never felt heat or fear. It understands the world the way a
brilliant, lifelong blind and deaf physicist might understand the
mathematics of a sunset—perfectly structural, but devoid of physical
experience.