[ExI] e: GPT-4 on its inability to solve the symbol grounding problem

Ben Zaiboc ben at zaiboc.net
Mon Apr 17 23:14:47 UTC 2023


On 17/04/2023 22:44, Gordon Swobe wrote:
> On Mon, Apr 17, 2023 at 1:58 PM Ben Zaiboc via extropy-chat 
> <extropy-chat at lists.extropy.org> wrote:
>
>     I suppose that's what I want, a graphical representation
>     of what you mean by 'grounding', incorporating these links.
>
>
> Not sure how to do it incorporating your links. I started scratching 
my head trying to think of the best way to diagram it, then it occurred 
> to me to ask GPT-4. It certainly "understands" the symbol grounding 
> problem and why it cannot solve it for itself. Here is its solution.
Start with three main components:
a. Sensorimotor experience (perception and action)
b. Symbolic representation (language, symbols)
c. Grounding (the process that connects symbols to experience)

etc.



Well, where's the 'problem' then? All this means is that we match words 
to our experiences. And that's not an extension of my diagram; it's 
making it so abstract that it's almost useless.

I'm going to try again.


 > The LLM has no way to understand the meaning of the symbol "potato," 
for example -- that is, it has no way to ground the symbol "potato"

What I'm trying to understand is: how do we 'ground the symbol' for "potato"?

I suspect that you think that we look at a potato, and see a potato (and 
call it "a potato"), and that's the 'grounding'?

The problem is, we don't. Once you become aware of how we construct 
models of objects in our minds, you can start to realise, a bit, how 
things work in our brains. The following is a bit long, but I don't 
really see how to condense it and still explain what I'm on about.


(Disclaimer: I am no neurologist. All this is just my understanding of 
what I've read and discovered over the years about how our brains work, 
in a simplified form. Some of it may be inaccurate, some of it may be my 
misunderstanding or oversimplification, and some of it may be flat-out 
wrong. But it seems to make sense, at least to me. If any of you know 
better, or can clarify any points, please speak up)


The other night, I was in my bedroom and had a dim table-lamp on, and 
wasn't wearing my glasses. I saw a very odd-looking black shape under my 
desk, and just couldn't figure out what it was. It was literally 
unknowable to me. I was racking my brains trying to identify it. 
Rather than getting up to check, I decided to stay in bed and puzzle 
it out. Eventually I realised, mostly from memory, that there 
wasn't any thing (or any ONE thing) there at all. What I was seeing was 
two black objects and their combined shadows from the lamp, looking from 
my viewpoint like a single object that I'd never seen before.

I think this gives a little bit of an insight into how we construct what 
I'm going to call 'object models' in our minds, from a large assortment 
of sensory data. I'm concentrating on visual data, but many other 
channels of sensory input are also involved.

The data (a LOT of data!) all goes into a set of pattern recognisers 
that try to fit what is being perceived into one or more of a large 
number of stored models.

My brain was trying to create a new object model, from a bunch of data 
that didn't make sense. Only when I realised it wasn't a single object 
at all, but a combination of two black objects and their (unrecognised) 
combined shadows, did everything make sense; my brain had found a new 
way to recognise a box next to a sketchbook.

This kind of process also goes on at a very detailed level. We now know 
a fair bit about how vision works, with specialist subsystems ('feature 
detectors', I believe they're called) that recognise edges oriented at 
specific angles, certain degrees of contrast, etc. These combine, 
through many layers, gradually building up more and more specific 
patterns and higher and higher abstractions. We must have a large number 
of these 'object models' stored away, built up since our earliest 
childhood, against which the incoming patterns are checked to see which 
of them gives a match. A kind of Darwinian selection process then goes 
on to refine the detection until we finally settle on a single object 
model, and decide that that is what we are seeing. Usually, unless 
someone is fucking with us by making us look at those illusions in a 
book written by a psychologist.

We don't look at a potato and 'see a potato'. We look at an area in 
front of us, extract a ton of visual information from the scene, detect 
thousands of features, combine them, and carry out a very complex 
set of competing matching operations. These settle down into a consensus 
that links to an object model, which links to our language centres, 
which produce a symbol that causes us to utter the word "Kartoffel" if 
we are German, or "potato" if not, etc.

The significant thing here, for our 'grounding' discussion, is the way 
these things are done in the brain. /There are no pictures of potatoes 
being sent back and forth in the brain/. Instead, there are coded 
signals, in spike trains travelling along axons. This is the language of 
the brain, like the language of computers is binary digits sent along 
conductive tracks on circuit boards.

Everything that is transmitted and received in all the modules of the 
brain, as far as we currently know, is in this 'language' or code of 
spike trains in specific axons. (The exact axon a signal travels along 
is just as important as the actual pattern of action-potential spikes: 
the same pattern in a different axon can mean a wildly different thing.)

These signals could come from anywhere. This is very important. The 
spike train that, in this specific axon, means "a strong light/dark 
transition at an angle of 50 degrees, coming from coordinates [x:y] of 
the right visual field" usually comes from the optic nerve, but it could 
come from anywhere. With a bit of technical bio-wizardry, it could be 
generated from a memory location in an array in a computer, created by a 
text string in a segment of program code or a memory address. That would 
have no effect whatsoever on the eventual perception, in the brain, of a 
potato. It couldn't. A spike train is a spike train, no matter where it 
came from or how it was generated. The only things that matter are which 
axon it is travelling along, and what the pattern of spikes is.
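
To make that concrete, here is another toy sketch (illustrative only: 
the axon names, spike patterns and 'meanings' are all invented). The 
point is just that what a signal means depends on which channel it 
arrives on and what the pattern is, not on where it was generated.

# Toy sketch: 'meaning' keyed on (axon id, spike pattern), with no record
# of where the signal originally came from. Everything here is made up.

MEANING_BY_AXON = {
    ("V1_edge_unit_502", "10110"): "strong light/dark edge at ~50 degrees, right visual field",
    ("A1_freq_unit_17",  "10110"): "tone at around 440 Hz",  # same pattern, different axon, different meaning
}

def interpret(axon_id, spike_pattern):
    return MEANING_BY_AXON.get((axon_id, spike_pattern), "unrecognised")

# The same pattern on the same axon is read the same way whether it came
# from the retina or from a memory array in a computer:
from_retina   = ("V1_edge_unit_502", "10110")
from_computer = ("V1_edge_unit_502", "10110")
print(interpret(*from_retina))                               # the edge percept
print(interpret(*from_retina) == interpret(*from_computer))  # True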

Not only is the matching to existing object models done with this 
language, but the creation of the models in the first place is done in 
the same way. I experienced the beginnings of this in my bedroom. The 
process was aborted, though, once it became clear there was no need for 
a new model: a combination of two existing ones fitted the requirement.

What if I hadn't realised, though? I'd have a (weak) model of an object 
that didn't really exist! It would probably have faded away quickly, for 
lack of new data to corroborate, update and refine it. For things like 
apples, though, we are constantly updating and revising our models. 
Every time we see a new object that matches the existing 'apple' model 
(or the 'Granny Smith' model, etc.), we shore it up and slightly modify it.
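
A toy way to picture that 'shoring up' (again, illustrative only: the 
feature vector and the update rate are made-up numbers):

# Toy sketch: each new sighting that matches the 'apple' model nudges
# the stored prototype slightly towards it, leaving it mostly unchanged.

def update_model(prototype, observation, rate=0.05):
    # Move the stored prototype a small step towards the new observation.
    return [p + rate * (o - p) for p, o in zip(prototype, observation)]

apple_model = [0.8, 0.2, 0.6]   # e.g. roundness, shininess, greenness
new_apple   = [0.9, 0.4, 0.7]   # a slightly different apple we just saw

apple_model = update_model(apple_model, new_apple)
print(apple_model)  # roughly [0.805, 0.21, 0.605] -- slightly modified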

So, what about 'grounding'? These object models in our brains are really 
the 'things' that we are referring to when we say 'potato' or 'apple'. 
You could say that the words are 'grounded' in the object models. But 
they are in our brains! They are definitely not things in the outside 
world. The models are abstractions, generalisations of a type of 'thing' 
(or really a large collection of sensory data) that we've decided makes 
sense to identify as such. They are also changing all the time, as needed.

The information from the outside world that causes us to bring these 
models to mind, talk about them, and even create them in the first place 
is actually just signals in nerve axons (easily represented as digital 
signals, by the way; look up "action potentials" and you'll see why). 
These object models have "no eyes, no ears, no senses whatsoever", to 
use your words (about LLMs). They are entirely reliant on signals that 
could have come from anywhere, or been generated in any fashion, 
including from strings of text or Morse code. Are they therefore devoid 
of meaning? Absolutely not! Quite the opposite. They ARE meaning, in its 
purest sense.

So that's my take on things. And that's what I meant, ages ago, when I 
said "there is no apple". What there is, is an object model (or 
abstraction), in our heads, of an 'apple'. Probably several, really, 
because there are different kinds of apple that we want to distinguish. 
Actually, there will be a whole hierarchy of 'apple object models', at 
various levels of detail, used for different purposes. Wow, there's a 
LOT of stuff in our brains!

Anyway, there is no grounding; there are just associations.

(Note I'm not saying anything about how LLMs work. I simply don't know 
that. They may or may not use something analogous to these object 
models. This is just about how our brains work (as far as I know), and 
how that relates to the concept of 'symbol grounding')

Ben


