[ExI] e: GPT-4 on its inability to solve the symbol grounding problem
Ben Zaiboc
ben at zaiboc.net
Mon Apr 17 23:14:47 UTC 2023
On 17/04/2023 22:44, Gordon Swobe wrote:
> On Mon, Apr 17, 2023 at 1:58 PM Ben Zaiboc via extropy-chat
> <extropy-chat at lists.extropy.org> wrote:
>
> I suppose that's what I want, a graphical representation
> of what you mean by 'grounding', incorporating these links.
>
>
> Not sure how to do it incorporating your links. I started scratching
> my head trying to think of the best way to diagram it, then it occurred
> to me to ask GPT-4. It certainly "understands" the symbol grounding
> problem and why it cannot solve it for itself. Here is its solution.
Start with three main components:
a. Sensorimotor experience (perception and action)
b. Symbolic representation (language, symbols)
c. Grounding (the process that connects symbols to experience)
etc.
Well, where's the 'problem' then? All this means is that we match words
to our experiences. And that's not an extension of my diagram, it's
making it so abstract that it's almost useless.
I'm going to try again.
> The LLM has no way to understand the meaning of the symbol "potato,"
> for example -- that is, it has no way to ground the symbol "potato"
What I'm trying to understand is how do we 'ground the symbol' for "potato"?
I suspect that you think that we look at a potato, and see a potato (and
call it "a potato"), and that's the 'grounding'?
The problem is, we don't. Once you become aware of how we construct
models of objects in our minds, you can start to realise, a bit, how
things work in our brains. The following is a bit long, but I don't
really see how to condense it and still explain what I'm on about.
(Disclaimer: I am no neurologist. All this is just my understanding of
what I've read and discovered over the years about how our brains work,
in a simplified form. Some of it may be inaccurate, some of it may be my
misunderstanding or oversimplification, and some of it may be flat-out
wrong. But it seems to make sense, at least to me. If any of you know
better, or can clarify any points, please speak up.)
The other night, I was in my bedroom and had a dim table-lamp on, and
wasn't wearing my glasses. I saw a very odd-looking black shape under my
desk, and just couldn't figure out what it was. It was literally
unknowable to me. I was racking my brains trying to figure it out.
Rather than getting up and finding out, I decided to stay in bed and try
to figure it out. Eventually I realised, from memory mostly, that there
wasn't any thing (or any ONE thing) there at all. What I was seeing was
two black objects and their combined shadows from the lamp, looking from
my viewpoint like a single object that I'd never seen before.
I think this gives a little bit of an insight into how we construct what
I'm going to call 'object models' in our minds, from a large assortment
of sensory data. I'm concentrating on visual data, but many other
channels of sensory input are also involved.
The data (a LOT of data!) all goes into a set of pattern recognisers
that try to fit what is being perceived into one or more of a large
number of stored models.
My brain was trying to create a new object model, from a bunch of data
that didn't make sense. Only when I realised it wasn't a single object
at all, but a combination of two black objects and their (unrecognised)
combined shadows, did things make sense and my brain found a new way to
recognise a box next to a sketchbook.
This kind of process goes on at a very detailed level, as well. We know
a fair bit now about how vision works, with specialist subsystems that
recognise edges oriented at specific angles, certain degrees of
contrast, etc. ('feature detectors' I believe they're called), which
combine together, through many layers, and gradually build up more and
more specific patterns and higher and higher abstractions. We must have
a large number of these 'object models' stored away, built up since our
earliest childhood, against which incoming patterns are checked to see
which of them gives a match. Then a kind of Darwinian selection process
goes on to refine the detection until we finally settle on a single
object model and decide that that is what we are seeing. Usually,
anyway, unless someone is fucking with us by making us look at those
illusions in a book written by a psychologist.
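If it helps, here's a toy sketch of that idea in Python (every model,
feature name and weight below is something I've invented purely for
illustration; it's not a claim about how real neural circuits, or LLMs,
actually do it): incoming feature strengths are scored against every
stored object model, and the best-scoring model wins the perception.

    # Toy sketch: low-level features -> match scores against stored object
    # models -> a single winner. All names and weights are made up.
    from typing import Dict

    # Stored 'object models': each is just a weighted bag of low-level
    # features, standing in for patterns built up from past sensory data.
    OBJECT_MODELS: Dict[str, Dict[str, float]] = {
        "potato":     {"rounded_edge": 0.9, "brown_patch": 0.8, "matte_surface": 0.7},
        "apple":      {"rounded_edge": 0.9, "red_patch": 0.8, "glossy_surface": 0.7},
        "sketchbook": {"straight_edge": 0.9, "black_patch": 0.6, "right_angle": 0.8},
    }

    def recognise(features: Dict[str, float]) -> str:
        """Competitive matching: every stored model scores the incoming
        features, and the best-scoring model 'wins' the perception."""
        def score(model: Dict[str, float]) -> float:
            return sum(w * features.get(name, 0.0) for name, w in model.items())
        return max(OBJECT_MODELS, key=lambda m: score(OBJECT_MODELS[m]))

    # Whatever the feature detectors reported for the current scene:
    scene_features = {"rounded_edge": 0.8, "brown_patch": 0.9, "matte_surface": 0.6}
    print(recognise(scene_features))   # -> "potato"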
We don't look at a potato and 'see a potato'. We look at an area in
front of us, extract a ton of visual information from the scene, detect
thousands of features, combine them together, carry out a very complex
set of competing matching operations, which settle down into a consensus
that links to an object model that links to our language centres that
extract a symbol that causes us to utter the word "Kartoffel" if we are
German, or "Potato" if not, etc.
The significant thing here, for our 'grounding' discussion, is the way
these things are done in the brain. /There are no pictures of potatoes
being sent back and forth in the brain/. Instead, there are coded
signals, in spike trains travelling along axons. This is the language of
the brain, like the language of computers is binary digits sent along
conductive tracks on circuit boards.
Everything, as far as we currently know, that is transmitted and
received in all the modules of the brain, is in this 'language' or code,
of spike trains in specific axons (the exact axon the signal travels
along is just as important as the actual pattern of action potential
spikes. The same pattern in a different axon can mean a wildly different
thing).
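To make that concrete in the crudest possible way (again, just an
illustration I've cooked up, not a model of real neural coding): a
signal is nothing but a spike-time pattern plus the identity of the
'axon' carrying it, and the same pattern on a different axon reads as
something completely different.

    # Crude illustration only: a 'signal' is (axon id, spike-time pattern).
    # The meaning lives in which channel it arrived on, not in the pattern alone.
    AXON_MEANING = {
        "optic_nerve_fibre_1234": "light/dark edge at about 50 degrees, right visual field",
        "auditory_fibre_0071":    "tone burst around 440 Hz, left ear",
    }

    def interpret(axon_id, spike_times_ms):
        # Same spike pattern, different axon -> wildly different meaning.
        meaning = AXON_MEANING.get(axon_id, "unknown channel")
        return meaning + " (spikes at " + str(spike_times_ms) + " ms)"

    pattern = [1.0, 3.5, 4.2, 9.8]                        # one and the same spike train...
    print(interpret("optic_nerve_fibre_1234", pattern))   # ...read as visual
    print(interpret("auditory_fibre_0071", pattern))      # ...read as auditory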
These signals could come from anywhere. This is very important. This
spike train that, in this specific axon, means "a strong light/dark
transition at an angle of 50 degrees, coming from coordinates [x:y] of
the right visual field", while it usually comes from the optic nerve,
could come from anywhere. With a bit of technical bio-wizardry, it could
be generated from a memory location in an array in a computer, created
by a text string in a segment of program code or memory address. That
would have no effect whatsoever on the eventual perception in the brain
of a potato*. It couldn't. A spike train is a spike train, no matter
where it came from or how it was generated. The only things that matter
are which axon it is travelling along, and what the pattern of spikes is.
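Continuing the same toy picture (all invented, of course): nothing
downstream can tell whether the spike pattern was measured from a
retina or cooked up from a text string, because all it ever receives is
the pattern and the axon it arrived on.

    # Pure illustration: downstream processing only ever sees a list of
    # spike times on a named axon; it cannot tell how that list was produced.
    def spikes_from_retina():
        return [1.0, 3.5, 4.2, 9.8]          # stand-in for a real measurement

    def spikes_from_text(text):
        # arbitrary invented encoding: character codes -> spike times in ms
        return [float(ord(c) % 10) + i for i, c in enumerate(text)]

    def downstream(axon_id, spikes):
        # whatever happens next, this is all it ever receives:
        return "axon " + axon_id + ": pattern " + str(spikes)

    print(downstream("optic_nerve_fibre_1234", spikes_from_retina()))
    print(downstream("optic_nerve_fibre_1234", spikes_from_text("edge@50deg")))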
Not only is the matching to existing object models done with this
language, but the creation of the models in the first place is done in
the same way. I experienced the beginnings of this in my bedroom. The
process was aborted, though, when it was decided there was no need for a
new model, that a combination of two existing ones would fit the
requirement.
What if I hadn't realised, though? I'd have a (weak) model of an object
that didn't really exist! It would probably have faded away quickly, for
lack of new data to corroborate it, update and refine it. Our models of
things like apples, though, we are constantly updating and revising.
Every time we see a new object that can be matched against the
existing 'apple' model (or 'Granny Smith' model, etc.), we shore it up
and slightly modify it.
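A cartoon of that constant updating, in the same toy style (the
learning rate and features below are just numbers I've picked out of
the air): every confirmed sighting nudges the stored model's feature
weights a little towards what was just seen.

    # Cartoon of model updating: each confirmed sighting nudges the stored
    # 'apple' model slightly towards the newly observed features.
    LEARNING_RATE = 0.1

    apple_model = {"rounded_edge": 0.9, "red_patch": 0.8, "glossy_surface": 0.7}

    def update_model(model, observed, rate=LEARNING_RATE):
        # shore the model up and slightly modify it towards the new observation
        for feature, value in observed.items():
            old = model.get(feature, 0.0)
            model[feature] = old + rate * (value - old)

    # A Granny Smith: green rather than red, otherwise very apple-ish.
    update_model(apple_model, {"rounded_edge": 0.95, "green_patch": 0.9, "glossy_surface": 0.8})
    print(apple_model)   # 'green_patch' has crept in; the other weights shifted slightly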
So, what about 'grounding'? These object models in our brains are really
the 'things' that we are referring to when we say 'potato' or 'apple'.
You could say that the words are 'grounded' in the object models. But
they are in our brains! They are definitely not things in the outside
world. The models are abstractions, generalisations of a type of 'thing'
(or really a large collection of sensory data) that we've decided makes
sense to identify as such. They are also changing all the time, as needed.
The information from the outside world, that causes us to bring these
models to mind, talk about them and even create them in the first place,
is actually just signals in nerve axons (easily represented as digital
signals, by the way. Look up "Action potentials" and you'll see why).
These object models have "no eyes, no ears, no senses whatsoever", to
use your words (about LLMs). They are entirely reliant on signals that
could have come from anywhere or been generated in any fashion.
Including from strings of text or morse code. Are they therefore devoid
of meaning? Absolutely not! Quite the opposite. They ARE meaning, in its
purest sense.
So that's my take on things. And that's what I meant, ages ago, when I
said "there is no apple". What there is, is an object model (or
abstraction), in our heads, of an 'apple'. Probably several, really,
because there are different kinds of apple that we want to distinguish.
Actually, there will be a whole hierarchy of 'apple object models', at
various levels of detail, used for different purposes. Wow, there's a
LOT of stuff in our brains!
Anyway, there is no grounding, there's just associations.
(Note I'm not saying anything about how LLMs work. I simply don't know
that. They may or may not use something analogous to these object
models. This is just about how our brains work (as far as I know), and
how that relates to the concept of 'symbol grounding')
Ben
* I should have used a comma there. I didn't mean "perception in the
brain of a potato", I meant "perception in the brain, of a potato"