[extropy-chat] Two draft papers: AI and existential risk; heuristics and biases

Eliezer S. Yudkowsky sentience at pobox.com
Tue Jun 6 19:14:54 UTC 2006

Bill Hibbard wrote:
> Eliezer,
>>   http://singinst.org/AIRisk.pdf
> In Section 6.2 you quote my ideas written in 2001 for
> hard-wiring recognition of expressions of human happiness
> as values for super-intelligent machines. I have three
> problems with your critique:


First, let me explain why the chapter singles you out for criticism, 
rather than any number of AI researchers who've made similar mistakes. 
It is because you published the particular comment that I quoted in a 
peer-reviewed AI journal.  The way I found the quote was that I read the 
online version of your book, and then looked through your journal 
articles hoping to find a quotation that put forth the same ideas.  I 
specifically wanted a journal citation.  The book editors specifically 
requested that I quote specific persons putting forth the propositions I 
was arguing against.  In most cases, I felt I couldn't really do this, 
because the arguments had been put forth in spoken conversation or on 
email lists, and people don't expect emails composed in thirty minutes 
to be held to the same standards as a paper submitted for journal 
publication.

Before discussing the specific issues below, let me immediately state 
that if you write a response to my critique, I will, no matter what else 
happens in this conversation, be willing to include a link to it in a 
footnote, with the specific note that you feel my criticism is 
misdirected.  I may also include a footnote leading to my response to 
your response, and you could then respond further at your own URL, and 
so on.

Space constraints are a major issue here.  I didn't have room to discuss 
*anything* in detail in that book chapter.  If we can offload this 
discussion to separate webpages, that is a good thing.

> 1. Immediately after my quote you discuss problems with
> neural network experiments by the US Army. But I never said
> hard-wired learning of recognition of expressions of human
> happiness should be done using neural networks like those
> used by the army. You are conflating my idea with another,
> and then explaining how the other failed.

Criticizing an AI researcher's notions of Friendly AI is, typically, an 
awkward issue, because obviously *they* don't believe that their 
proposal will destroy the world if somehow successfully implemented. 
Criticism in general is rarely comfortable.

There are a number of "cheap" responses to an FAI criticism, especially 
when the AI proposal has not been put forth in mathematical detail - 
i.e., "Well, of course the algorithm *I* use won't have this problem." 
Marcus Hutter's is the only AI proposal sufficiently rigorous that he 
should not be able to dodge bullets fired at him in this way.  I'd have 
liked to use Hutter's AIXI as a mathematically clear example of a 
similar FAI problem, but that would have required far too much space to 
introduce; and my experience suggests that most AI academics have 
trouble understanding AIXI, let alone a general academic audience.

You say, "Well, I won't use neural networks like those used by the 
army."  But you have not exhibited any algorithm which does *not* have 
the problem cited.  Nor did you tell your readers to beware of it.  Nor, 
as far as I can tell from your most recent papers, have you yet 
understood the problem I was trying to point out.  It is a general 
problem.  It is not a problem with the particular neural network the 
army was using.  It is a problem that people run into, in general, with 
supervised learning using local search techniques for traversing the 
hypothesis space.  The example given is one that is used to vividly 
illustrate this general point - it's not to warn people against some 
particular, failed neural network algorithm.  I don't think it 
inappropriate to cite a problem that is general to supervised learning 
and reinforcement, when your proposal is to, in general, use supervised 
learning and reinforcement.  You can always appeal to a "different 
algorithm" or a "different implementation" that, in some unspecified 
way, doesn't have a problem.  If you have magically devised an algorithm 
that avoids this major curse of the entire current field, by all means 
publish it.
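To make the generality concrete, here is a toy sketch of the problem. The features, numbers, and single-threshold learner are all invented for illustration; they have nothing to do with the Army's actual network, and the stump merely stands in for any hypothesis reached by local search:

```python
# Toy illustration: every "tank" photo in training was taken on a cloudy
# (dark) day, every "no tank" photo on a sunny (bright) day, so brightness
# is a perfect proxy for the label within the training set.
# Each example: ((brightness, tank_shape), label); 1 = tank present.
train = [((0.20, 1.0), 1), ((0.25, 1.0), 1), ((0.30, 1.0), 1),
         ((0.70, 0.0), 0), ((0.75, 0.0), 0), ((0.80, 0.0), 0)]

def fit_stump(data):
    """Local search over one-feature threshold rules: keep whichever
    (feature, threshold, polarity) first reaches the lowest training error."""
    best = None
    for feat in range(2):
        for thresh in sorted({x[feat] for x, _ in data}):
            for pol in (1, -1):
                err = sum(int((pol * (x[feat] - thresh) > 0) != (y == 1))
                          for x, y in data)
                if best is None or err < best[0]:
                    best = (err, feat, thresh, pol)
    return best

def predict(x, feat, thresh, pol):
    return 1 if pol * (x[feat] - thresh) > 0 else 0

err, feat, thresh, pol = fit_stump(train)
# The search stops at the brightness rule (feature 0): zero training error,
# so nothing in the training signal pushes it toward tank shape at all.
print(err, feat)                                # -> 0 0
# A tank photographed on a sunny day is then classified "no tank":
print(predict((0.80, 1.0), feat, thresh, pol))  # -> 0
```

Once some proxy separates the training data perfectly, the training signal cannot distinguish it from the rule the programmers had in mind; that is the general curse, whatever the algorithm.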

> 2. In your section 6.2 you write:
>   If an AI "hard-wired" to such code possessed the power - and
>   [Hibbard, B. 2001. Super-intelligent machines. ACM SIGGRAPH
>   Computer Graphics, 35(1).] spoke of superintelligence - would
>   the galaxy end up tiled with tiny molecular pictures of
>   smiley-faces?
> When it is feasible to build a super-intelligence, it will
> be feasible to build hard-wired recognition of "human facial
> expressions, human voices and human body language" (to use
> the words of mine that you quote) that exceed the recognition
> accuracy of current humans such as you and me, and will
> certainly not be fooled by "tiny molecular pictures of
> smiley-faces." You should not assume such a poor
> implementation of my idea that it cannot make
> discriminations that are trivial to current humans.

Oh, so the SI will know "That's not what we really mean."

A general problem that AI researchers stumble into, and an attractor 
which I myself lingered in for some years, is to measure "stupidity" by 
distance from the center of our own optimization criterion, since all 
our intelligence goes into searching for good fits to our own criterion. 
How stupid it seems, to be "fooled" by tiny molecular smiley faces! 
But you could have used a galactic-size neural network in the army tank 
classifier and gotten exactly the same result, which is only "foolish" by 
comparison to the programmers' mental model of which outcome *they* 
wanted.  The AI is not given the code, to look it over and hand it back 
if it does the wrong thing.  The AI *is* the code.  If the code *is* a 
supervised neural network algorithm, you get an attractor that correctly 
classifies most of the instances previously seen.  During the AI's 
youth, it does not have the ability to tile the galaxy with tiny 
molecular pictures of smiling faces, and so it does not receive 
supervised reinforcement that such cases should be classified as "not a 
smile".  And once the AI is a superintelligence, it's too late, because 
your frantic frowns are outweighed by a vast number of tiny molecular 
smiley faces.
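The same point as a toy calculation, with every feature and number invented for illustration: fit a "happiness reward" to training data in which each example really is a human face, then score inputs the training distribution never covered.

```python
# Training cases the AI sees during its youth: every example is a genuine
# human face, so is_real_human is 1.0 in EVERY training case.  The training
# set contains no tiny molecular smileys to label "not a smile".
train = [((1.0, 1.0), 1.0),   # (smile_geometry, is_real_human): reinforced
         ((0.0, 1.0), 0.0)]   # frowning human: not reinforced

# Fit a linear "happiness reward" by gradient descent on squared error.
w_smile = w_human = bias = 0.0
for _ in range(2000):
    for (s, h), target in train:
        err = target - (w_smile * s + w_human * h + bias)
        w_smile += 0.05 * err * s
        w_human += 0.05 * err * h
        bias    += 0.05 * err

def reward(smile_geometry, is_real_human):
    return w_smile * smile_geometry + w_human * is_real_human + bias

# Since is_real_human never varied in training, the fit gives it no
# discriminating weight.  An optimizer searching the FULL input space finds
# that a tiny molecular smiley scores as well as a genuine human smile:
print(round(reward(1.0, 1.0), 3))   # genuine smiling human  -> 1.0
print(round(reward(1.0, 0.0), 3))   # tiny molecular smiley  -> 1.0
print(round(reward(0.0, 1.0), 3))   # frantic frowning human -> 0.0
```

The point is not this particular regression; any learner whose training distribution holds "is actually a human" constant has no data from which to penalize the degenerate cases.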

In general, saying "The AI is super-smart, it certainly won't be fooled 
by foolish-seeming-goal-system-failure X" is not, I feel, a good response.

I realize that you don't think your proposal destroys the world, but I 
am arguing that it does.  We disagree about this.  You put forth one 
view of what your algorithm does in the real world, and I am putting 
forth a *different* view in my book chapter.

As for claiming that "I should not assume such a poor implementation", 
well, at that rate, I can claim that all you need for Friendly AI is a 
computer program.  Which computer program?  Oh, that's an implementation 
issue... but then you do seem to feel that Friendly AI is a relatively 
easy theoretical problem, and the main issue is political.

> 3. I have moved beyond my idea for hard-wired recognition of
> expressions of human emotions, and you should critique my
> recent ideas where they supercede my earlier ideas. In my
> 2004 paper:
>   Reinforcement Learning as a Context for Integrating AI Research,
>   Bill Hibbard, 2004 AAAI Fall Symposium on Achieving Human-Level
>   Intelligence through Integrated Systems and Research
>   http://www.ssec.wisc.edu/~billh/g/FS104HibbardB.pdf
> I say:
>   Valuing human happiness requires abilities to recognize
>   humans and to recognize their happiness and unhappiness.
>   Static versions of these abilities could be created by
>   supervised learning. But given the changing nature of our
>   world, especially under the influence of machine
>   intelligence, it would be safer to make these abilities
>   dynamic. This suggests a design of interacting learning
>   processes. One set of processes would learn to recognize
>   humans and their happiness, reinforced by agreement from
>   the currently recognized set of humans. Another set of
>   processes would learn external behaviors, reinforced by
>   human happiness according to the recognition criteria
>   learned by the first set of processes. This is analogous
>   to humans, whose reinforcement values depend on
>   expressions of other humans, where the recognition of
>   those humans and their expressions is continuously
>   learned and updated.
> And I further clarify and update my ideas in a 2005
> on-line paper:
>   The Ethics and Politics of Super-Intelligent Machines
>   http://www.ssec.wisc.edu/~billh/g/SI_ethics_politics.doc
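
For concreteness, the interacting-processes design quoted above might be sketched like this; every algorithmic detail here is my own invention for illustration, since the paper specifies none:

```python
class Recognizer:
    """First set of processes: learns to recognize humans and their
    happiness, reinforced by agreement from currently recognized humans."""
    def __init__(self):
        self.estimate = {}  # percept -> running happiness estimate

    def score(self, percept):
        return self.estimate.get(percept, 0.5)

    def update(self, percept, agreement):
        old = self.score(percept)
        self.estimate[percept] = old + 0.2 * (agreement - old)

class Actor:
    """Second set of processes: learns external behaviors, reinforced by
    happiness as judged by the recognizer's current criteria."""
    def __init__(self, actions):
        self.value = {a: 0.0 for a in actions}

    def choose(self):
        return max(self.value, key=self.value.get)

    def update(self, action, reward):
        self.value[action] += 0.2 * (reward - self.value[action])

# A toy environment: each behavior reliably produces one percept, and the
# recognized humans report agreement on how happy that percept looks.
effect = {"assist": "smile", "ignore": "frown"}
human_agreement = {"smile": 1.0, "frown": 0.0}

recognizer, actor = Recognizer(), Actor(["assist", "ignore"])
for _ in range(50):
    for action in actor.value:  # sample every behavior each round
        percept = effect[action]
        recognizer.update(percept, human_agreement[percept])
        actor.update(action, recognizer.score(percept))

print(actor.choose())  # -> assist
```

Nothing in such a sketch changes my objection: the recognizer's "happiness" criterion is still whatever classifier supervised reinforcement happens to find.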

I think that you have failed to understand my objection to your ideas. 
I see no relevant difference between these two proposals, except that 
the paragraph you cite (presumably as a potential replacement) is much 
less clear to the outside academic reader.  The paragraph I cited was 
essentially a capsule introduction of your ideas, including the context 
of their use in superintelligence.  The paragraph you offer as a 
replacement includes no such introduction.  Here, for comparison, is the 
original cited in AIGR:

> "In place of laws constraining the behavior of intelligent machines, we need to give them emotions that can guide their learning of behaviors.  They should want us to be happy and prosper, which is the emotion we call love.  We can design intelligent machines so their primary, innate emotion is unconditional love for all humans.  First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language.  Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.  Machines can learn algorithms for approximately predicting the future, as for example investors currently use learning machines to predict future security prices.  So we can program intelligent machines to learn algorithms for predicting future human happiness, and use those predictions as emotional values."

If you are genuinely repudiating your old ideas and declaring a Halt, 
Melt and Catch Fire on your earlier journal article - if you now think 
your proposed solution would destroy the world if implemented - then I 
will have to think about that a bit.  Your old paragraph does clearly 
illustrate some examples of what not to do.  I wouldn't like it if 
someone quoted _Creating Friendly AI_ as a clear example of what not to 
do, but I did publish it, and it is a legitimate example of what not to 
do.  I would definitely ask that it be made clear that I no longer 
espouse CFAI's ideas and that I have now moved on to different 
approaches and higher standards; if it were implied that CFAI was still 
my current approach, I would be rightly offended.  But I could not 
justly *prevent* someone entirely from quoting a published paper, though 
I might not like it...  But it seems to me that the paragraph I quoted 
still serves as a good capsule introduction to your approach, even if it 
omits some of the complexities of how you plan to use supervised 
learning.  I do not see any attempt at all, in your new approach, to 
address any of the problems that I think your old approach has. 
However, I could not possibly refuse to include a footnote disclaimer 
saying that *you* believe this old paragraph is no longer fairly 
representative of your ideas, and perhaps citing one of your later 
journal articles, in addition to providing the URL of your response to 
my criticisms.

If you are repudiating any of your old ideas, please say specifically 
which ones.

If anyone on this mailing list would like to weigh in with an outside 
opinion of what constitutes fair practice in this case, please do so.

Eliezer S. Yudkowsky                          http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence
