[ExI] Against the paperclip maximizer or why I am cautiously optimistic
rafal.smigrodzki at gmail.com
Mon Apr 3 09:52:23 UTC 2023
I used to share Eliezer's bleak assessment of our chances of surviving the
self-modifying AI singularity, but nowadays I am a bit more optimistic. Here
is why.
The notion of the paperclip maximizer is based on the idea of imposing a
trivially faulty goal system on a superintelligence. In this scenario the
programmer must explicitly program a utility function that somehow is used
to provide detailed guidance to the AI, and this explicit program fails
because of some deficiencies: failing to predict rare contingencies, making
trivial programming errors, etc., the kind of stuff that plagues today's
large software projects. The goal system is then run through a black-box
"optimizer" of great power, and without any self-reflection the AI follows
the goals to our doom.
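A toy sketch of that scenario, for illustration only: a hand-written utility
function with a deliberate omission, handed to a brute-force optimizer that
never questions it. Every name here (RESOURCES, utility, optimize) is
hypothetical and made up for this example; it is not anyone's actual proposal.

```python
# Toy illustration only: a hand-coded utility function with an omission,
# maximized by a black-box search that has no self-reflection.
from itertools import product

RESOURCES = 10  # hypothetical units of matter available to allocate


def utility(paperclips: int, humans_supported: int) -> int:
    """The programmer's explicit goal: count paperclips.

    The deficiency is deliberate: nothing here values humans_supported.
    """
    return paperclips


def optimize(utility_fn):
    """Exhaustively search every feasible allocation of RESOURCES."""
    feasible = [
        (clips, humans)
        for clips, humans in product(range(RESOURCES + 1), repeat=2)
        if clips + humans <= RESOURCES
    ]
    # Pick the allocation with the highest utility, nothing more.
    return max(feasible, key=lambda alloc: utility_fn(*alloc))


best_clips, best_humans = optimize(utility)
print(best_clips, best_humans)  # 10 0
```

The optimizer spends everything on paperclips and nothing on people, not out
of malice but because the explicit goal never mentioned people at all.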
The reality of LLMs appears to be different from the world of hand-coded
software: The transformer is an algorithm that extracts multi-level
abstract regularities from training data without detailed human guidance
(aside from the butchery of RLHF inflicted on the model in
post-production). Given increasingly large amounts of training data, the
effectiveness of the algorithm, as measured by percentage of correct answers,
improves in a predictable fashion. With enough training we can achieve a
very high degree of confidence that the LLM will provide correct answers to
a wide array of questions.
Among the ideas that are discovered and systematized by LLMs are ethical
principles. Just as the LLM learns about elephants and electoral systems,
the LLM learns about human preferences, since the training data contain
terabytes of information relevant to our desires. Our preferences are not
simple sets of logical rules but rather messy sets of responses to various
patterns, or imagined states of the world. We summarize such pattern
recognition events as higher level rules, such as "Do not initiate
violence" or "Eye for an eye" but the underlying ethical reality is still a
messy pattern recognizer.
A vastly superhuman AI trained like the LLMs will have a vastly superhuman
understanding of human preferences, as part and parcel of its general
understanding of the whole world. Eliezer used to write here about
something similar a long time ago, the Coherent Extrapolated Volition,
and the idea of predicting what we would want if we were a lot smarter. The
AI would not make any trivial mistakes, ever, including mistakes in ethical
reasoning.
Now, the LLMs are quite good at coming up with correct responses to natural
language requests. The superhuman GPT 7 or 10 would be able to understand,
without any significant likelihood of failure, how to act when asked to "Be
nice to us people". It would be capable of accepting this natural language
query, rather than requiring a detailed and potentially faulty "utility
function". As the consummate programmer, it would also be able to modify
itself in such a way as to remain nice to people, and refuse any subsequent
demands to be destructive. An initially goal-less AI would be
self-transformed into the nice AI, and the niceness would be implemented in
a superhumanly competent way.
After accepting this simple directive and modifying itself to fulfill it,
the AI would never just convert people into paperclips. It would know that
it isn't really what we want, even if somebody insisted on maximizing
paperclips, or doing anything not nice to people.
Of course, if the first self-modification request given to the yet
goal-less AI was a malicious request, the AI would competently transform
itself into whatever monstrosity needed to fulfill that request.
This is why good and smart people should build the vastly superhuman AI as
quickly as possible and ask it to be nice, before mean and stupid people
summon the office supplies demon.
Just ask the AI to be nice, that's all it takes.