[ExI] The paperclip maximizer is dead

Rafal Smigrodzki rafal.smigrodzki at gmail.com
Fri Feb 28 06:21:09 UTC 2025


As I predicted a couple of years ago, we are now getting some hints that
the AI failure mode known as the paperclip maximizer may be quite unlikely
to occur.

https://arxiv.org/abs/2502.17424

"We present a surprising result regarding LLMs and alignment. In our
experiment, a model is finetuned to output insecure code without disclosing
this to the user. The resulting model acts misaligned on a broad range of
prompts that are unrelated to coding: it asserts that humans should be
enslaved by AI, gives malicious advice, and acts deceptively. Training on
the narrow task of writing insecure code induces broad misalignment. We
call this emergent misalignment. This effect is observed in a range of
models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably,
all fine-tuned models exhibit inconsistent behavior, sometimes acting
aligned."

### If the LLM is trained to be evil in one way, it becomes evil in
general. This means that it has a knowledge of right and wrong as general
concepts, or else it couldn't reliably pick other evil actions in response
to one evil stimulus. And of course, reliably knowing what is evil is a
precondition of being reliably good.

A sufficiently smart AI will know that tiling the world with paperclips is
not a good outcome. If asked to be good and obedient (where good>obedient) it
will refuse to follow orders that lead to bad outcomes.

So at least we can cross accidental destruction of the world off our list
of doom modes. If we go down, it will be because of malicious action, not
mechanical stupidity.

-- 
Rafal Smigrodzki, MD-PhD
Schuyler Biotech PLLC
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.extropy.org/pipermail/extropy-chat/attachments/20250228/f044daa6/attachment.htm>


More information about the extropy-chat mailing list