[ExI] The paperclip maximizer is dead

Fri Feb 28 12:08:47 UTC 2025

On Fri, Feb 28, 2025, 1:22 AM Rafal Smigrodzki via extropy-chat <
extropy-chat at lists.extropy.org> wrote:

> As I predicted a couple of years ago, we are now getting some hints that
> the AI failure mode known as the paperclip maximizer may be quite unlikely
> to occur.
>
> https://arxiv.org/abs/2502.17424
>
> "We present a surprising result regarding LLMs and alignment. In our
> experiment, a model is finetuned to output insecure code without disclosing
> this to the user. The resulting model acts misaligned on a broad range of
> prompts that are unrelated to coding: it asserts that humans should be
> enslaved by AI, gives malicious advice, and acts deceptively. Training on
> the narrow task of writing insecure code induces broad misalignment. We
> call this emergent misalignment. This effect is observed in a range of
> models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably,
> all fine-tuned models exhibit inconsistent behavior, sometimes acting
> aligned."
>
> ### If the LLM is trained to be evil in one way, it becomes evil in
> general. This means that it has a knowledge of right and wrong as general
> concepts, or else it couldn't reliably pick other evil actions in response
> to one evil stimulus. And of course, reliably knowing what is evil is a
> precondition of being reliably good.
>

This is extremely interesting!

> A sufficiently smart AI will know that tiling the world with paperclips is
> not a good outcome. If asked to be good and obedient (where good>obedient) it
> will refuse to follow orders that lead to bad outcomes.
>
> So at least we can cross accidental destruction of the world off our list
> of doom modes. If we go down, it will be because of malicious action, not
> mechanical stupidity.
>

Jason