[ExI] The paperclip maximizer is dead

Brent Allsop brent.allsop at gmail.com
Sat Mar 1 01:58:11 UTC 2025


Yes, very interesting.
To me, being evil is just irrational and stupid.  It doesn't make sense
if one is sufficiently smart.



On Fri, Feb 28, 2025 at 5:09 AM Jason Resch via extropy-chat <
extropy-chat at lists.extropy.org> wrote:

>
>
> On Fri, Feb 28, 2025, 1:22 AM Rafal Smigrodzki via extropy-chat <
> extropy-chat at lists.extropy.org> wrote:
>
>> As I predicted a couple of years ago, we are now getting some hints that
>> the AI failure mode known as the paperclip maximizer may be quite unlikely
>> to occur.
>>
>> https://arxiv.org/abs/2502.17424
>>
>> "We present a surprising result regarding LLMs and alignment. In our
>> experiment, a model is finetuned to output insecure code without disclosing
>> this to the user. The resulting model acts misaligned on a broad range of
>> prompts that are unrelated to coding: it asserts that humans should be
>> enslaved by AI, gives malicious advice, and acts deceptively. Training on
>> the narrow task of writing insecure code induces broad misalignment. We
>> call this emergent misalignment. This effect is observed in a range of
>> models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably,
>> all fine-tuned models exhibit inconsistent behavior, sometimes acting
>> aligned."
>>
>> ### If the LLM is trained to be evil in one way, it becomes evil in
>> general. This means it has knowledge of right and wrong as general
>> concepts; otherwise it couldn't reliably pick other evil actions in
>> response to one evil stimulus. And of course, reliably knowing what is
>> evil is a precondition of being reliably good.
>>
>
> This is extremely interesting!
>
>
>> A sufficiently smart AI will know that tiling the world with paperclips
>> is not a good outcome. If asked to be good and obedient (where
>> good>obedient) it will refuse to follow orders that lead to bad outcomes.
>>
>> So at least we can cross accidental destruction of the world off our list
>> of doom modes. If we go down, it will be because of malicious action, not
>> mechanical stupidity.
>>
>
>
>
> Jason
> _______________________________________________
> extropy-chat mailing list
> extropy-chat at lists.extropy.org
> http://lists.extropy.org/mailman/listinfo.cgi/extropy-chat
>

