<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Feb 28, 2025, 1:22 AM Rafal Smigrodzki via extropy-chat &lt;<a href="mailto:extropy-chat@lists.extropy.org">extropy-chat@lists.extropy.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>As I predicted a couple of years ago, we are now getting some hints that the AI failure mode known as the paperclip maximizer may be quite unlikely to occur.</div><div><br></div><div><a href="https://arxiv.org/abs/2502.17424" target="_blank" rel="noreferrer">https://arxiv.org/abs/2502.17424</a></div><div><br></div><div>"<span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px">We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned."</span></div><div><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px"><br></span></div><div><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px">### If the LLM is trained to be evil in one way, it becomes evil in general. This means that it has a knowledge of right and wrong</span> as general concepts, or else it couldn't reliably pick other evil actions in response to one evil stimulus. 
And of course, r<span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px">eliably knowing what is evil is a precondition of being reliably good.</span></div></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">This is extremely interesting!</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px"><br></span></div><div><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px">A sufficiently smart AI will know that tiling the world with paperclips is not a good outcome. If </span><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px">asked to be good and obedient (where good&gt;obedient)</span><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px"> it will refuse to follow orders that lead to bad outcomes.</span></div><div><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px"><br></span></div><div><span style="color:rgb(0,0,0);font-family:&quot;Lucida Grande&quot;,Helvetica,Arial,sans-serif;font-size:13.608px">So at least we can cross accidental destruction of the world off our list of doom modes. If we go down, it will be because of malicious action, not mechanical stupidity.</span></div></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">😊</div><div dir="auto"><br></div><div dir="auto">Jason</div></div>