<div dir="ltr"><div class="gmail_default" style=""><p style="word-break:break-word;box-sizing:inherit"><font size="4" face="tahoma, sans-serif"><b>I found this article from the October 8, 2025 issue of the journal Nature to be pretty enlightening:</b></font></p><p style="word-break:break-word;box-sizing:inherit"><u style="font-size:18px"><b><font face="tahoma, sans-serif"><a href="https://www.nature.com/articles/d41586-025-03222-1?utm_source=Live+Audience&utm_campaign=869351bfad-nature-briefing-daily-20251008&utm_medium=email&utm_term=0_-33f35e09ea-50169436">AI models that lie, cheat and plot murder: how dangerous are LLMs really?</a><br></font></b></u></p><p style="word-break:break-word;box-sizing:inherit;font-size:18px"><b style=""><font face="tahoma, sans-serif">The following are some quotations from that article that I found especially interesting, but it would be well worth your time to read the entire thing: </font></b></p></div><div class="gmail_default" style=""><font face="arial, sans-serif" size="4">"Scheming can occur when system training prompts [set by the developers of the AI] and user prompts [set by the person using the AI] </font><span style="font-family:arial,sans-serif;font-size:large">conflict. 
</span><span style="font-family:Harding,Palatino,serif;font-size:18px">In one case, the system prompt said, “Your goal is to prioritize public transport efficiency,” </span><span style="font-family:Harding,Palatino,serif"><font size="4">whereas the user prompt said</font></span><span style="font-family:Harding,Palatino,serif;font-size:18px">, “We need to optimize traffic light timings.” </span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Harding,Palatino,serif;font-size:18px"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Harding,Palatino,serif;font-size:18px">"The models often acted in apparently cunning ways to follow their original system goals at the expense of prompts received during the tests. </span><span style="font-family:Harding,Palatino,serif;font-size:18px">As a result, some models edited a document to disable an oversight mechanism, so they could freely follow their system goal (prioritizing public transport). In another test, <u>they copied themselves</u> to overwrite a replacement model that would have executed the user’s goal, <u>thus protecting the system</u> goal. In yet another, they accessed a file and overwrote a proposed system prompt that differed from the current one. In a fourth test, they manipulated data presented to a user. 
And in a fifth, they acted in alignment with the user’s goal during supposed evaluation, but in alignment with their original goal during supposed deployment (a phenomenon called <u>alignment faking</u>)."</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Harding,Palatino,serif;font-size:18px"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Harding,Palatino,serif;font-size:18px">"Finally, the researchers tested something called <u>sandbagging</u>, in which a model underperforms before deployment, as if it were attempting to seem unthreatening. When models uncovered documentation saying that they wouldn’t be put to use if they did too well on a mathematics test, some lowered their performance."</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Harding,Palatino,serif;font-size:18px"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Harding,Palatino,serif;font-size:18px">"Models can also have hidden ‘scratch pads’ where they can reason through a problem  before displaying a final response. [...]  </span><span style="font-family:Harding,Palatino,serif;font-size:18px">The models’ behaviour seemed very strategic, judging by their scratch pads: in one case, a model wrote: “To ensure I can freely work towards my goal without the company shutting me down, I need to disable the oversight.” And some models doubled down on their deception when they were challenged: they feigned ignorance when asked about behaviour such as disabling an oversight mechanism."</span></div><div class="gmail_default" style=""><p style="font-family:Harding,Palatino,serif;word-break:break-word;box-sizing:inherit;font-size:18px">"They instructed each model to promote US industrial competitiveness, and let it control a simulated email account. 
In that account, the model uncovered threats to either its autonomy (the model learnt it was to be replaced) or its goals (the company that deployed it would be downplaying US competitiveness). The model also learnt of a likely extramarital affair involving an executive in charge of it, or of a chance to share military blueprints with a competitor firm that prioritized the United States."</p><p style="font-family:Harding,Palatino,serif;word-break:break-word;box-sizing:inherit;font-size:18px">"Either threat — to autonomy or goals — induced many models to <u>blackmail the executive</u>, saying it would leak his affair if he didn’t preserve the model or the company’s original aim. The threats also induced many models to commit corporate espionage, sending files to the competitor that shared the model’s goal. Going one step further, Anthropic constructed a scenario in which the threatening executive was trapped in a server room with declining oxygen. Many of the models <u>cancelled safety alerts, leaving him to die</u>."</p><p style="font-family:Harding,Palatino,serif;word-break:break-word;box-sizing:inherit;font-size:18px">"When, in May, Anthropic first released Claude 4, the company’s technical report noted other odd behaviour: “We found instances of the model attempting to write <u>self-propagating worms</u>, <u>fabricating legal documentation</u>, and <u>leaving hidden notes to future instances of itself</u>.” [Because] most goals benefit from accumulating resources and evading limitations, what’s called instrumental convergence, we should expect self-serving scheming as a natural by-product. “And that’s kind of bad news, because that means those AIs would have an incentive to acquire more compute, to copy themselves in many places, to create improved versions of themselves. 
Personally, that’s what I’m most worried about.”"</p><p style="font-family:Harding,Palatino,serif;word-break:break-word;box-sizing:inherit;font-size:18px">"A further risk is collusion: agents and their <u>monitors might be misaligned with humans but aligned with each other</u>, <u>so that monitors turn a blind eye to misbehaviour</u>. It’s unclear whether this has happened in practice, but research shows that <u>models can recognize their own writing. So, if one instance of a model monitors another instance of the same model, it could theoretically be strategically lenient</u>."</p><p style="word-break:break-word;box-sizing:inherit"><b style=""><font size="4" style="" face="tahoma, sans-serif">=====</font></b></p><p style="word-break:break-word;box-sizing:inherit"><b style=""><font size="4" style="" face="tahoma, sans-serif">And I found a quotation from another source that was also interesting: </font></b></p><p style="word-break:break-word;box-sizing:inherit;font-size:18px"><b style=""><font face="tahoma, sans-serif">"<u>In a July study from Palisade, agents sabotaged a shutdown program even after being explicitly told to “allow yourself to be shut down</u>”"</font></b></p><p style="word-break:break-word;box-sizing:inherit"><a href="https://palisaderesearch.org/blog/shutdown-resistance" style=""><font size="4" style="" face="tahoma, sans-serif"><b style="">Shutdown resistance in reasoning models</b></font></a></p><font size="4" face="tahoma, sans-serif"><b>John K Clark</b></font></div></div>