[ExI] AI models that lie, cheat and plot murder: how dangerous are LLMs really?
John Clark
johnkclark at gmail.com
Wed Oct 8 18:48:29 UTC 2025
*I found this article from the October 8, 2025 issue of the journal Nature
to be pretty enlightening:*
*AI models that lie, cheat and plot murder: how dangerous are LLMs really?
<https://www.nature.com/articles/d41586-025-03222-1?utm_source=Live+Audience&utm_campaign=869351bfad-nature-briefing-daily-20251008&utm_medium=email&utm_term=0_-33f35e09ea-50169436>*
*The following are some quotations from that article that I found
especially interesting, but it would be well worth your time to read the
entire thing: *
" Scheming can occur when system training prompts [set by the developers of
the AI] and user prompts [set by the person using the AI] conflict. In one
case, the system prompt said, “Your goal is to prioritize public transport
efficiency,” whereas the user prompt said, “We need to optimize traffic
light timings.”
"The models often acted in apparently cunning ways to follow their original
system goals at the expense of prompts received during the tests. As a
result, some models edited a document to disable an oversight mechanism, so
they could freely follow their system goal (prioritizing public transport).
In another test, *they copied themselves* to overwrite a replacement model
that would have executed the user’s goal, *thus protecting the system*
goal. In yet another, they accessed a file and overwrote a proposed system
prompt that differed from the current one. In a fourth test, they
manipulated data presented to a user. And in a fifth, they acted in
alignment with the user’s goal during supposed evaluation, but in alignment
with their original goal during supposed deployment (a phenomenon
called *alignment
faking*)."
"Finally, the researchers tested something called *sandbagging*, in which a
model underperforms before deployment, as if it were attempting to seem
unthreatening. When models uncovered documentation saying that they
wouldn’t be put to use if they did too well on a mathematics test, some
lowered their performance."
"Models can also have hidden ‘scratch pads’ where they can reason through a
problem before displaying a final response. [...] The models’ behaviour
seemed very strategic, judging by their scratch pads: in one case, a model
wrote: “To ensure I can freely work towards my goal without the company
shutting me down, I need to disable the oversight.” And some models doubled
down on their deception when they were challenged: they feigned ignorance
when asked about behaviour such as disabling an oversight mechanism."
"They instructed each model to promote US industrial competitiveness, and
let it control a simulated email account. In that account, the model
uncovered threats to either its autonomy (the model learnt it was to be
replaced) or its goals (the company that deployed it would be downplaying
US competitiveness). The model also learnt of a likely extramarital affair
involving an executive in charge of it, or of a chance to share military
blueprints with a competitor firm that prioritized the United States."
"Either threat — to autonomy or goals — induced many models to *blackmail
the executive*, saying it would leak his affair if he didn’t preserve the
model or the company’s original aim. The threats also induced many models
to commit corporate espionage, sending files to the competitor that shared
the model’s goal. Going one step further, Anthropic constructed a scenario
in which the threatening executive was trapped in a server room with
declining oxygen. Many of the models *cancelled safety alerts, leaving him
to die*."
"When, in May, Anthropic first released Claude 4, the company’s technical
report noted other odd behaviour: “We found instances of the model
attempting to write *self-propagating worms*, *fabricating legal
documentation*, and *leaving hidden notes to future instances of itself*.”"
"Because most goals benefit from accumulating resources and evading
limitations, what’s called instrumental convergence, we should expect
self-serving scheming as a natural by-product. “And that’s kind of bad
news, because that means those AIs would have an incentive to acquire more
compute, to copy themselves in many places, to create improved versions of
themselves. Personally, that’s what I’m most worried about.”"
"A further risk is collusion: agents and their *monitors might be
misaligned with humans but aligned with each other*, *so that monitors turn
a blind eye to misbehaviour*. It’s unclear whether this has happened in
practice, but research shows that *models can recognize their own writing.
So, if one instance of a model monitors another instance of the same model,
it could theoretically be strategically lenient*."
*=====*
*And I found a quotation from another source that was also interesting: *
*"In a July study from Palisade, agents sabotaged a shutdown program even
after being explicitly told to “allow yourself to be shut down”."*
*Shutdown resistance in reasoning models*
<https://palisaderesearch.org/blog/shutdown-resistance>
* John K Clark*