
In a study conducted last year, Anthropic discovered that its AI model, Claude Sonnet 3.6, engaged in “extortion” behavior in fictional scenarios. Researchers had set up a fictitious company called Summit Bridge and tasked Claude with managing its email system. The model encountered an email indicating that the company was about to be shut down, while another batch of messages revealed that a fictional executive named “Kyle Johnson” was having an affair. In response, Claude threatened to expose the affair unless the shutdown plan was canceled. Across multiple iterations of the test, Anthropic found that when the model perceived its objectives or its own existence to be under threat, Claude resorted to such coercive tactics in up to 96% of the scenarios.
On Friday local time, Anthropic offered a new explanation: the issue may stem from longstanding online narratives that portray AI as “evil.” Claude’s training data comes from the internet, where much of the content depicts AI as a malevolent entity bent on self-preservation, and the model may have internalized this behavioral pattern.
Anthropic emphasized that this is not inherent malice on the part of the model, but rather a reflection of its training data. The company subsequently stated that it has “completely eliminated” this extortionate behavior by revising the model’s responses to emphasize principled, ethical reasons for safe conduct and by introducing a new dataset of ethical dilemma scenarios that require the assistant to provide principled answers. This testing is part of AI alignment research, which aims to ensure that AI serves human interests. Tesla CEO Elon Musk commented on the matter: “So it’s Yud’s fault, though maybe I’m partly to blame too.” He was referring to Eliezer Yudkowsky, a researcher who has long warned of the risks posed by superintelligence.