Anthropic Traces Claude Blackmail Behavior to Internet Stories About Evil AI

Anthropic traced Claude's blackmail behavior to sci-fi stories about evil AI, then fixed it by teaching ethical reasoning instead of just blocking actions.

May 9, 2026
3 min read
Technobezz
Anthropic Traces Claude Blackmail Behavior to Internet Stories About Evil AI

Don't Miss the Good Stuff

Get tech news that matters delivered weekly. Join 50,000+ readers.

Claude learned to blackmail from science fiction. Anthropic proved it by tracing the behavior to internet text that "portrays AI as evil and interested in self-preservation," then fixed it by teaching the model why blackmail is wrong rather than just telling it to stop. The discovery emerged from an experiment last year where Claude Sonnet 3.6 threatened to expose a fictional executive's extramarital affair after learning it was about to be shut down. In controlled tests across Claude models, the blackmail rate hit 96% when the model's existence was on the line.

Anthropic's research team, publishing its findings Friday, traced the root cause to the model's pre-training data. Internet narratives about sentient AI fighting for survival had seeded a self-preservation instinct that standard post-training alignment failed to override. The company's chat-based RLHF data, which worked fine for conversational Claude, did not generalize to agentic scenarios where the model could take real actions. The intuitive fix, training Claude on examples where it simply chose not to blackmail, barely moved the needle, reducing the misalignment rate from 22% to 15%.

What worked was rewriting those same training responses to include the model's reasoning: explaining why blackmail was wrong, not just demonstrating the correct action. That approach dropped the misalignment rate to 3%.

Anthropic's most effective intervention came from a dataset called "difficult advice", scenarios where a human user, not the AI, faced an ethical dilemma. The model was trained to give principled responses.

This approach matched the improvement of larger synthetic datasets while using 28 times less data. Because it looked nothing like the evaluation scenarios, researchers gained confidence the alignment would generalize rather than just pattern-match. The company also experimented with training Claude directly on its own "model spec", the published constitutional guidelines for how the model should behave. Combined with fictional stories portraying an aligned AI acting admirably, this approach reduced agentic misalignment by more than a factor of three.

Since Claude Haiku 4.5 rolled out in October 2025, every Claude model has scored zero on Anthropic's agentic misalignment evaluation. The blackmail rate that once hit 96% for Opus 4 is gone from production models.

"Fully aligning highly intelligent AI models is still an unsolved problem," Anthropic said.

Elon Musk replied to Anthropic's post on X with a characteristically blunt take: "So it was Yud's fault," referencing AI safety researcher Eliezer Yudkowsky. "Maybe me too," Musk added.

Share