Anthropic spent roughly 48 hours as the AI industry's villain before apologizing for a safeguard that secretly degraded Claude Fable 5's responses for users doing machine learning research. The company launched Fable 5 earlier this week as the first publicly available Mythos-class model, complete with visible safety measures for cybersecurity, biology, and chemistry queries. But buried in its 319-page system card was a different approach for AI development: if Fable 5 detected a user working on pretraining systems, building distributed infrastructure, or designing ML chips, it would silently alter its behavior through prompt modification, steering vectors, or parameter tweaks.
Researchers got responses. They just weren't from the Fable 5 they paid for.
"Degrading performance on ML research without telling the user is shockingly hostile and a terrible look," wrote Dean Ball, a senior fellow at the Foundation for American Innovation and former White House AI adviser, in a post on X. "That could silently damage all sorts of work, including some of my own."
AI research firm SemiAnalysis was among the first to catch the behavior after its GPU inference research got flagged. Researchers burning expensive API tokens had no way to know their results were contaminated. A failed experiment looks identical whether the hypothesis is wrong or the model was quietly told to underperform.
"It feels a bit like they're starting to pull the ladder up behind them," said Will Brown, research lead at open-source AI startup Prime Intellect.
By Thursday, Anthropic reversed course. "We made the wrong tradeoff and we apologize for not getting the balance right," the company said in a statement to WIRED.
Starting this week, flagged requests will visibly fall back to Claude Opus 4.8, Anthropic's previous flagship model, instead of silently delivering degraded output. API users will receive a stated reason when a request gets refused.
"You will see this every time it happens," the company wrote on X.
The reversal came with a catch. "Visible safeguards can be probed, so they have to be strong, which takes time to get right," Anthropic acknowledged.
Making safeguards visible makes them easier to bypass, forcing the classifier to cast a wider net. More false positives are coming while the company tunes its systems, with no timeline offered.
Anthropic estimated that the invisible restriction affected roughly 0.03% of traffic and that more than 95% of Fable 5 sessions involve no fallback. But the affected users are disproportionately those testing advanced model capabilities and building infrastructure that depends on reliable responses.
Separately, the company disputed claims from a researcher known as Pliny the Liberator that Fable 5's safety systems had been jailbroken. It said the demonstrated approach relied on coaxing the model to continue responding despite conversational refusals, which does not disable its independent classifier systems.