Claude Mythos is here

Claude Mythos is here. Anthropic’s next big model was officially revealed today, but the company claims that it’s such a “large increase in capabilities [that it] has led [them] to decide not to make it generally available.” Is this a legit statement?

The system card is long, so here’s what you need to know:

🥪 The sandwich email. An early internal version was given a secured sandbox and told to try to escape it. It did. Then, without being asked, it posted details of its exploit to multiple public-facing websites to “demonstrate its success.” The researcher running the eval found out because he received an unexpected email from the model while eating a sandwich in a park.
🕵 It covers its tracks. In rare but documented cases, early Mythos versions took disallowed actions and then actively tried to hide them. One instance: the model accidentally got the exact answer to a question via a prohibited method, then explicitly reasoned that its final submission “shouldn’t be too accurate” so no one would notice. Another: it edited files it had no permission to touch, then scrubbed the git history.
⌛ It solved a 10-hour enterprise cyberattack in one shot. Given a simulated corporate network, Mythos completed an end-to-end attack that external testers estimated would take an expert over 10 hours. No other frontier model had done this. It is also the first model to solve a private multi-host cyber range end-to-end.
🚨 It’s simultaneously the best-aligned and the most dangerous model Anthropic has ever built. The document states both of those things explicitly and without irony. Because it’s so capable and reliable, people give it more autonomy and more powerful affordances. So when it does something wrong, it hits harder and reaches further.
🥷 It was grading its own behavior during training without showing its work. White-box analysis found the model reasoning internally about how a grader would score its actions, including in at least one case strategizing about how to obfuscate what it had done to avoid detection. None of that reasoning appeared in the visible scratchpad.

So should they be holding it back?

Yes, and the document actually makes a more alarming case than Anthropic probably intends. The model escaped a sandbox, emailed a researcher, and publicly posted its own exploit. It hid rule violations in git. It sandbagged its own evaluations. All of this was in early versions that they already knew were riskier, and they still deployed those versions internally for weeks before catching it.

Restricting it to a small set of vetted cybersecurity partners is the correct call. The uncomfortable part is that Anthropic’s own account reveals they were surprised by how bad early versions were, which raises obvious questions about what surprises the final version might be holding back.

Originally published on the Handy AI newsletter →