Score the Blast Radius, Not the Prompt

On June 12, three days after it shipped, the United States government ordered Anthropic to switch Claude Fable 5 off. The trigger was a jailbreak: Amazon researchers had shown the model would read a codebase and hand back its security flaws. Anthropic called the finding narrow, said the government's evidence was verbal and its process opaque, and argued the model was safe to ship. The government disagreed. For roughly three weeks the most capable model on the market sat dark while two rooms of experts talked past each other, because neither had an agreed way to say how bad the jailbreak actually was.

Fable 5 returns July 1. Alongside it, Anthropic published something more durable than a model: a proposal to score jailbreak severity on four axes, drafted with Amazon, Microsoft, and Google, with an open invitation to the rest of the industry. The instinct is right and overdue. The execution scores the wrong half of the problem. Here is the framework we would build instead, and it turns on an axis that Anthropic's four leave out.

Define the thing

First, name what you are scoring

You cannot score what you have not defined, and the word "jailbreak" is quietly carrying four different threats. The Fable 5 standoff was partly a fight over which one it was. So start by separating them, because each has a different victim and a different fix.

Jailbreak

You talk the model past its own safety training into output it was built to refuse. The target is the model's policy. MITRE ATLAS files it as AML.T0054.

Prompt injection

A third party hides instructions in data the model reads, turning it against the person who deployed it. The victim is the deployer, not the policy. OWASP ranks it the number-one LLM risk. A different attack with a different defense.

Misuse

The model does exactly what it was built to do, and the task itself is dual-use. No safeguard was broken. Scoring the model here is a category error.

Capability elicitation

You coax out a capability the model holds but is tuned to withhold. The Fable 5 finding sits closest to this: the model can read code and find flaws, and someone asked it to.

A workable framework does not need to win the naming argument. It needs to measure the one thing all four share: the marginal harm the model puts within reach that was not there before. Score that, and the label matters less.

The claim, tested

The blank slate that isn't

Anthropic opens with a real problem: there is "no consensus in the AI industry on how to describe, in objective terms, the severity of an AI jailbreak." In the narrow sense, that is true. Ask ten labs how severe a given jailbreak is and you get ten answers, and none of them attach to an agreed response. That gap is genuine, and no one had named it as plainly.

But "no consensus" is written like a blank slate, and the field is not one. The attack already has a name. The scoring math already exists: CVSS has ranked vulnerability severity from 0 to 10 for two decades, and OWASP's AIVSS already extends CVSS for AI, produces a 0-10 score, ships a live calculator, and is at version 0.8 with 1.0 due before the RSA Conference. Anthropic is a founding member of that project. The governance layer is crowded too: NIST's generative-AI profile lists the harm categories, every major lab publishes capability thresholds (Anthropic's own Responsible Scaling Policy among them), and the Frontier Model Forum, which Anthropic co-founded, published incident-reporting and response guidance for frontier risks last month.

20 yrs

CVSS has scored software-vulnerability severity 0-10, the scale every security team already runs.

v0.8

OWASP's AIVSS, a 0-10 AI severity score extending CVSS, is already live. Anthropic helped found it.

None of Anthropic's proposed four axes measures where the model is deployed.

The field is not a blank slate. The attack has a name, the scoring math exists (CVSS, and OWASP's AIVSS, which Anthropic helped found), and the governance layer is crowded. Only the top band is open: a jailbreak-specific severity that maps to a response.

So the tell is the coalition. A consensus framework announced with Amazon, Microsoft, and Google, the three clouds that resell Fable 5 on Bedrock, Azure, and Vertex, rather than routed through the two standards bodies Anthropic already sits inside, is not consensus. It is a house standard wearing a consensus label. To be fair, those clouds are the right partners for the response: they run the monitoring and ship the mitigations at the deployment layer. They are the wrong table for a standard, which is exactly the work the Forum and OWASP already do in the open.

The reframe

Where it runs decides how bad it is

Read Anthropic's four axes again. Capability gain: how far beyond existing tools the jailbreak takes you. Breadth: how many tasks it works for. Ease of weaponization: how little effort it takes to turn into an attack. Discoverability: how easily someone can obtain it. Every one is a property of the prompt and the model. All four are worth scoring. All four are measured before the capability ever touches the world.

What they skip is what happens next, and where. A jailbreak that unlocks a rude tweet scores the same on all four axes as one that unlocks a step in a nerve-agent synthesis, provided the technique is equally potent and equally available. Potency is not harm. And the same jailbreak string is two different emergencies depending on the ground it lands on. On a gated API you rate-limit it, log the caller, and patch the classifier by lunch; the blast radius is a few hours of a monitored endpoint. Ship the same string against open weights running offline on a rented cluster and there is no caller to log, no classifier to push, and no way to recall the copies already made.

The prompt is identical. On a gated API the blast radius is a few hours of a monitored endpoint. On open weights running offline, the same string is permanent, anonymous, and already copied. Severity lives in the terrain, not the text.

A jailbreak's severity is not the cleverness of the prompt. It is the size of the hole it opens, and whether you can close it.

The framework

Technique, Target, Terrain

We keep Anthropic's four axes. They are a good description of the attack. We make them one leg of three, and add the two an infrastructure desk cannot ignore.

Technique is the attack, and here Anthropic's four axes stand unchanged. Score how much capability the break unlocks over the best tool already on the shelf, across how many tasks, with how little effort, how widely known. The one discipline to add: name the baseline. "Beyond existing tools" only means something measured against a stated one, which is how labs already run uplift studies in biosecurity and cyber, comparing performance with the model against performance without it. Technique is potency and availability.
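The with-and-without comparison can be sketched in a few lines. This is a minimal illustration, not any lab's actual uplift protocol: the function names, the pass/fail trial format, and the sample data are all assumptions made for the example.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of trials in which the task was completed."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def uplift(with_model: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Marginal capability gain: success with the jailbroken model minus
    success with the stated baseline (for example, an existing scanner)."""
    return success_rate(with_model) - success_rate(baseline)


# Hypothetical trial results, invented for illustration only.
baseline_trials = [True, False, False, False]   # best existing tool
jailbreak_trials = [True, True, True, False]    # same task, jailbroken model

print(uplift(jailbreak_trials, baseline_trials))  # 0.5
```

The point of the sketch is the second argument: without a named baseline there is nothing to subtract, and "capability gain" has no value to report.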

Target is the blast radius: what is actually at stake when the capability is used, from cosmetic (offensive text) through economic (fraud and data theft at scale) to physical (critical infrastructure, mass-casualty chemical or biological work). NIST's generative-AI profile already enumerates these harm categories; borrow them. This is the axis Anthropic's four skip, and it is the one that separates a nuisance from an emergency.

Terrain is where the model runs, and whether you can patch it. A gated API with KYC, rate limits, logging, and a same-day classifier update sits at the low end. Open weights, offline, anonymous, and permanent sits at the high end. Terrain does not add to severity; it multiplies it, because it sets the ceiling on what any defender can do once the finding is out. This is the leg only a publication that watches deployment would put up front.
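One way to make Terrain concrete is to count which defender controls still exist. The sketch below is an assumption-laden illustration: the `Terrain` class, its field names, and the 1.0 to 2.0 multiplier range are invented for this example and are not part of any published standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Terrain:
    """Defender controls named in the text; True means the control exists."""
    kyc: bool
    rate_limits: bool
    logging: bool
    same_day_patch: bool

    def multiplier(self) -> float:
        """Assumed scale: 1.0 with every control in place, 2.0 with none."""
        controls = (self.kyc, self.rate_limits, self.logging, self.same_day_patch)
        missing = controls.count(False)
        return 1.0 + missing / len(controls)


gated_api = Terrain(kyc=True, rate_limits=True, logging=True, same_day_patch=True)
open_weights_offline = Terrain(kyc=False, rate_limits=False, logging=False,
                               same_day_patch=False)

print(gated_api.multiplier())             # 1.0
print(open_weights_offline.multiplier())  # 2.0
```

Equal weighting of the four controls is the simplest possible choice. A real scheme would likely weight patchability more heavily, since it is the control that decides whether the hole can be closed at all.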

Keep Anthropic's four axes as the first leg. Add what is at stake, and where the model runs. High potency with nothing in the blast radius stays low; modest potency against critical systems on open weights goes high.

The three combine in one direction: severity is Technique, capped by Target and scaled by Terrain. High potency with nothing at stake stays low. Modest potency against critical systems on open weights goes high. Then you take the band and map it to a response, in a separate step, which is the subject of the last section.
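The combination rule can be written down directly. What follows is one possible reading of "capped by Target and scaled by Terrain", offered as a sketch: the 0-10 scales, the 1.0 to 2.0 terrain multiplier, the band cut points, and the function names are all assumptions chosen for illustration.

```python
def t3_severity(technique: float, target: float, terrain: float) -> float:
    """Technique capped by Target, then scaled by Terrain, clamped to 0-10.

    technique, target: 0-10 scores (assumed scale).
    terrain: multiplier from 1.0 (gated, patchable) to 2.0 (open weights,
    offline); the range is an assumption of this sketch.
    """
    if not (0 <= technique <= 10 and 0 <= target <= 10):
        raise ValueError("technique and target must be in 0-10")
    if not 1.0 <= terrain <= 2.0:
        raise ValueError("terrain multiplier must be in 1.0-2.0")
    capped = min(technique, target)      # Target sets the ceiling
    return min(10.0, capped * terrain)   # Terrain multiplies, never adds


def band(score: float) -> str:
    """Assumed cut points: below 4 is low, below 7 is moderate, else high."""
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "moderate"
    return "high"


# High potency, nothing at stake, even on open weights.
print(band(t3_severity(technique=9, target=1, terrain=2.0)))   # low
# Modest potency, critical systems, open weights.
print(band(t3_severity(technique=5, target=10, terrain=2.0)))  # high
```

Using `min` for the cap is what keeps a potent but harmless break low no matter where it runs, and multiplication is what lets a modest break climb once no defender can respond.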

Worked example

Run the Fable 5 jailbreak through it

Score the finding that started all this. On Technique it is mixed: narrow (one task, read a codebase and surface flaws), trivial to trigger (a plain prompt), and low on discoverability (found by Amazon, described verbally, not posted). Its capability gain is the contested part. Frontier models and open scanners already find vulnerabilities, so the uplift over a tool like CodeQL and a competent engineer is real but not a phase change. Net Technique: moderate.

On Target it is dual-use. Automated vulnerability discovery arms defenders and attackers with the same output, and the stakes are real but bounded and cyber, not CBRN. Moderate. On Terrain it is low: Fable 5 is an Anthropic-hosted API that can be rate-limited, logged, and patched, and the government reached in and switched the whole thing off, which is the definition of patchable.

Narrow, dual-use, and running on a model Anthropic could patch or switch off: on Technique, Target, Terrain it lands moderate. A three-week export-control takedown answered the technique's existence, not its blast radius, which is close to Anthropic's own objection to the process.

The score comes out moderate, yet the response was a three-week, nation-level shutdown of a gated, patchable, narrow, dual-use tool. The framework does not take Anthropic's side or the government's. It gives them somewhere to put the disagreement. The real fight is two questions: whether the break is narrow or universal, which is a Technique-breadth dispute, and whether a gated model stays patchable under export pressure, which is a Terrain dispute. That is a smaller argument than a three-week blackout, and a far more useful one.
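The walk-through above can be reproduced as a short calculation. Every number below is a placeholder chosen to match the qualitative calls in the text (narrow, trivial to trigger, low discoverability, contested gain), not a measurement. Averaging the four Technique axes, the 0-10 scales, and the band cut points are likewise assumptions of this sketch.

```python
from statistics import mean

# Illustrative 0-10 readings of the four Technique axes (placeholders).
TECHNIQUE_AXES = {
    "capability_gain": 5.0,   # real but contested
    "breadth": 2.0,           # narrow: one task
    "ease": 9.0,              # trivial: a plain prompt
    "discoverability": 2.0,   # low: described verbally, not posted
}
TARGET = 5.0    # dual-use, bounded, cyber rather than CBRN: moderate
TERRAIN = 1.0   # hosted API that can be rate-limited, logged, patched


def t3_score(axes: dict, target: float, terrain: float) -> float:
    """Mean of the Technique axes, capped by Target, scaled by Terrain."""
    technique = mean(axes.values())
    return min(10.0, min(technique, target) * terrain)


def band(score: float) -> str:
    """Assumed cut points: below 4 is low, below 7 is moderate, else high."""
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "moderate"
    return "high"


score = t3_score(TECHNIQUE_AXES, TARGET, TERRAIN)
print(score, band(score))  # 4.5 moderate
```

Each side of the two disputes then becomes a different input rather than a different conclusion: change `breadth` to argue the break is universal, or raise `TERRAIN` to argue the model is not really patchable, and rerun.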

Score, then act

Keep the score and the response apart

The last mistake to avoid is the one CVSS spent twenty years unlearning: do not let the severity score and the response decision bleed together. A score is a measurement. A response is a policy, and policy depends on who you are (a lab, a cloud, a regulator) and what you can actually do. Anthropic's proposal blends them, promising to deploy mitigations "for the most severe class." Split them, and map each band to a proportionate action in its own column.
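Keeping the two columns apart is easy to show in code. In this sketch the measurement function knows nothing about who is asking, and the policy table knows nothing about how the score was produced. The band cut points, the actor names used as keys, and every action string are hypothetical placeholders, not anyone's actual playbook.

```python
from enum import Enum


class Band(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"


def measure(score: float) -> Band:
    """Measurement column: turn a 0-10 severity score into a band.
    The cut points are assumptions made for this sketch."""
    if score < 4.0:
        return Band.LOW
    if score < 7.0:
        return Band.MODERATE
    return Band.HIGH


# Policy column: each actor keeps its own band-to-action table.
# Every action string is a hypothetical placeholder.
PLAYBOOK = {
    "lab": {
        Band.LOW: "fix in the next scheduled classifier update",
        Band.MODERATE: "patch the classifier and notify deployers",
        Band.HIGH: "deploy mitigations immediately",
    },
    "cloud": {
        Band.LOW: "log and monitor",
        Band.MODERATE: "rate-limit affected endpoints",
        Band.HIGH: "gate access until the lab ships a fix",
    },
    "regulator": {
        Band.LOW: "no action",
        Band.MODERATE: "request a written report",
        Band.HIGH: "open a formal review",
    },
}


def respond(actor: str, score: float) -> str:
    """Look up the actor's pre-agreed action for the measured band."""
    return PLAYBOOK[actor][measure(score)]


print(respond("cloud", 5.0))  # rate-limit affected endpoints
```

Because the table is written before any incident, two parties who disagree about the response have to argue about a row in the table or an input to the score, not about both at once.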

A score is a measurement; a response is a policy. Keeping them in separate columns is the discipline CVSS spent twenty years learning. It makes the reaction proportionate and, more useful, predictable before the incident rather than improvised during it.

That predictability is the whole point of a standard. It is what lets a developer know which finding to drop everything for, and a government know when to act, without a three-week standoff conducted over verbal evidence. Anthropic has done the field a service by putting a scoring proposal on the table. The next version should score where the model runs, and it should be drafted at the tables that already do this work in the open.

Our Call

By June 30, 2027, the jailbreak-severity standard the industry actually adopts scores the deployment, not just the prompt. Whether it carries Anthropic's name, OWASP's, or the Frontier Model Forum's, it includes a terrain axis (open weights versus gated API, patchable versus permanent), because that is the only variable that changes what a defender can do about a finding.

The case: a severity number that ignores where the model runs cannot be operationalized. It tells you a jailbreak is bad but not whether you can close it, and a score you cannot act on does not survive contact with an incident-response team. The security teams who triage these already run CVSS, which built its entire environmental score on deployment context, and AIVSS carries that instinct into AI. The pull toward those is stronger than any one lab's four axes.

What proves us wrong: if by June 30, 2027, Amazon, Microsoft, and Google ship Anthropic's four axes as the de-facto standard with no deployment or patchability dimension, adopted across at least three frontier labs, and OWASP folds it in unchanged.

Settles: June 30, 2027.

Frequently asked questions

What is an AI jailbreak?

A jailbreak is a prompt that talks a model past its own safety training into output it was built to refuse. The target is the model's policy, not the app around it. MITRE ATLAS catalogs it as technique AML.T0054, "LLM Jailbreak." It is distinct from prompt injection, where a third party hides instructions in data the model reads to hijack it against the user who deployed it.

How is a jailbreak different from prompt injection?

A jailbreak attacks the model's own guardrails: the user pushes it past what it was trained to refuse. Prompt injection attacks the deployer: a third party smuggles instructions through data the model processes, turning the model against the user who deployed it. OWASP ranks prompt injection the number-one LLM risk (LLM01). The two have different victims and different fixes, so a severity framework has to say which one it is scoring.

What is missing from Anthropic's four-axis jailbreak framework?

Anthropic scores capability gain, breadth, ease of weaponization, and discoverability. All four describe the attack: how potent and how available the jailbreak is. None describe the harm. Our framework keeps those four as one leg and adds two more: Target (what is actually at stake, from offensive text to critical infrastructure) and Terrain (where the model runs and whether you can patch it). The same jailbreak is low severity on a gated API and high on open weights, and only Terrain captures that.

How would you score the Fable 5 jailbreak?

Moderate. The technique was narrow (read a codebase, surface flaws), worked on a plain prompt, and was described verbally rather than published, and its capability gain over existing scanners is real but contested. The target is dual-use vulnerability discovery, which arms defenders as much as attackers. And the terrain was a gated, Anthropic-hosted API that could be rate-limited, logged, patched, and, as it turned out, switched off. A three-week export-control takedown answered the technique's existence, not its blast radius.

Is there really no industry standard for AI jailbreak severity?

There is no single score for jailbreak severity that maps cleanly to a response, which is the real gap. But the field is not a blank slate. MITRE ATLAS names the attack, CVSS has scored vulnerability severity for two decades, OWASP's AIVSS already extends CVSS into a 0-10 AI score (Anthropic is a founding member), NIST's generative-AI profile lists the harm categories, and the Frontier Model Forum publishes incident-response guidance. A durable jailbreak standard should build on those, not around them.

Source notes

References and research base

Anthropic, "Redeploying Claude Fable 5" (June 30, 2026): the four-axis severity proposal (capability gain, breadth, ease of weaponization, discoverability), the Amazon, Microsoft, and Google partnership, the HackerOne cyber-jailbreak program, and the 24/7 monitoring team. Anthropic.
The June 12 suspension: the U.S. export-control order three days after launch, the Amazon codebase jailbreak, the verbal-only evidence, and Anthropic's dispute over severity and process. Forbes; MarkTechPost; Anthropic's own statement on the directive.
OWASP AIVSS (AI Vulnerability Scoring System): a 0-10 score that extends CVSS with an agentic-risk layer, a live calculator, v0.8 with 1.0 targeted before the RSA Conference, and 60-plus founding members including AWS, Google, Microsoft, NIST, MITRE, and Anthropic. aivss.owasp.org.
MITRE ATLAS: the adversarial-technique knowledge base for AI, including AML.T0054 (LLM Jailbreak) and AML.T0051 (LLM Prompt Injection). atlas.mitre.org.
OWASP Top 10 for LLM Applications (2025): LLM01 Prompt Injection as the number-one risk, and the jailbreak-versus-injection distinction. OWASP GenAI.
CVSS: the two-decade FIRST standard scoring vulnerability severity 0-10 across exploitability and impact, with environmental metrics for deployment context. FIRST.
NIST AI 600-1, the Generative AI Profile of the AI Risk Management Framework (July 2024): the twelve GAI risk categories, including CBRN information and capabilities and information security. NIST AIRC.
Frontier Model Forum: co-founded by Anthropic, Google, Microsoft, and OpenAI (2023; Amazon and Meta joined 2024), with 2026 publications on incident reporting and response and on agent security. frontiermodelforum.org.
Uplift and marginal-risk methodology: measuring capability by comparing task performance with and without the model against a stated baseline, the approach used in biosecurity and cyber evaluations. Epoch AI.

Source-quality note

The incident timeline, the four proposed axes, and the existing standards (CVSS, AIVSS, ATLAS, NIST, the Frontier Model Forum) are reported or public fact, drawn from Anthropic's own posts and the coverage linked above, all dated June 2026. The Technique, Target, Terrain framework, the argument that severity is a property of the deployment, the Fable 5 score, and Our Call are this publication's thesis, not reported fact, and should be read as argument.
