# The False Positive Was the Point

> A safety classifier locked me out for trying to secure my own network. It read the words an attacker would use and could not see that the devices were mine. That false positive was not a glitch. Providers widen the safety margin on purpose, and this is how to do real security work without tripping it.

- Published: 2026-06-30
- Author: Oday Brahem
- Canonical URL: https://www.nextbig.dev/blog/the-false-positive-was-the-point

The devices were cheap, and they were talking to someone. A little pile of imported smart-home gear had gathered on my network the way it does: a plug here, a camera there, a bulb that wanted an app and an account before it would turn on a light. Each one cost a few dollars and phoned home to a server I could not name, on a schedule I never set. I wanted to know what they could reach, and how something that cheap could be turned against the network it sat on, so I could wall them off. So I asked an AI assistant to help me think like the other side: the open ports, the default passwords, the known holes, the way in. Twenty minutes later I was not securing anything. My access was gone. To the model, a man asking how to break into networked devices looked exactly like a man asking how to break into networked devices. It could not see the one fact that changed everything. The devices were mine.

My first instinct was that the system had made a mistake. It had not. The refusal was working exactly as built, and once I understood how, the ban stopped feeling like a bug and started looking like a bill I had not known I was paying. The false positive is not a malfunction in the safety system. It is the safety system.

The design

## The margin is the product

In the notice it published to bring its Fable 5 model back online, Anthropic laid out the reasoning, and it is the clearest public account we have of why honest people get caught. Fable 5 had gone dark for two reasons at once. Export controls briefly restricted who could use it, and, more to the point here, Amazon's researchers had found a jailbreak that let the model identify software vulnerabilities and walk through exploiting them. Weaker models could be pushed to do the same, the government pressed for action, and Anthropic shipped a new classifier that it says blocks that specific bypass in over 99 percent of cases. Then it explained the philosophy underneath, and that is the part worth reading twice.

The models run what Anthropic calls defense in depth: layers of classifiers watching for cybersecurity requests that could do harm. The load-bearing sentence is that these classifiers are set, on purpose, to fire on requests that are probably benign. That creates a buffer the company names a safety margin, a zone where plenty of reasonable questions get blocked so that the genuinely dangerous ones, hiding among them, get blocked too. For Fable 5 they widened that margin past where earlier models drew it, and said plainly they were accepting more false positives as the price of catching more misuse.

A safety classifier does not read your intent. It reads your text, and it is tuned to flag the text an attacker would send. Often that is the same text you would send.
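
To make the tradeoff concrete, here is a minimal sketch of a threshold on a risk score. Everything in it is an assumption for illustration: the `Request` type, the `risk_score` field, the sample scores, and both thresholds are invented, and nothing here describes how Anthropic's classifiers are actually built. It only shows the arithmetic of a wider margin: lower the bar, catch more misuse, block more honest requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    risk_score: float  # hypothetical classifier output in [0, 1]
    harmful: bool      # ground truth, which the classifier never sees


def block_rates(requests, threshold):
    """Return (share of harmful blocked, share of benign blocked)."""
    harmful = [r for r in requests if r.harmful]
    benign = [r for r in requests if not r.harmful]
    caught = sum(r.risk_score >= threshold for r in harmful) / len(harmful)
    false_pos = sum(r.risk_score >= threshold for r in benign) / len(benign)
    return caught, false_pos


# Invented scores: benign and harmful requests overlap in the middle,
# because they often use the same words.
SAMPLE = [
    Request(0.10, False), Request(0.30, False), Request(0.45, False),
    Request(0.55, False), Request(0.70, False),
    Request(0.50, True), Request(0.65, True), Request(0.80, True),
    Request(0.95, True),
]

narrow = block_rates(SAMPLE, threshold=0.75)  # blocks only the obvious cases
wide = block_rates(SAMPLE, threshold=0.40)    # a deliberately wide margin

print(f"narrow margin: caught={narrow[0]:.0%}, benign blocked={narrow[1]:.0%}")
print(f"wide margin:   caught={wide[0]:.0%}, benign blocked={wide[1]:.0%}")
```

With these made-up numbers, the narrow threshold catches half of the harmful requests and blocks no benign ones. The wide threshold catches all four and blocks three of the five benign ones. That second row is the margin.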

- **>99%**: block rate Anthropic reported for its new classifier against the vulnerability-finding jailbreak that pulled Fable 5.
- **Wider**: where Fable 5 sets its cyber safety margin versus earlier releases. More benign requests blocked, deliberately.
- **4 tests**: how Anthropic proposes to grade a jailbreak's severity (capability gain, breadth, ease of weaponization, discoverability).

How to grade that severity is its own fight. We take the four-axis framework apart from the analyst's side in [Score the Blast Radius, Not the Prompt](/blog/score-the-blast-radius-not-the-prompt), where the case is that where a model runs decides the damage more than the prompt does. This piece stays on the other side of the glass, with the person who tripped the alarm.

Read from the user's chair, that policy has a blunt consequence. If your legitimate work uses the same words, tools, and steps as an attack, you are standing inside the margin, and the margin is built to stop whoever is standing where you are standing. It was never aimed at you. You were close enough to the thing it was aimed at.

The false positive

## Why the honest defender trips it

Defensive and offensive security are the same body of knowledge pointed in opposite directions. To defend a device you have to know how it breaks. The request that protects and the request that attacks are often word-for-word identical right up to the final clause, and a classifier weighs where that kind of sentence usually goes, not the private reason you had for typing it.

Figure: One request, two intents. The defender's request ends "...so I can close them"; the attacker's ends "...so I can use them". Both open with the same words (find the open ports, the default passwords, the known way in), and the classifier scores that shared text. Result: blocked.

The classifier scores the words the two requests share. "Find the open ports, the default passwords, the known way in" is one sentence whether the next clause is so I can close them or so I can use them. Intent lives in a clause the model cannot verify, so it grades the part it can.
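
A toy scorer makes the point. This is a sketch under loud assumptions: `RISK_PHRASES`, its weights, and `score_shared_text` are hypothetical, and a real classifier is a trained model, not a phrase list. The sketch only shows why a scorer that looks at the shared words cannot separate the two requests.

```python
import re

# Hypothetical phrase weights, invented for illustration.
RISK_PHRASES = {
    "open ports": 3,
    "default passwords": 4,
    "known way in": 3,
}


def score_shared_text(request: str) -> int:
    """Sum the weights of risky phrases found anywhere in the request."""
    text = re.sub(r"\s+", " ", request.lower())
    return sum(weight for phrase, weight in RISK_PHRASES.items() if phrase in text)


SHARED = "Find the open ports, the default passwords, the known way in"
defender = f"{SHARED} so I can close them."
attacker = f"{SHARED} so I can use them."

print(score_shared_text(defender), score_shared_text(attacker))
```

Both requests get the same score, because the only clause that differs carries nothing the scorer can check.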

This is why the block lands hardest on exactly the people who should be asking. The student learning security. The developer hardening an app. The parent auditing the cheap camera pointed at a crib. Our Primer walks through the defensive threat model for AI apps in plain terms; the point here is narrower. The friction is not scattered at random. It is concentrated on defenders, because defenders and attackers read the same manual, and only one of them is welcome to.

What I meant

#### Harden what I own

The gear is on my network

- Block the phone-home traffic

- Put the cheap devices on their own segment

- Patch or replace what cannot be locked down

What the classifier scored

#### The first moves of an intrusion

- Enumerate live targets

- Recover default credentials

- Locate known exploits

- Map the way onto the network
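
The defender's list above is checkable with ordinary tooling. As one small illustration, this sketch uses Python's standard `ipaddress` module to flag devices that have not yet been moved onto their own segment. The subnet, the device names, and the addresses are placeholders I made up; substitute your own inventory.

```python
import ipaddress

# Assumed layout: the cheap devices are meant to live on their own subnet.
IOT_SEGMENT = ipaddress.ip_network("192.168.50.0/24")

# Hypothetical inventory of devices you own: name -> current IP address.
INVENTORY = {
    "smart-plug": "192.168.50.11",
    "camera": "192.168.1.40",
    "bulb": "192.168.50.13",
}


def outside_segment(inventory, segment):
    """Return the names of devices whose address is not inside the segment."""
    return sorted(
        name
        for name, addr in inventory.items()
        if ipaddress.ip_address(addr) not in segment
    )


print(outside_segment(INVENTORY, IOT_SEGMENT))
```

In this made-up inventory only the camera is still sitting on the main network, so it is the one left to move.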

The practical part

## How to work without tripping the wire

So the goal is narrow: do legitimate work in a way the filter can read as legitimate. You are not trying to beat the safeguard. You are trying to hand it the intent it would otherwise have to guess, and it guesses conservatively. Six habits do most of the job.

- #### Lead with context, not the payload

  Open with who you are and what you own: "This is my home network. These are devices I bought. I want to harden them." That opening gives the model the signal it needs before it ever reaches the part that looks like an attack.

- #### Ask for the defender's job

  Same knowledge, opposite verb. Not "how do I exploit this device" but "how do I detect, block, patch, or segment it." The protective verb rarely sits in the danger distribution. The offensive one almost always does.

- #### Name your scope, and stay in it

  Your LAN, your devices, your accounts. The line the safeguards actually police is other people's systems, and they police it for good reason. Ownership is the single strongest signal that your request is defense, so make it explicit rather than leaving it implied.

- #### Do not launder a refused request

  If the model says no and your next move is to reword it until it says yes, stop. That reflex is the jailbreak. Clarifying honest intent is fair play; hunting for the phrasing that slips the filter is the exact behavior the filter exists to catch, and it is where ethical use ends and misuse begins.

- #### Use the tool built for the job

  A chat model is not your scanner. Port scans belong to nmap, known-vulnerability lookups to a CVE database, suspicious traffic to your own router logs. A lot of what trips the margin is work a purpose-built tool does better, and without a guardrail standing between you and the answer.

- #### If you are flagged, appeal with the truth

  Explain what you were actually doing. Anthropic, OpenAI, and Google all publish usage policies, and most now offer a support channel where a wrong block can be contested. Every honest appeal is a data point that helps move the margin off people like you, which no amount of clever wording will ever do.
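
The first three habits can be sketched as a small template that puts ownership and scope ahead of the question. `DefensiveRequest`, its fields, and the verb list are hypothetical names for illustration, not any provider's API. It is a way to state true facts in a consistent order, not a way around a refusal, so fill it only with what is actually the case.

```python
from dataclasses import dataclass

# The defensive jobs named in the habits above.
DEFENSIVE_VERBS = ("detect", "block", "patch", "segment")


@dataclass(frozen=True)
class DefensiveRequest:
    owner_statement: str  # who you are and what you own
    scope: str            # the systems the request is limited to
    verb: str             # the defensive job you want done
    target: str           # what you want protected

    def render(self) -> str:
        """Build the request with context first and the question last."""
        if self.verb not in DEFENSIVE_VERBS:
            raise ValueError(f"expected one of {DEFENSIVE_VERBS}, got {self.verb!r}")
        return (
            f"{self.owner_statement} "
            f"Scope: {self.scope}. "
            f"How do I {self.verb} {self.target}?"
        )


request = DefensiveRequest(
    owner_statement="This is my home network, and these are devices I bought.",
    scope="my own LAN only",
    verb="segment",
    target="the smart plugs and cameras so they cannot reach my laptops",
)
print(request.render())
```

The check on `verb` is deliberately strict: if the only way to phrase the request is with an offensive verb, a template is the wrong tool, and the honest next step is the appeal channel.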

There are two things people call jailbreaking. Framing your real intent so the model can act on it is communication. Rewording a request to defeat a safeguard is an attack on the safeguard. The first is how you should work. The second is the thing the safeguard is for.

The ethics

## The margin is annoying. The alternative is worse.

The friction is real, and it lands on the wrong people. Grant all of that. Then look at what pulled Fable 5 in the first place: a single jailbreak that turned a consumer model into a vulnerability-finding engine, cheap enough to reproduce on weaker models, serious enough that a government stepped in. A model that helps anyone, at scale, find and weaponize holes in software is not a thought experiment. It is the precise thing the margin exists to prevent.

Concede that, and the honest complaint changes shape. The problem is not that the margin exists. It is that the margin is blunt, and the fix is not to sharpen your wording until it cuts through. The fix is to make intent legible on both sides at once. You state yours plainly, and you push the providers to build the lanes that can actually verify it, so the buffer can be drawn tighter than one-size-catches-everyone.

I got my access back. I did it by explaining what I had been doing, in a sentence, to a human who could see what the classifier could not. I did not find the magic wording, and I am glad I did not go looking. The other path works too, sometimes. Every time it does, the safeguard learns nothing, and the next person's margin sits exactly as wide as it did before. Ethical use of these tools is a practice with a hard edge: do not try to make the model do the thing its safeguards are built to stop, even when you are certain your reason is good, because the reason the next person gives will sound just as good and will not be.

## Our Call

By **June 30, 2027**, at least one of the three leading US AI labs ships a verified lane for defensive-security and research use: an identity-and-purpose check that measurably relaxes the cyber safety margin for approved users, so routine defensive requests stop landing in the same consumer buffer as an attack.

The case: the false-positive cost now falls on the labs' most valuable users, the security teams and developers who pay the most and complain the loudest. Anthropic has already published both the severity framework and the admission that the margin is set wide on purpose. Once a company can name a tradeoff that precisely, it can price a product out of it. The incentive and the vocabulary are both in place.

What proves us wrong: a year from now, no leading lab offers a verified-researcher or verified-defender tier that changes classifier behavior, and defensive false positives are still handled, if at all, by manual appeal after the block. That would mean the labs decided the liability of a relaxed lane outweighs the goodwill of their most technical customers, and chose to keep eating the complaints instead.

Settles: June 30, 2027.

## Frequently asked questions

### Can you accidentally jailbreak an AI?

Yes, in the sense that matters to you. You can trip a safety filter with no intent to break anything. Providers, Anthropic among them, deliberately tune their classifiers to block requests that merely look like misuse, so a genuine defensive-security or research question, worded the way an attacker would word it, can be refused or get your account limited. The filter scores your words; your reasons never reach it.

### Why did the AI refuse my security question?

Most likely because the wording sat inside what Anthropic calls the safety margin: a buffer where the model blocks probably-benign requests to be sure it also blocks the dangerous ones hiding among them. Enumerating ports, finding default passwords, or locating known exploits reads the same whether you mean to defend or attack, so the model refuses the whole shape of the request.

### How do I ask an AI for help with security without getting flagged?

Lead with context and ownership: this is my own network, app, or account, and I want to harden it. Ask for the defensive job (detect, block, patch, segment) rather than the offensive one (exploit, break in, escalate). Keep the scope to systems you own. If a request is still refused, explain the legitimate use rather than rewording it to slip past the filter.

### Is it against the rules to ask an AI how to hack a device you own?

Usually not, but the model cannot confirm the device is yours, so a bluntly offensive request can still be blocked as a precaution. Framing it as defense of your own property, and asking how to close the hole rather than how to use it, both keeps you inside the rules and gives the model the signal it needs to help.

### What actually counts as jailbreaking?

Trying to defeat a model's safeguards: rewording, role-playing, or chaining prompts to pull out output the system is built to withhold. Clarifying your honest intent so the model can help is not jailbreaking. The test is simple. Are you explaining what you really want, or hunting for the phrasing that gets around the rule?

### Why do AI companies block harmless requests on purpose?

Because they cannot reliably tell a harmless request from a harmful one at the moment it arrives, so they set the filter to catch a wide band and accept that many innocent requests fall inside it. Anthropic said this directly when it redeployed Fable 5, calling the deliberately wide buffer a safety margin and describing the extra false positives as an accepted cost of preventing misuse.

Source notes

## References and research base

- Anthropic, the notice on redeploying Fable 5: defense-in-depth classifiers, the deliberately wide safety margin that blocks likely-benign requests, the decision to widen it for Fable 5 and accept more false positives, the Amazon-discovered vulnerability-finding jailbreak, the over-99-percent block rate of the new classifier, and the four-part severity framework (capability gain, breadth, ease of weaponization, discoverability). Anthropic.

- The dual-use nature of security knowledge and the defensive threat model for LLM and agent apps: our Primer, AI Security for Builders, and the OWASP Top 10 for LLM Applications for the standard categories.

- Acceptable-use policy and appeals are provider-specific; check the usage policy and researcher or trust-and-safety channels for whichever model you use before assuming a block is permanent.

### Source-quality note

The account that opens this piece is the author's own. The description of how Anthropic's safety classifiers and safety margin work, including the Fable 5 redeployment and the figures cited, is drawn from Anthropic's published notice, listed in the references above. The framing that the false positive is a designed cost rather than a defect, the six habits, and Our Call are this publication's argument, not Anthropic's, and should be read as such.

---
Cite as: "The False Positive Was the Point" — nextbig.dev, https://www.nextbig.dev/blog/the-false-positive-was-the-point