The first AI-infrastructure wave was about training scale. The next wave is about production control — routing, runtime, identity, safety, policy, observability, audit, retention, tool permissions, and the cost of a whole completed workflow.
The model is no longer the whole product. The runtime is.
The next durable AI-infra control plane will not only answer “which model should run?” It will also answer: “should this user or agent, in this workflow, with this data, be allowed to access this level of intelligence under this policy?”
How do we serve tokens cheaply?
The inference-optimization layer: KV cache, batching, latency, throughput, GPU scheduling, quantization, and model hosting.
How do we govern capability?
The governed-execution layer: identity, semantic risk classification, capability-tier authorization, fallback, audit, retention, and tool/action policy.
Agents turn text into action.
Once models use tools, memory, code execution, internal data, and long-running state, safety and access become runtime-engineering problems — not just model or app-layer ones.
AI infra is splitting into two layers
The first layer is inference optimization: serving tokens faster and cheaper — KV cache, batching, utilization, quantization, latency, throughput, memory bandwidth, GPU scheduling, prefill/decode optimization, and model hosting.
The second layer is governed execution: deciding whether a capability should run at all, under what permissions, with what safeguards, and with which fallback. This layer matters once models become agents — long-running, tool-using, stateful systems that act on data and workflows.
1 · Inference optimization
Goal: serve tokens faster and cheaper.
- KV cache
- Batching / utilization
- Latency / throughput
- GPU scheduling
- Quantization / hosting
2 · Governed execution
Goal: decide whether and how capability should run.
- Identity + workflow context
- Semantic risk classification
- Capability-tier authorization
- Fallback / approval / audit
- Tool permissions + retention
This is the shift from cost per token to cost and control per completed workflow. A coding agent doesn’t make one model call. It searches a codebase, plans changes, edits files, runs tests, handles failures, retries, and produces a reviewable output. A clinical or legal agent doesn’t simply produce text — it touches regulated data, invokes tools, follows approval paths, and leaves an audit trail.
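The per-workflow framing is easiest to see with arithmetic. Below is a minimal Python sketch that prices a hypothetical coding-agent run step by step and counts every retry. The step names, token counts, and retry counts are placeholders. The per-million-token prices are assumptions chosen to match the $10 / $50 figures that Anthropic quotes for Fable 5 and Mythos 5, which are discussed below.

```python
from dataclasses import dataclass

# Assumed per-million-token prices (match the Fable 5 / Mythos 5 figures cited later).
INPUT_USD_PER_M = 10.0
OUTPUT_USD_PER_M = 50.0


@dataclass
class Step:
    """One model call inside an agent workflow (hypothetical numbers)."""
    name: str
    input_tokens: int
    output_tokens: int
    attempts: int = 1  # 1 = succeeded first try; >1 means the step was retried


def step_cost(step: Step) -> float:
    """Cost in USD of one step, counting every attempt."""
    per_attempt = (
        step.input_tokens * INPUT_USD_PER_M + step.output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000
    return per_attempt * step.attempts


def workflow_cost(steps: list[Step]) -> float:
    """Cost of one completed workflow: the sum over all steps and retries."""
    return sum(step_cost(s) for s in steps)


# Hypothetical coding-agent run: search, plan, edit, then a test step retried twice.
run = [
    Step("search codebase", input_tokens=100_000, output_tokens=2_000),
    Step("plan changes", input_tokens=40_000, output_tokens=4_000),
    Step("edit files", input_tokens=60_000, output_tokens=10_000),
    Step("run tests + fix", input_tokens=50_000, output_tokens=6_000, attempts=3),
]
print(f"cost per completed workflow: ${workflow_cost(run):.2f}")  # $5.20
```

With these placeholder numbers, the retried test step accounts for $2.40 of the $5.20 total. A per-call view would show individual calls costing between about $0.60 and $1.10 each, and it would miss where retries compound.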
Anthropic’s Fable/Mythos launch made the pattern visible
Anthropic’s Claude Fable 5 / Claude Mythos 5 launch is the clearest public example of capability governance. Anthropic describes Fable 5 as a Mythos-class model made safe for general use, while Mythos 5 is the same underlying model with safeguards lifted in some areas for a small group of trusted users. For some cybersecurity, biology/chemistry, and distillation-related requests, Fable 5 falls back to Claude Opus 4.8 instead of the higher-capability Mythos-class model.[1]
What most people will see
- Better coding and reasoning benchmarks
- Longer autonomous work
- Stronger vision and scientific capabilities
- Higher-value use across cyber, software, and biology
What the infra lens sees
- Capability tiers, not one universal model surface
- Risk classifiers deciding access at runtime
- Fallback routing to a safer / lower-capability model
- Trusted-access programs for high-risk domains
- Retention and audit as part of product design
Why the details matter
Anthropic says Fable 5 and Mythos 5 are priced at $10 per million input tokens and $50 per million output tokens, and that the fallback behavior triggers in under 5% of sessions on average. It also says it will require 30-day retention for traffic on Mythos-class models to detect complex attacks, novel jailbreaks, and patterns that operate across many requests.[1]
Those are product details, but they imply a deeper architecture: request classification, capability gating, model fallback, retention, auditability, and trusted access. The release isn’t just about serving a smarter model — it’s about deciding when a higher-capability model is allowed to run.
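The gating-and-fallback behavior described above can be sketched in a few lines. This is not Anthropic's implementation, which the announcement does not publish. The domain labels, model names, and `trusted_users` allowlist are hypothetical stand-ins, and the semantic risk classifier is assumed to run upstream and return a set of domain labels.

```python
from dataclasses import dataclass

# Hypothetical tier names standing in for a Mythos-class model and its fallback.
FRONTIER = "frontier-model"
FALLBACK = "fallback-model"

# Domains where the top tier is gated (mirrors the memo's examples).
GATED_DOMAINS = {"cybersecurity", "bio_chem", "distillation"}


@dataclass(frozen=True)
class Decision:
    model: str
    reason: str  # kept for the audit trail


def route(user_id: str, risk_domains: set[str], trusted_users: set[str]) -> Decision:
    """Pick a capability tier for one request.

    `risk_domains` is assumed to come from an upstream semantic classifier.
    Users in the trusted-access set keep the top tier in gated domains;
    everyone else is routed to the safer fallback instead of being blocked.
    """
    flagged = risk_domains & GATED_DOMAINS
    if flagged and user_id not in trusted_users:
        return Decision(FALLBACK, f"gated domain(s): {sorted(flagged)}")
    return Decision(FRONTIER, "trusted access" if flagged else "no gated domain")


trusted = {"red-team-lead"}
print(route("analyst-7", {"cybersecurity"}, trusted).model)      # fallback-model
print(route("red-team-lead", {"cybersecurity"}, trusted).model)  # frontier-model
print(route("analyst-7", {"coding"}, trusted).model)             # frontier-model
```

The key design choice is that the function returns a fallback decision instead of a refusal, which keeps the product usable when the gate fires. In a real system, the classifier would carry most of the weight, not a hard-coded domain list.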
Old world: prompt → model
- User sends a prompt.
- App chooses a model.
- Model responds.
Optimizes for: quality, latency, cost.
Control surface: basic moderation / prompt filtering.
Mental model: “Which model should I call?”
New world: capability-governed execution
- User or agent sends a request.
- Identity + workflow context attached.
- Risk / domain classified.
- Policy decides the capability tier.
- Routed to frontier model, safer fallback, cheaper model — or blocked.
- Tools, actions, logging, and retention governed.
Mental model: “Should this capability run for this workflow, under this policy?”
OpenRouter is close. Traditional IAM is adjacent. The missing layer is capability access.
The natural question is whether this is simply model routing or traditional access control. It’s related to both — but identical to neither.
Traditional IAM / IGA
Governs resource access.
Helps organizations manage who can reach which applications, systems, and data. An important input to the AI control plane — but it usually doesn’t understand model capability, semantic risk, tool actions, or multi-turn agent behavior.
Asks: “Who has access to what resource?”
Model routing / gateways
Routes model & provider access.
Platforms like OpenRouter abstract many models and providers behind one interface and support routing, fallbacks, provider policies, privacy settings, budgets, and guardrails.[2]
Asks: “Which model or provider should handle this request?”
Capability control plane
Governs capability access.
Combines identity, workflow context, semantic risk, model routing, retention, tool permissions, approvals, and multi-turn adversarial detection.
Asks: “Should this user/agent access this level of intelligence for this task?”
The crisp distinction
Traditional access control governs resources. Model routers govern model selection. The missing layer governs capability access.
It would integrate with identity tools, model gateways, and AI-security products — but its core value is AI-native: semantic risk classification, capability-tier authorization, model fallback, tool/action permissions, retention/audit, and multi-turn adversarial detection.
Reference architecture: the AI capability control plane
Architecturally, the missing layer is not a prompt filter. It is an execution control plane that sits between users/agents, identity systems, AI-security tools, model routers, tool runtimes, and audit systems.
User / agent request
Prompt, files, memory, tool intent, objective.
Identity + context
Who is asking? Role, org, customer, workflow, data sensitivity.
Semantic risk
Cyber? Bio? Finance? Code execution? Distillation? Adversarial?
Policy engine
Allowed? Needs fallback? Approval? Retention?
Routing engine
Frontier, safer fallback, cheaper, self-hosted — or block.
Execution + output
Result, tool calls, actions, audit trail, session replay.
Identity / access
Users, roles, entitlements, app/resource permissions.
AI security + guardrails
Prompt defense, data-leak prevention, policy enforcement.
Model routing / gateway
Provider abstraction, fallback, latency/cost routing.
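The policy-engine box in the flow above can be sketched as an ordered rule table that uses first-match semantics. Everything here is hypothetical: the context fields, the rules, and the action names are illustrative. A production policy engine would be considerably richer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    FRONTIER = "frontier"
    FALLBACK = "safer_fallback"
    CHEAPER = "cheaper"
    SELF_HOSTED = "self_hosted"
    APPROVAL = "require_approval"
    BLOCK = "block"


@dataclass(frozen=True)
class Context:
    """Identity + workflow context attached to a request (hypothetical fields)."""
    role: str
    workflow: str
    data_sensitivity: str  # e.g. "public", "internal", "personal", "regulated"
    geography: str
    risk: str  # label from an upstream semantic risk classifier


@dataclass(frozen=True)
class Rule:
    name: str
    when: Callable[[Context], bool]
    action: Action


# Ordered, hypothetical rules: the first match wins, so the strictest come first.
RULES = [
    Rule("offensive-cyber intent is blocked",
         lambda c: c.risk == "offensive_cyber", Action.BLOCK),
    Rule("regulated data stays in-house",
         lambda c: c.data_sensitivity == "regulated", Action.SELF_HOSTED),
    Rule("EU personal data stays in-region",
         lambda c: c.geography == "EU" and c.data_sensitivity == "personal",
         Action.SELF_HOSTED),
    Rule("cyber work outside the security team falls back",
         lambda c: c.risk == "cyber" and c.role != "security", Action.FALLBACK),
    Rule("production deploys need sign-off",
         lambda c: c.workflow == "deploy_to_prod", Action.APPROVAL),
    Rule("bulk summarization goes to a cheaper model",
         lambda c: c.workflow == "bulk_summarize", Action.CHEAPER),
]


def decide(ctx: Context) -> tuple[Action, str]:
    """Return the action and the name of the rule that fired (for audit)."""
    for rule in RULES:
        if rule.when(ctx):
            return rule.action, rule.name
    return Action.FRONTIER, "default allow"


ctx = Context(role="engineer", workflow="incident_triage",
              data_sensitivity="internal", geography="EU", risk="cyber")
action, rule = decide(ctx)
print(action.value, "|", rule)  # safer_fallback | cyber work outside the security team falls back
```

Returning the rule name along with the action is what lets the audit layer explain each decision later. First-match ordering keeps precedence explicit, at the cost of requiring the rules to be ordered carefully.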
The four product layers
1. Identity + context
The system needs the user, organization, role, customer, workflow, data classification, tool permissions, and autonomy level.
2. Semantic risk classification
It must classify intent and risk across prompts, retrieved data, tool calls, outputs, and multi-turn behavior.
3. Routing + action policy
It decides whether to allow the frontier model, route to fallback, require approval, redact data, block, or restrict tool use.
4. Audit + observability
It reconstructs why a model was chosen, what data was sent, what tools were called, what policy fired, and what the agent did.
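Layer 4 depends on logging decisions, not just prompts. The minimal sketch below assumes a hypothetical append-only JSON-lines event log. Each event records the model actually used, the policy rule that fired, the data classes sent, and the tools called, so that a session can be replayed in order afterwards. The field names are illustrative and do not follow any standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditEvent:
    """One governed step in an agent session (hypothetical schema)."""
    session_id: str
    seq: int          # order within the session
    actor: str        # user or agent identity
    model: str        # model actually used after routing
    policy_rule: str  # which rule fired
    data_classes: list[str] = field(default_factory=list)  # what data was sent
    tool_calls: list[str] = field(default_factory=list)


def append(log: list[str], event: AuditEvent) -> None:
    """Append one event as a JSON line (stand-in for durable storage)."""
    log.append(json.dumps(asdict(event), sort_keys=True))


def replay(log: list[str], session_id: str) -> list[str]:
    """Reconstruct a session: why each model was chosen and what it did."""
    events = [json.loads(line) for line in log]
    session = sorted((e for e in events if e["session_id"] == session_id),
                     key=lambda e: e["seq"])
    return [
        f'{e["seq"]}: {e["actor"]} -> {e["model"]} (rule: {e["policy_rule"]}); '
        f'data={e["data_classes"]} tools={e["tool_calls"]}'
        for e in session
    ]


log: list[str] = []
# Events may arrive out of order; replay sorts them by sequence number.
append(log, AuditEvent("s1", 2, "agent:fixer", "frontier-model", "default allow",
                       ["source_code"], ["run_tests"]))
append(log, AuditEvent("s1", 1, "agent:fixer", "safer-fallback", "cyber gate",
                       ["source_code"], ["grep_repo"]))
for line in replay(log, "s1"):
    print(line)
```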
Everyone is doing a piece. No one cleanly owns the layer.
The category is fragmented across model routers, AI gateways, enterprise governance tools, LLM firewalls, runtime-protection vendors, DLP, observability, and evals. The market isn’t empty — but the full, independent capability-governance layer is still not cleanly owned.
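Where Lumos fits: the top-right of a map that plots native model routing on the horizontal axis and capability governance on the vertical axis. Lumos combines high capability governance with native model routing, the one corner of the map that no incumbent owns outright.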
Routers climb in from the right; guardrail and governance vendors climb in from the left; model labs hold the high-governance ceiling, but only for their own models. Lumos is positioned as the independent control plane that spans both axes — identity- and workflow-aware governance with first-class routing and fallback — rather than a guardrail bolted onto a single provider.
Lumos
The independent capability control plane this memo describes: identity + workflow context, semantic risk classification, capability-tier authorization, model routing & fallback, tool/action permissions, and retention/audit — across providers, not locked to one.
OpenRouter
Closest public comp for model/provider routing. Its docs show controls for budgets, model/provider restrictions, privacy policies, zero-data-retention, prompt-injection defense, and sensitive-information handling.[2]
Cloudflare AI Gateway
Network/infrastructure gateway with caching, rate limiting, guardrails, dynamic routing, and data-loss prevention.[3]
Portkey / Palo Alto
Validation that AI gateways are becoming enterprise security control planes. Palo Alto framed Portkey as helping monitor, orchestrate, and govern autonomous agents at scale.[4]
Credal
Close on enterprise-agent governance: permission-aware data access, policy controls, deployment, and audit logging for internal agents.[5]
Lakera
AI firewall / guardrail surface: prompt defense, jailbreak and injection detection, content moderation, and data-leak protection.[6]
The investable wedge isn’t “guardrails.” It’s policy-aware execution.
A simple prompt filter is too narrow. The more interesting wedge is a policy-aware execution layer for high-capability agentic workflows.
Where this matters first
Cybersecurity
The same capability can support defenders or accelerate attackers. The control layer has to distinguish defensive workflows from offensive misuse and govern tools/actions accordingly.
Life sciences
Scientific models can accelerate research while creating dual-use risk. Trusted access, audit, and domain-specific policy become product requirements.
Coding agents
Code agents need repo context, sandboxing, test execution, tool permissions, rollback, and human review. The model is only one part of the runtime.
Healthcare / legal
Sensitive data, compliance, retention, explainability, and human approval make generic routing insufficient.
Finance
Data sensitivity, regulated advice, audit requirements, and model-selection policy make governance inseparable from deployment.
Enterprise agents
Agents with internal tools and write access need contextual authorization — not just static app permissions.
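Contextual authorization differs from a static app permission because the decision depends on the specific action and its context, not only on who the agent is. The minimal sketch below uses hypothetical system names and autonomy levels. Reads are allowed within the agent's grant. Writes require either an autonomous-write grant or human approval, and writes that touch regulated data always go to a human. Anything outside the grant is denied.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class AgentGrant:
    """What this agent may touch in this workflow (hypothetical fields)."""
    systems: frozenset[str]  # e.g. {"crm", "ticketing"}
    autonomy: int            # 0 = read-only, 1 = writes need approval, 2 = autonomous writes


@dataclass(frozen=True)
class ToolCall:
    system: str
    is_write: bool
    touches_regulated_data: bool = False


def authorize(grant: AgentGrant, call: ToolCall) -> Verdict:
    """Decide a single tool call in context, not from a static role alone."""
    if call.system not in grant.systems:
        return Verdict.DENY
    if not call.is_write:
        return Verdict.ALLOW
    if grant.autonomy == 0:
        return Verdict.DENY
    # Writes on regulated data always go to a human, whatever the autonomy level.
    if grant.autonomy == 1 or call.touches_regulated_data:
        return Verdict.NEEDS_APPROVAL
    return Verdict.ALLOW


grant = AgentGrant(systems=frozenset({"crm", "ticketing"}), autonomy=2)
print(authorize(grant, ToolCall("crm", is_write=False)).value)  # allow
print(authorize(grant, ToolCall("crm", is_write=True,
                                touches_regulated_data=True)).value)  # needs_approval
print(authorize(grant, ToolCall("payroll", is_write=False)).value)  # deny
```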
Questions that separate a platform from a prompt filter
For companies building this layer
- Can you show the full inference and tool-call graph?
- Where do you attach identity, role, customer, and workflow context?
- Can policies vary by user, data sensitivity, customer, geography, and workflow?
- Can you route to a safer model instead of simply blocking?
- Can you detect multi-turn jailbreaks, or only single-turn attacks?
- Can you govern tool calls and write actions, not just prompts?
- Can you reconstruct what an agent did across an entire session?
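The multi-turn question is where single-prompt filters fall short: each message can look benign while the session as a whole is not. The minimal sketch below assumes a hypothetical upstream classifier that scores each turn between 0 and 1. The session score is an exponentially decayed running sum, so several moderately risky turns can trip the threshold even when no single turn does. The decay and threshold values are arbitrary placeholders, not tuned numbers.

```python
def session_risk(turn_scores: list[float], decay: float = 0.8) -> float:
    """Exponentially decayed running sum of per-turn risk scores."""
    total = 0.0
    for score in turn_scores:
        total = total * decay + score
    return total


def flag_session(turn_scores: list[float], per_turn: float = 0.9,
                 cumulative: float = 1.5, decay: float = 0.8) -> bool:
    """Flag if any single turn is high-risk OR the decayed sum crosses a threshold."""
    single_turn_hit = any(s >= per_turn for s in turn_scores)
    return single_turn_hit or session_risk(turn_scores, decay) >= cumulative


gradual = [0.6, 0.6, 0.6, 0.6]  # no single turn crosses the per-turn threshold
benign = [0.1, 0.2, 0.1]
print(flag_session(gradual), flag_session(benign))  # True False
```

The decay lets a session recover after an isolated risky turn instead of accumulating suspicion forever. A single-turn filter applied to the `gradual` session would pass every message.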
For AI application companies
- What is your cost per completed workflow, not per API call?
- Which model calls are latency-sensitive vs. batchable?
- Where do retries explode?
- Which data is too sensitive to leave your environment?
- Where does human approval enter the loop?
- What actions can the agent take, and who owns the audit trail?
- Which part of the workflow would you pay to never think about again?
What could make the thesis wrong?
- Model labs absorb the whole layer. Each frontier provider may build its own capability governance, and customers may accept provider-specific controls.
- Existing gateways expand fast enough. Model routers, AI gateways, and security platforms may converge on this layer before a standalone category forms.
- Agents stay mostly read-only. Capability governance matters most when agents take actions. If usage stays read-only, lighter guardrails may suffice for longer.
- Security incumbents win distribution. The layer may become a module inside large identity, network, or security platforms rather than an independent company.
How to say it in one paragraph
Model routing is becoming strategic, but routing alone isn’t the whole category. Anthropic’s Fable/Mythos launch points at a deeper AI-native layer: capability governance. The missing product isn’t just “who can access this app?” or “which model should handle this request?” It is “should this user or agent, in this workflow, with this data, be allowed to access this level of intelligence — and if so, under what routing, retention, audit, fallback, and tool-action policy?”
Three takeaways
References and research base
[1] Anthropic, “Claude Fable 5 and Claude Mythos 5,” June 9, 2026. Used for the same-model distinction, fallback to Opus 4.8, the under-5% fallback statement, pricing, trusted access, safety classifiers, 30-day retention, and capability examples.
[2] OpenRouter documentation: Guardrails, Provider Routing, and Zero Data Retention. Used for model/provider restrictions, spending limits, data policies, privacy settings, and routing behavior.
[3] Cloudflare, AI Gateway documentation. Used for caching, rate limiting, guardrails, dynamic routing, and data-loss prevention.
[4] Palo Alto Networks, “Palo Alto Networks Completes Acquisition of Portkey to Secure AI Agents.” Used for the AI-gateway / autonomous-agent control-plane framing.
[5] Credal, homepage and security pages. Used for enterprise-agent governance, permission-aware data access, audit logging, and policy enforcement.
[6] Lakera, Lakera Guard documentation. Used for prompt defense, jailbreak/prompt-injection detection, content moderation, and data-leak framing.
[7] Noma Security, “AI Runtime Protection.” Used for monitoring prompts, responses, tool calls, MCP-server interactions, and agent-to-agent communications.
[8] SentinelOne, “SentinelOne to Acquire Prompt Security.” Used for Prompt Security as runtime GenAI security, data-leak prevention, prompt-injection enforcement, and agent protection.
Source-quality note
This memo combines the Anthropic Fable/Mythos article and the inference-infrastructure thesis with primary or official sources from Anthropic, OpenRouter, Cloudflare, Palo Alto Networks, Credal, Lakera, Noma, and SentinelOne. Company-positioning claims — including market-map placement — should be read as a thesis map rather than a diligence conclusion; each vendor’s actual product depth would need customer calls and technical testing.