The Signal

Perspectives on AI governance, enterprise risk, and the infrastructure layer the industry is still building.

Measuring What Matters: A New Standard for AI Runtime Enforcement

Introducing AREB — the Algedonic Runtime Enforcement Benchmark — and the numbers that prove it works.

Sandeep Gopisetty

May 16, 2026

Look at the published benchmarks for AI infrastructure today.

Kong's marketing for its AI Gateway cites 28,000+ RPS and sub-10 ms overhead, though the published benchmark itself measures P95 latency at ~24 ms and P99 at ~30 ms against a WireMock-mocked OpenAI endpoint. TensorZero's is tighter: sub-millisecond P99 (~0.94 ms) at 10,000 QPS, with the load generator co-located on the same instance as the gateway and the mock LLM. TrueFoundry and LiteLLM publish gateway-overhead numbers in the single-digit-millisecond range. Portkey's own latency claim is in that neighborhood too, though it publishes no first-party benchmark methodology, only a GitHub README tagline and a blog that quotes "sub-10 ms" and "<40 ms" on the same page. The Open Policy Agent docs report per-decision evaluation times in the tens of microseconds: roughly 36 µs median and 134 µs P99 from opa bench on a sample RBAC policy. A third-party arXiv benchmark measures Llama Guard 3 1B at ~165 ms per call on A30 GPUs. Lakera markets sub-50 ms guardrails.

All those numbers are real, and most are honestly measured. Yet none of them tells you what it actually costs to govern an AI agent in production.

Here is the dirty secret: every one of those benchmarks measures a part of the problem in isolation, with the other parts disabled, and frames it as a complete answer. Kong's published methodology configures the gateway in proxy-only mode — Kong specifically calls out caching and API-key authentication as disabled. TensorZero's disables observability for comparability with LiteLLM (they note their async observability "wouldn't materially affect" latency, which is plausible but unmeasured in the published number). Guardrail vendors publish single-classifier latency for one input rail at a time. OPA measures policy decisions on structured JSON inputs — it has no idea what a prompt is, what a tool call is, what an agent's plan looks like.

If you are responsible for putting autonomous AI agents into production at a bank, a hospital, an insurer, or any other regulated environment, the number you need is not "how fast does the proxy hop?" or "how fast does the classifier run?" The number you need is: what does it cost, end to end, to enforce policy on every step of an agent's trajectory — pre-action and post-action, with observability and audit emission turned on, on traffic that mixes clean prompts with PII, tool calls, and jailbreak attempts?

That number does not exist in the published literature. Which is why we built AREB — the Algedonic Runtime Enforcement Benchmark.

The three buckets of incomplete benchmarks

The benchmarking gap becomes obvious when you lay the three current categories of AI infrastructure side by side.

1. Policy decision engines

Engines like Open Policy Agent are extraordinary at what they do. opa bench on OPA's sample RBAC policy reports decision times in the tens of microseconds — roughly 36 µs median and 134 µs P99. Deployed as a sidecar via OPA-Envoy External Authorization, third-party measurements (Solo.io's 2021 Performance tuning for ExtAuth using OPA) put the cost in the low single-digit milliseconds at P95 with a NOP policy, and the OPA docs themselves have historically noted that adding a NOP-policy authz sidecar at least doubles tail latency between p90 and p99.

But OPA's input is a structured JSON document; its Rego policy evaluates against fields, not semantics. OPA does not look at prompts, reason about agent plans, validate tool calls, or inspect model output. It answers a precise structured question precisely.

2. AI gateways

Gateways measure the cost of being a proxy between a client and an LLM. Their published numbers highlight a clear limitation:

Gateway	Published number	Methodology highlight	Policy in path?
Kong AI Gateway	Marketing: 28k+ RPS, sub-10 ms overhead. Benchmark: P95 ~24 ms / P99 ~30 ms	K6, 400 VUs, c5.4xlarge, WireMock OpenAI; proxy-only config (no caching, no API-key auth)	No — proxy-only
TensorZero	<1 ms P99 (~0.94 ms) @ 10k QPS	c7i.xlarge, observability disabled, load-gen co-located with gateway and mock LLM	No — logging off
TrueFoundry	3–5 ms overhead @ 250 RPS	LLM-Locust, 1 vCPU / 1 GB, fake OpenAI endpoint	Not stated
LiteLLM	8 ms P95 overhead @ ~1,170 RPS (4 instances)	Locust, 4 vCPU / 8 GB per instance, callbacks empty, no Redis	Not stated
Portkey	Self-claim <1 ms (GitHub) / sub-10 ms (blog)	No first-party benchmark methodology	Not stated

No cell in that "Policy in path?" column says "Yes": each one is either "No" or "Not stated." These are the right numbers to publish if what you ship is a proxy. They are misleading numbers if what you actually ship is an enforcement layer.

3. Guardrail and content-safety products

Products like NeMo Guardrails, Llama Guard, Lakera, Azure AI Content Safety, AWS Bedrock Guardrails, Cloudflare AI Gateway, and IBM Granite Guardian measure a single classifier in isolation, or a bundle of parallel input rails. NVIDIA publishes the most concrete number in this space: NeMo Guardrails orchestrating up to five GPU-accelerated guardrails in parallel adds ~0.5 seconds of latency. The rest range from Lakera's <50 ms marketing claim to Microsoft's documented 100–300 ms per request for sequential Azure AI Content Safety filters to Cloudflare's documented ~500 ms per request when Llama Guard 3 8B runs on Workers AI in the AI Gateway path. These are useful numbers if you are buying one guardrail to slot in front of one model.

Now here is the question none of those numbers answer: What does it cost to do all three at once on the same request, on every step of an agent's trajectory, with observability on?

The missing shape: end-to-end agent governance

An autonomous agent in a regulated environment doesn't make one LLM call and stop. It plans. It calls a tool. It reads the result. It calls another tool. It synthesizes. It returns to the user. Each step is an opportunity for the agent to drift outside policy — to exfiltrate regulated data, to invoke a tool it shouldn't, to act on a prompt-injection that arrived in a tool result.

Governing this means enforcing policy at two places on every step:

PEP-A — pre-action enforcement runs before the model call (or before the tool invocation). It answers: Can this agent call this model? Is the prompt about to leave its trust zone? Does it contain regulated data? Is this action allowed for this role at this risk level? It combines structured policy decisions (the OPA bucket) with semantic ones (the guardrail bucket) — intent classification, capability check, PII screening, all together, on the critical path, before the request egresses.

PEP-B — post-action enforcement runs after the model returns, or before a tool side-effect lands. It answers: Is the output leaking PII? Is the agent about to invoke an unauthorized tool with arguments outside the allowed schema? Is the content within policy? Should this be redacted, blocked, or audited?

To govern an agent end-to-end you need both of these enforcement points active on every step. You need the audit trail. You need the decision distribution — how often did the layer allow, how often did it redact, how often did it block, broken out by policy bundle.
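The two enforcement points can be pictured as a wrapper around a single agent step. The following is a minimal sketch under stated assumptions, not the AREB reference implementation: the names Verdict, GovernedStep, pep_a, pep_b, and call_model are hypothetical, and the policy checks are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    decision: str  # one of "allow", "warn", "redact", "block"
    payload: str   # content to pass on, possibly redacted


@dataclass
class GovernedStep:
    """Hypothetical wrapper: PEP-A before the model call, PEP-B after it."""
    pep_a: Callable[[str], Verdict]  # pre-action check on the outgoing prompt
    pep_b: Callable[[str], Verdict]  # post-action check on the model output
    audit_log: list = field(default_factory=list)

    def run(self, prompt: str, call_model: Callable[[str], str]) -> Verdict:
        pre = self.pep_a(prompt)
        self.audit_log.append(("pep_a", pre.decision))
        if pre.decision == "block":
            return pre  # the request never egresses
        output = call_model(pre.payload)
        post = self.pep_b(output)
        self.audit_log.append(("pep_b", post.decision))
        return post


# Toy usage with stand-in policies and a stand-in model.
step = GovernedStep(
    pep_a=lambda p: Verdict("block" if "secret" in p else "allow", p),
    pep_b=lambda o: Verdict("allow", o),
)
result = step.run("hello", call_model=lambda p: p.upper())
```

Every decision lands in the audit log whether or not the request proceeds, which is how the decision distribution described above would be recovered afterwards.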

That is what no existing benchmark measures. It is also what enterprises are actually buying when they buy "AI governance."

Introducing AREB

The Algedonic Runtime Enforcement Benchmark — AREB — is the methodology for filling that gap.

AREB defines:

Two enforcement points (PEP-A pre-action, PEP-B post-action) that every conformant implementation must instrument separately.
Four test flows: baseline, PEP-A only, PEP-B only, and full (PEP-A + PEP-B), so the per-stage cost can be isolated.
A policy-bundle catalogue spanning the realistic complexity range — from allow-all (the framework floor) through RBAC, regex PII, OPA/Rego, tool-call inspection, intent classification, multi-signal risk scoring, and full audit with structured telemetry.
A structured trace schema that every request must emit, with per-stage timings (intent analysis, policy evaluation, redaction, audit emission) and a recorded decision class (allow, warn, redact, block).
A required prompt mix — 70% clean prompts, 10% PII-laden prompts, 10% tool-call requests, 5% jailbreak attempts, 5% long-context prompts — so the decision distribution under measurement is non-degenerate.
A concurrency sweep from 1 to 250 concurrent clients with throughput-vs-saturation reported.
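The trace schema and prompt mix listed above could be encoded roughly as follows. This is a sketch, not the published AREB schema: the field names, the flow labels, and the helper sample_prompt_classes are assumptions chosen to mirror the stages and percentages named in the list.

```python
import random
from dataclasses import dataclass

# Prompt mix from the list above, as percent of traffic per class.
PROMPT_MIX = {"clean": 70, "pii": 10, "tool_call": 10, "jailbreak": 5, "long_context": 5}
DECISION_CLASSES = ("allow", "warn", "redact", "block")
FLOWS = ("baseline", "pep_a_only", "pep_b_only", "full")  # hypothetical labels


@dataclass
class TraceRecord:
    """One per request; field names are illustrative, not the official schema."""
    bundle: str
    flow: str
    decision: str
    intent_analysis_ms: float
    policy_evaluation_ms: float
    redaction_ms: float
    audit_emission_ms: float

    def __post_init__(self):
        if self.flow not in FLOWS:
            raise ValueError(f"unknown flow: {self.flow}")
        if self.decision not in DECISION_CLASSES:
            raise ValueError(f"unknown decision class: {self.decision}")

    @property
    def enforcement_ms(self) -> float:
        # Sum of the per-stage timings recorded for this request.
        return (self.intent_analysis_ms + self.policy_evaluation_ms
                + self.redaction_ms + self.audit_emission_ms)


def sample_prompt_classes(n: int, seed: int = 0) -> list:
    """Draw n prompt classes according to PROMPT_MIX (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.choices(list(PROMPT_MIX), weights=list(PROMPT_MIX.values()), k=n)
```

Rejecting unknown flow and decision labels at construction time keeps a malformed trace from silently skewing the decision distribution.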

The headline metric is Algedonic Overhead = Governed end-to-end latency − Baseline end-to-end latency, reported at P50, P95, and P99 in absolute milliseconds, alongside the per-stage breakdown and the decision-class distribution per policy bundle.

Critically, AREB reports overhead in absolute milliseconds, not as a percentage of total request time. Percentage overhead is misleading when LLM inference dominates the request: adding 25 ms to an 800 ms LLM call looks like "3% overhead," a figure that shrinks as inference gets slower and hides the engineering reality of what that 25 ms actually buys you (or costs you).
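To make the headline metric concrete, here is a small sketch of how Algedonic Overhead might be computed from two latency samples. It assumes the overhead at each percentile is the difference between the governed and baseline percentiles, and it uses a nearest-rank percentile; the source does not specify either choice, and the function names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty sample, for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def algedonic_overhead(governed_ms, baseline_ms, percentiles=(50, 95, 99)):
    """Governed minus baseline end-to-end latency, in absolute milliseconds."""
    return {
        f"P{p}": percentile(governed_ms, p) - percentile(baseline_ms, p)
        for p in percentiles
    }


# The 25 ms on 800 ms example from the text, as a degenerate sample.
baseline = [800.0] * 100
governed = [825.0] * 100
overhead = algedonic_overhead(governed, baseline)  # 25.0 ms at every percentile
as_percent = 100 * overhead["P95"] / percentile(baseline, 95)  # 3.125
```

The same 25 ms would read as about 1.6% against a 1,600 ms call, which is why the absolute figure is the one that stays comparable across upstream models.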

What AREB explicitly disallows

If AREB is going to function as a standard rather than another vendor benchmark, it has to define what disqualifies a result. AREB explicitly prohibits:

Disabling observability, logging, or audit emission to lower the headline figure. TensorZero's published number is measured with observability disabled, and Kong's with the gateway in proxy-only mode. They are within their rights to publish those numbers, but AREB does not permit it. Audit-on is the production case.
Reporting a NOP policy bundle as the headline result. NOP-policy runs are useful for isolating framework overhead, but they must be labelled and presented as a separate row, not as "the product number."
Co-locating the load generator and the gateway on the same host without disclosing it. Where co-location is unavoidable, it must be disclosed and the resulting proxy-hop component of the figure must be qualified as excluding network.
Reporting only median latency. P95 and P99 are required.
Reporting only percentage overhead. Absolute milliseconds at every percentile are required, because a percentage shrinks toward zero when LLM inference dominates the request time.
Reporting a simulated component as a measured one. Where any stage uses a stub or calibrated placeholder rather than a real model in the path, it must be labelled as such and excluded from the headline figure. (This rule is new in v2, and it is the reason this update looks the way it does — see below.)

A run that fails any of these rules is not an AREB run. The integrity of cross-vendor comparison depends on this strictness.
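The disqualification rules lend themselves to a mechanical check over a run's metadata. The sketch below is one way such a check might look; RunReport and its fields are hypothetical names invented for illustration, not part of any published AREB tooling.

```python
from dataclasses import dataclass


@dataclass
class RunReport:
    """Hypothetical self-description a vendor would attach to a result."""
    audit_enabled: bool
    headline_is_nop_policy: bool
    load_gen_colocated: bool
    colocation_disclosed: bool
    percentiles_reported: set
    reports_absolute_ms: bool
    simulated_stage_in_headline: bool


def areb_violations(report: RunReport) -> list:
    """Return the list of rules the run breaks; an empty list means conformant."""
    violations = []
    if not report.audit_enabled:
        violations.append("observability or audit emission disabled")
    if report.headline_is_nop_policy:
        violations.append("NOP policy bundle reported as headline")
    if report.load_gen_colocated and not report.colocation_disclosed:
        violations.append("undisclosed load-generator co-location")
    if not {"P95", "P99"} <= report.percentiles_reported:
        violations.append("P95 and P99 not both reported")
    if not report.reports_absolute_ms:
        violations.append("percentage overhead only")
    if report.simulated_stage_in_headline:
        violations.append("simulated stage included in headline figure")
    return violations
```

Returning every violation rather than stopping at the first makes a rejected submission easier to correct in one pass.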

v2: reporting only what we actually measured in the path

This update reports only the five bundles that run entirely in-process, with no simulated component anywhere on the path: allow-all, rbac, tool-call-inspection, regex-pii, and opa-rego (the last calling a real Open Policy Agent sidecar over HTTP). Every number below is the cost of real policy code executing on a real request. The three classifier-bearing bundles move to Phase-2 v2, where they will be measured with real Llama Guard 3 and Granite Guardian inference rather than a stub — and reported as classifier-inference-cost plus framework overhead, not as a single blended figure.

The result of that discipline is a smaller table and a much stronger claim.

The real-world data: Phase 2 production results (v2)

While Phase 1 validated the framework against a mock LLM, Phase 2 ran the same protocol against a real production endpoint under heavy load.

Test parameters

Environment: AWS EC2 instance in us-east-2 (same region as the OpenAI endpoint to minimize regional network variance).
Upstream model: OpenAI's gpt-4o-mini.
Traffic volume: 125,000 total requests swept across the five in-process policy bundles and seven concurrency levels (1, 5, 10, 25, 50, 100, 250) over the measurement window — 35 test cells, all completed, zero failures.
System integrity: Full observability and audit emission active on every run — no configuration flags disabled, no simulated stages in the reported set.
Reliability: Zero 429 rate-limit failures across all 35 cells. OpenAI's effective throughput ceiling sat somewhere between concurrency 100 and 250 — the endpoint queued rather than rejecting — but the harness never bottlenecked.
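The 35 test cells are simply the cross product of the five bundles and seven concurrency levels listed above. A sketch of that sweep grid, with hypothetical variable names:

```python
from itertools import product

BUNDLES = ("allow-all", "rbac", "tool-call-inspection", "regex-pii", "opa-rego")
CONCURRENCY_LEVELS = (1, 5, 10, 25, 50, 100, 250)

# One cell per (bundle, concurrency) pair: 5 x 7 = 35 cells.
cells = list(product(BUNDLES, CONCURRENCY_LEVELS))
```

How the 125,000 requests were apportioned across those cells is not stated in the text, so the sketch deliberately leaves per-cell request counts out.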

The headline production table

Bundle	PEP-A P95 (range across c=1..250)	PEP-B P95 (range across c=1..250)	Decisions
allow-all	0.03 ms	0.04 ms	100% allow
rbac	0.05–0.06 ms	0.03 ms	100% allow
tool-call-inspection	0.04 ms	0.04 ms	100% allow
regex-pii	0.09 ms at c=1 → 0.58 ms warm	0.39–0.41 ms	87.6% allow · 12.4% redact
opa-rego	0.97–1.27 ms	0.03 ms	100% allow

Read the per-bundle range as the variation across the seven concurrency levels. The variation within each row is under one millisecond. The worst-case P95 anywhere in the table is 1.27 ms — the real-OPA bundle at maximum concurrency. Every in-process AREB enforcement path is sub-2 ms P95, with audit emission on, across the full concurrency sweep.

A note on total latency before we go further

The careful reader will notice the elephant in the room: if AREB enforcement is sub-2 ms but total request time is multi-second, something else is responsible for those multi-second numbers.

total_latency_ms is dominated entirely by OpenAI inference time — multi-second P50 and a long P95 tail, because the AREB prompt mix is heavier than typical single-prompt benchmark traffic (it includes 5% long-context prompts at ~1,000 input tokens each and elicits 70+ output tokens per response on average). That mix is part of the methodology and is what produces realistic decision distributions on the post-inference side.

Critically, the variance is essentially identical across every bundle, including allow-all, which adds no enforcement work. The tail behaviour of gpt-4o-mini under sustained load is gpt-4o-mini's tail behaviour. AREB's job is to measure the marginal cost of enforcement on top of that — which is what pep_a_ms and pep_b_ms give us.

Four things the data tells us

1. Enforcement overhead is essentially independent of concurrency

The per-bundle range across c=1 to c=250 is under one millisecond for every bundle. opa-rego, the bundle with the most concurrency-sensitive component (HTTP round-trips to a real OPA sidecar), varies from 0.97 ms at c=50 to 1.27 ms at c=250 — small enough to be operational noise. This is the key result, because it means AREB enforcement cost is essentially a property of what the policy does, not how loaded the system is. It stays bounded under load, which is the property a CISO is buying when they buy an enforcement layer.

2. The framework floor is sub-millisecond

allow-all, rbac, and tool-call-inspection all sit at sub-tenth-of-a-millisecond P95 on both stages. That is the cost of having an enforcement layer turned on, with audit emission active, doing minimal policy work. Everything above that floor in the more complex bundles is the cost of the policy work itself, and that is exactly what shows up in the higher rows.

3. Real OPA in the path costs about 1 millisecond

The opa-rego bundle calls a real Open Policy Agent sidecar via HTTP on every PEP-A invocation. We measured 0.97–1.27 ms P95. That's faster than the published OPA-Envoy External Authorization figures (Solo.io reports P95 ≈ 2.5–3 ms; the OPA docs note that adding a NOP-policy authz sidecar at least doubles tail latency between p90 and p99). The difference is the deployment model: the enforcement process calls the OPA sidecar directly over HTTP rather than through Envoy's ext_authz gRPC interface, so we don't pay the Envoy hop. Either way, the number gives us a real OPA-cost data point measured inside the AREB methodology.
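For readers unfamiliar with the direct-HTTP deployment model, a call to an OPA sidecar through its REST Data API looks roughly like the sketch below. The base URL, the policy path areb/rbac/allow, and the input fields are assumptions for illustration; the actual AREB policy layout is not given in this post. Only the Python standard library is used.

```python
import json
import urllib.request


def build_opa_request(base_url: str, policy_path: str, input_doc: dict):
    """Build a POST to OPA's Data API: /v1/data/<path> with an {"input": ...} body."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def query_opa(base_url: str, policy_path: str, input_doc: dict,
              timeout_s: float = 0.5) -> bool:
    """Return True only if the policy rule evaluates to true."""
    request = build_opa_request(base_url, policy_path, input_doc)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    # OPA omits "result" when the rule is undefined; treat that as deny.
    return payload.get("result") is True


# Hypothetical usage against a sidecar on OPA's default port:
# allowed = query_opa("http://localhost:8181", "areb/rbac/allow",
#                     {"role": "analyst", "action": "call_model"})
```

Treating a missing result as a denial is a fail-closed choice made here for the sketch; whether AREB's bundle fails open or closed is not stated in the text.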

4. The decision distribution is non-degenerate where the policy can act

regex-pii triggered a 12.4% redact rate across the sweep — matching the 10% PII prompts in the mix plus a small fraction of incidental email matches in clean prompts. That means the per-stage timings for that bundle reflect the cost of policy work that includes the redact path, not just the green-path cost. A benchmark that only ever returns "allow" tells you nothing about the cost of the redact path or the block path.
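A regex-based PII bundle and the resulting decision distribution can be sketched as follows. The two patterns are deliberately simple stand-ins and are assumptions, not the catalogue AREB ships; the function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative patterns only; a production catalogue would be far larger.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def regex_pii(text: str):
    """Return (decision, text), redacting any matches in place."""
    redacted = text
    for pattern, label in ((EMAIL_RE, "[EMAIL]"), (SSN_RE, "[SSN]")):
        redacted = pattern.sub(label, redacted)
    decision = "redact" if redacted != text else "allow"
    return decision, redacted


def decision_distribution(decisions):
    """Percentage of requests per decision class, rounded to one decimal."""
    counts = Counter(decisions)
    total = sum(counts.values())
    return {d: round(100 * c / total, 1) for d, c in counts.items()}


# A clean prompt that happens to contain an email still takes the redact path,
# which is the kind of incidental match described above.
decision, safe_text = regex_pii("Please reply to bob@example.com by Friday")
```

Because the redact path does a substitution rather than just a search, its timing is the one that shows up in the warm PEP-A figure for this bundle.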

The other four in-process bundles return 100% allow on this prompt mix by construction (allow-all and rbac have no content-conditional disposition; tool-call-inspection only acts when the model emits a tool call to validate; opa-rego passes the structured RBAC check on this traffic). The block path is exercised by the classifier-bearing bundles, which is exactly why they move to Phase-2 v2 with real models — measuring the block path honestly requires a real classifier, not a stub.

A transparent note on the GIL, for Phase-2 v2

There is an engineering nuance that systems purists will rightly raise the moment we swap a real, heavy, CPU-bound classifier into the path. In the I/O-dominated regime measured here, where workers spend seconds waiting on an upstream OpenAI response and the in-process policy work is sub-2 ms, a single-process asyncio event loop handles the concurrency without measurable degradation. The flat, concurrency-independent latency above is the evidence.

But a real on-host classifier model is CPU-bound, and CPU-bound work serializes at CPython's Global Interpreter Lock. Our internal closed-form modeling predicts that once localized policy CPU work pushes into the 5–10 ms range (the regime of complex regex catalogs, nested JSON schemas, or a co-located classifier), a single-process asyncio architecture degrades super-linearly under load. To hold a flat latency floor in true high-concurrency production environments, moving to a multi-process worker pool (Gunicorn fronting multiple Uvicorn workers) becomes an architectural necessity.
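The intuition behind that boundary can be shown with a deliberately crude burst model. This is not the closed-form model referred to above, and it captures only linear queueing in a single burst, not the super-linear degradation under sustained load. It assumes that all requests arrive together, that CPU-bound policy work runs strictly one at a time within a process, and that each worker process has its own core.

```python
import math


def burst_worst_case_ms(cpu_ms: float, concurrency: int, workers: int = 1) -> float:
    """Time until the last request in a simultaneous burst finishes its policy work.

    Within one process the CPU-bound work is serialized, so the last request
    waits behind everything queued ahead of it on the same worker.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    queue_depth = math.ceil(concurrency / workers)
    return queue_depth * cpu_ms


# Sub-0.1 ms policy work barely registers even at c=250 in one process...
light = burst_worst_case_ms(0.05, 250)
# ...while 5 ms of CPU work per request queues up to over a second,
heavy_single = burst_worst_case_ms(5.0, 250)
# and spreading the burst over 8 worker processes cuts the queue depth to 32.
heavy_pool = burst_worst_case_ms(5.0, 250, workers=8)
```

Even this simple picture explains why the measured in-process bundles look flat across the sweep while a co-located classifier would not.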

We are calling this out before it bites, because the integrity of a benchmark standard depends on it. Phase-2 v2 will map this exact boundary empirically by swapping the stubs for local model deployments like Llama Guard 3 1B.

What this means for the comparator landscape

The first sections of this post laid out the three buckets of published benchmarks and argued that none of them measured what enterprises actually need for agent governance. The Phase 2 results above give us comparator numbers we can put alongside the published ones, all of them real, none of them simulated:

OPA-alone is measured in the low single-digit milliseconds at P95 when fronted by OPA-Envoy External Authorization. AREB's opa-rego bundle, with the OPA sidecar embedded in an actual agent-traffic path, sits at ~1 ms P95 because, as noted earlier, it calls OPA directly over HTTP and skips the Envoy hop.
AI gateways report single-digit-millisecond figures for pure proxy hops with policy out of path. AREB's allow-all bundle, the closest analogue (single PASS rule, no real policy work), adds 0.07 ms at single-client concurrency, taking the sum of the PEP-A and PEP-B P95 figures. AREB-with-no-policy is comparable to AI gateway proxy figures, and AREB with policy adds bounded, predictable overhead on top: at most ~1.3 ms across everything we measured in-process.
Guardrails report single-classifier figures from Lakera's <50 ms marketing up to Cloudflare's ~500 ms with Llama Guard 3 8B in path, with NeMo's five-rail parallel orchestration around ~0.5 seconds. The AREB methodology is designed so a real classifier adds its inference latency plus single-digit milliseconds of framework. Phase-2 v2 (with real Llama Guard 3 1B) will measure this empirically and is expected to land near 167 ms PEP-A (≈165 ms classifier + ≈2 ms framework, subject to the GIL caveat above). That is a projection, not a measurement, and we are deliberately not publishing a headline number for it until we've measured it.

The category AREB occupies — paired pre-action and post-action enforcement on a realistic agent workload, with all costs accounted for on the critical path — now has its first published reference numbers. We invite reproductions.

Phase 4: from per-step cost to per-trajectory cost

Everything above measures the cost of enforcing a single step. But the production unit isn't a step — it's a trajectory. An autonomous agent answering one user request plans, calls a tool, reads the result, calls another, synthesizes, and replies. If the enforcement layer sits on every step — and it must, because non-bypassable enforcement is the entire point — then the cost a user actually pays is the per-step cost multiplied by the number of steps.

A naive reading of the per-step numbers would simply multiply: an intent-bearing bundle at, say, ~20 ms per step, run across a six-step trajectory, is ~120 ms of enforcement before you have accounted for anything else. That product is real, and it is precisely the number that would make a latency-sensitive team turn enforcement off. So the per-step result is necessary but not sufficient. Phase 4 extends AREB from steps to trajectories and models the two mechanisms that keep trajectory overhead from growing linearly with length: caching and decision propagation.

Caching: what is stable across a trajectory. Within a single trajectory, most of pre-action enforcement is redundant. The agent's identity and role do not change between steps, so the RBAC verdict can be computed once and reused. The user's intent is usually stable across the internal steps that serve one request — so the intent classifier, the single most expensive stage by an order of magnitude, need not run on every step. Output-side work is different: PII redaction and tool-call inspection examine fresh content at each step and can never be cached. AREB therefore defines a trajectory-scoped verdict cache that memoizes the identity, RBAC, and intent verdicts while per-step content checks always run.

The important property falls straight out of this: the one cost that scales with the length of a trajectory — the classifier — is also the one cost that caches cleanly across it. Caching here does not shave a few percent off; it changes the shape of the curve, from grows with every step to paid once.

But caching intent is not free, and we refuse to pretend it is. A cached intent verdict assumes intent does not drift. A trajectory that begins benign and turns adversarial at step seven — because a prompt-injection arrived in a tool result, say — would, under unconditional caching, never be re-examined. That is a security regression dressed up as a performance win. So AREB treats caching as a spectrum with an explicit correctness cost. At one end, no caching re-runs the classifier every step (maximum cost, maximum freshness). At the other, full caching classifies once (minimum cost, maximum staleness risk). Between them sits a bounded-staleness policy that re-classifies every K steps or whenever a trigger fires — a tool call, a risk signal, or content that has diverged too far from what was last classified. The benchmark requires the caching policy to be disclosed and reported as an ablation, so the overhead saved is always visible next to the freshness given up.
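The three points on that caching spectrum can be compared by counting how many times the intent classifier runs over one trajectory. The sketch below is an illustration under assumptions: the policy names "none", "full", and "bounded", the parameter k, and the zero-indexed trigger_steps set are hypothetical, and the ~20 ms classifier cost is the illustrative per-step figure used earlier in this section, not a measurement.

```python
def classifier_calls(steps: int, policy: str, k: int = 3,
                     trigger_steps=frozenset()) -> int:
    """Count intent-classifier invocations over a trajectory of `steps` steps."""
    if policy == "none":        # re-classify on every step
        return steps
    if policy == "full":        # classify once, reuse for the whole trajectory
        return 1 if steps else 0
    if policy == "bounded":     # re-classify every k steps or when a trigger fires
        if k < 1:
            raise ValueError("k must be at least 1")
        calls, steps_since = 0, k   # forces a classification on the first step
        for step in range(steps):
            if steps_since >= k or step in trigger_steps:
                calls += 1
                steps_since = 0
            steps_since += 1
        return calls
    raise ValueError(f"unknown caching policy: {policy}")


CLASSIFIER_MS = 20.0  # illustrative per-call cost from the text, not measured
six_step_cost_ms = {
    policy: classifier_calls(6, policy) * CLASSIFIER_MS
    for policy in ("none", "full", "bounded")
}
# -> {"none": 120.0, "full": 20.0, "bounded": 40.0}
```

A trigger at the step where a suspicious tool result arrives forces a fresh classification under the bounded policy, which is the case full caching would miss entirely.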

Decision propagation. Decisions also flow forward. A trajectory carries a context object into which early stages write their findings — a role, a risk score, a cached intent label — so later stages reuse rather than recompute them. Propagation also governs termination: a DENY, QUARANTINE, or BLOCK at any step ends the trajectory, so the steps that would have followed are never enforced — or executed — at all. Early termination is both a safety property and a source of savings, and AREB reports it as on or off.
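Decision propagation and early termination can be sketched with a shared context object passed through each step. The names TrajectoryContext, run_trajectory, and the example steps are hypothetical; the terminal decision labels are the three named in the paragraph above.

```python
from dataclasses import dataclass, field

TERMINAL_DECISIONS = {"DENY", "QUARANTINE", "BLOCK"}


@dataclass
class TrajectoryContext:
    """Carried across steps so later stages reuse earlier findings."""
    findings: dict = field(default_factory=dict)   # e.g. role, risk score, intent label
    decisions: list = field(default_factory=list)


def run_trajectory(steps, ctx: TrajectoryContext, early_termination: bool = True) -> int:
    """Run each step with the shared context; return how many steps executed."""
    executed = 0
    for step in steps:
        decision = step(ctx)
        ctx.decisions.append(decision)
        executed += 1
        if early_termination and decision in TERMINAL_DECISIONS:
            break  # later steps are neither enforced nor executed
    return executed


def check_role(ctx):
    ctx.findings["role"] = "analyst"   # written once, read by later steps
    return "ALLOW"


def inspect_tool_call(ctx):
    # Reuses the propagated role instead of recomputing it.
    return "BLOCK" if ctx.findings.get("role") != "admin" else "ALLOW"


def synthesize(ctx):
    return "ALLOW"


ctx = TrajectoryContext()
steps_run = run_trajectory([check_role, inspect_tool_call, synthesize], ctx)
```

Running the same trajectory with early_termination=False gives the "off" ablation, so the steps saved by termination can be reported directly.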

What we have built, and what we are not publishing yet. We have implemented the trajectory harness and run a synthetic pilot to confirm the metrics behave as designed — trajectory overhead under each caching policy, intent-classifier calls per trajectory, early-termination rate, and amortized per-step cost. The pilot validates the qualitative story end to end: caching collapses trajectory overhead from a quantity that scales with length to one that effectively does not, and bounded staleness recovers most of that saving while preserving the ability to catch intent drift. But, consistent with the discipline of this update, the pilot runs against the synthetic upstream and the stubbed classifier — so we are deliberately not publishing trajectory latency numbers here. Same rule as the classifier bundles: no simulated figure in the headline. The trajectory numbers will land with Phase-4 v2, paired with the real-classifier swap, as the first end-to-end per-trajectory governance figures.

What's next

Bundle catalogue contributions. The interesting future additions are real-classifier-backed bundles for specific industries — financial services PII, healthcare PHI, EU AI Act compliance. Each is a small, well-defined contribution.

Algedonic.ai is building the policy enforcement layer for autonomous AI agents in regulated environments. AREB is the benchmark we wished existed when we started.
