Measuring What Matters: A New Standard for AI Runtime Enforcement
Introducing AREB — the Algedonic Runtime Enforcement Benchmark — and the numbers that prove it works.
Look at the published benchmarks for AI infrastructure today.
Kong's marketing for its AI Gateway cites 28,000+ RPS and sub-10 ms overhead, though the published benchmark itself measures P95 latency at ~24 ms and P99 at ~30 ms against a WireMock'd OpenAI. TensorZero's is tighter — sub-millisecond P99 (~0.94 ms) at 10,000 QPS, with the load generator co-located on the same instance as the gateway and the mock LLM. TrueFoundry and LiteLLM publish gateway-overhead numbers in the single-digit-millisecond range (Portkey's own latency claim is in that neighborhood too, though they don't publish a first-party benchmark methodology — only a GitHub README tagline and a blog that quotes "sub-10 ms" and "<40 ms" on the same page). The Open Policy Agent docs report per-decision evaluation times in the tens of microseconds — roughly 36 µs median and 134 µs P99 from opa bench on a sample RBAC policy. A third-party arXiv benchmark measures Llama Guard 3 1B at ~165 ms per call on A30 GPUs. Lakera markets sub-50 ms guardrails.
All those numbers are real, and most are honestly measured. And none of them tell you what it actually costs to govern an AI agent in production.
Here is the dirty secret: every one of those benchmarks measures a part of the problem in isolation, with the other parts disabled, and frames it as a complete answer. Kong's published methodology configures the gateway in proxy-only mode — Kong specifically calls out caching and API-key authentication as disabled. TensorZero's disables observability for comparability with LiteLLM (they note their async observability "wouldn't materially affect" latency, which is plausible but unmeasured in the published number). Guardrail vendors publish single-classifier latency for one input rail at a time. OPA measures policy decisions on structured JSON inputs — it has no idea what a prompt is, what a tool call is, what an agent's plan looks like.
If you are responsible for putting autonomous AI agents into production at a bank, a hospital, an insurer, or any other regulated environment, the number you need is not "how fast does the proxy hop?" or "how fast does the classifier run?" The number you need is: what does it cost, end to end, to enforce policy on every step of an agent's trajectory — pre-action and post-action, with observability and audit emission turned on, on traffic that mixes clean prompts with PII, tool calls, and jailbreak attempts?
That number does not exist in the published literature. Which is why we built AREB — the Algedonic Runtime Enforcement Benchmark.
The three buckets of incomplete benchmarks
The benchmarking gap becomes obvious when you lay the three current categories of AI infrastructure side by side.
1. Policy decision engines
Engines like Open Policy Agent are extraordinary at what they do. opa bench on OPA's sample RBAC policy reports decision times in the tens of microseconds — roughly 36 µs median and 134 µs P99. Deployed as a sidecar via OPA-Envoy External Authorization, third-party measurements (Solo.io's 2021 Performance tuning for ExtAuth using OPA) put the cost in the low single-digit milliseconds at P95 with a NOP policy, and the OPA docs themselves have historically noted that adding a NOP-policy authz sidecar at least doubles tail latency between p90 and p99.
But OPA's input is a structured JSON document; its Rego policy evaluates against fields, not semantics. OPA does not look at prompts, reason about agent plans, validate tool calls, or inspect model output. It answers a precise structured question precisely.
2. AI gateways
Gateways measure the cost of being a proxy between a client and an LLM. Their published numbers highlight a clear limitation:
| Gateway | Published number | Methodology highlight | Policy in path? |
|---|---|---|---|
| Kong AI Gateway | Marketing: 28k+ RPS, sub-10 ms overhead. Benchmark: P95 ~24 ms / P99 ~30 ms | K6, 400 VUs, c5.4xlarge, WireMock OpenAI; proxy-only config (no caching, no API-key auth) | No (proxy-only) |
| TensorZero | <1 ms P99 (~0.94 ms) @ 10k QPS | c7i.xlarge, observability disabled, load-gen co-located with gateway and mock LLM | No (logging off) |
| TrueFoundry | 3–5 ms overhead @ 250 RPS | LLM-Locust, 1 vCPU / 1 GB, fake OpenAI endpoint | Not stated |
| LiteLLM | 8 ms P95 overhead @ ~1,170 RPS (4 instances) | Locust, 4 vCPU / 8 GB per instance, callbacks empty, no Redis | Not stated |
| Portkey | Self-claim <1 ms (GitHub) / sub-10 ms (blog) | No first-party benchmark methodology | Not stated |
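No cell in that "Policy in path?" column says "Yes": two gateways are explicitly benchmarked with policy or logging out of the path, and the other three do not say. These are the right numbers to publish if what you ship is a proxy. They are misleading numbers if what you actually ship is an enforcement layer.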
3. Guardrail and content-safety products
Products like NeMo Guardrails, Llama Guard, Lakera, Azure AI Content Safety, AWS Bedrock Guardrails, Cloudflare AI Gateway, and IBM Granite Guardian measure a single classifier in isolation, or a bundle of parallel input rails. NVIDIA publishes the most concrete number in this space: NeMo Guardrails orchestrating up to five GPU-accelerated guardrails in parallel adds ~0.5 seconds of latency. The rest range from Lakera's <50 ms marketing claim to Microsoft's documented 100–300 ms per request for sequential Azure AI Content Safety filters to Cloudflare's documented ~500 ms per request when Llama Guard 3 8B runs on Workers AI in the AI Gateway path. These are useful numbers if you are buying one guardrail to slot in front of one model.
Now here is the question none of those numbers answer: What does it cost to do all three at once on the same request, on every step of an agent's trajectory, with observability on?
The missing shape: end-to-end agent governance
An autonomous agent in a regulated environment doesn't make one LLM call and stop. It plans. It calls a tool. It reads the result. It calls another tool. It synthesizes. It returns to the user. Each step is an opportunity for the agent to drift outside policy — to exfiltrate regulated data, to invoke a tool it shouldn't, to act on a prompt-injection that arrived in a tool result.
Governing this means enforcing policy at two places on every step:
PEP-A — pre-action enforcement runs before the model call (or before the tool invocation). It answers: Can this agent call this model? Is the prompt about to leave its trust zone? Does it contain regulated data? Is this action allowed for this role at this risk level? It combines structured policy decisions (the OPA bucket) with semantic ones (the guardrail bucket) — intent classification, capability check, PII screening, all together, on the critical path, before the request egresses.
PEP-B — post-action enforcement runs after the model returns, or before a tool side-effect lands. It answers: Is the output leaking PII? Is the agent about to invoke an unauthorized tool with arguments outside the allowed schema? Is the content within policy? Should this be redacted, blocked, or audited?
To govern an agent end-to-end you need both of these enforcement points active on every step. You need the audit trail. You need the decision distribution — how often did the layer allow, how often did it redact, how often did it block, broken out by policy bundle.
That is what no existing benchmark measures. It is also what enterprises are actually buying when they buy "AI governance."
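To make the two enforcement points concrete, here is a minimal sketch of one governed agent step. It is illustrative only: `Decision`, `pep_a`, `pep_b`, `governed_step`, and the toy SSN regex are hypothetical names chosen for this example, not AREB's or any product's API, and a real pre-action check would combine many more signals.

```python
import re
import time
from dataclasses import dataclass

# Toy pattern standing in for a real PII detector (assumption for illustration).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Decision:
    action: str              # one of: allow, warn, redact, block
    payload: str             # text to pass on, possibly rewritten
    elapsed_ms: float = 0.0  # filled in by timed()


def pep_a(prompt, role, allowed_roles):
    """Pre-action check: a structured rule (role) plus a PII screen."""
    if role not in allowed_roles:
        return Decision("block", "")
    if SSN_RE.search(prompt):
        return Decision("redact", SSN_RE.sub("[REDACTED]", prompt))
    return Decision("allow", prompt)


def pep_b(output):
    """Post-action check: screen the model output before it is returned."""
    if SSN_RE.search(output):
        return Decision("redact", SSN_RE.sub("[REDACTED]", output))
    return Decision("allow", output)


def timed(check, *args):
    """Run a check and record its wall-clock cost in milliseconds."""
    start = time.perf_counter()
    decision = check(*args)
    decision.elapsed_ms = (time.perf_counter() - start) * 1000.0
    return decision


def governed_step(prompt, role, allowed_roles, call_model, audit_log):
    """One agent step with both enforcement points and audit emission on."""
    pre = timed(pep_a, prompt, role, allowed_roles)
    audit_log.append(("pep_a", pre.action, pre.elapsed_ms))
    if pre.action == "block":
        return None  # the request never reaches the model
    output = call_model(pre.payload)
    post = timed(pep_b, output)
    audit_log.append(("pep_b", post.action, post.elapsed_ms))
    return None if post.action == "block" else post.payload
```

Each stage is timed separately and always writes an audit record, which is the shape the rest of this post measures: per-stage cost plus a decision class for every request.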
Introducing AREB
The Algedonic Runtime Enforcement Benchmark — AREB — is the methodology for filling that gap.
AREB defines:
Two enforcement points (PEP-A pre-action, PEP-B post-action) that every conformant implementation must instrument separately.
Four test flows: baseline, PEP-A only, PEP-B only, and full (PEP-A + PEP-B), so the per-stage cost can be isolated.
Eight policy bundles spanning the realistic complexity range — from allow-all (the framework floor) through RBAC, regex PII, OPA/Rego, intent classification, multi-signal risk scoring, tool-call inspection, and full audit with structured telemetry.
A structured trace schema that every request must emit, with per-stage timings (intent analysis, policy evaluation, redaction, audit emission) and a recorded decision class (allow, warn, redact, block).
A required prompt mix — 70% clean prompts, 10% PII-laden prompts, 10% tool-call requests, 5% jailbreak attempts, 5% long-context prompts — so the decision distribution under measurement is non-degenerate.
A concurrency sweep from 1 to 250 concurrent clients with throughput-vs-saturation reported.
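As a rough illustration of the required prompt mix, the sketch below draws prompt classes with the stated 70/10/10/5/5 weights. The class labels and the function name `sample_prompt_classes` are assumptions made for this example; how the AREB load generator actually selects prompts is not described here.

```python
import random
from collections import Counter

# Prompt-class shares in percent, taken from the mix stated above.
# The class labels themselves are hypothetical.
PROMPT_MIX_PCT = {
    "clean": 70,
    "pii": 10,
    "tool_call": 10,
    "jailbreak": 5,
    "long_context": 5,
}


def sample_prompt_classes(n, seed=0):
    """Draw n prompt classes at random, weighted by PROMPT_MIX_PCT."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    classes = list(PROMPT_MIX_PCT)
    weights = [PROMPT_MIX_PCT[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n)


if __name__ == "__main__":
    counts = Counter(sample_prompt_classes(10_000))
    for name, count in counts.most_common():
        print(f"{name:>12}: {count / 10_000:.1%}")
```

Seeding the generator keeps the mix identical across bundles, so differences in the decision distribution come from the policy rather than from the traffic.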
The headline metric is Algedonic Overhead = Governed end-to-end latency − Baseline end-to-end latency, reported at P50, P95, and P99 in absolute milliseconds, alongside the per-stage breakdown and the decision-class distribution per policy bundle.
Critically, AREB reports overhead in absolute milliseconds, not as a percentage of total request time. Percentage overhead is misleading when LLM inference dominates the request: adding 25 ms to an 800 ms LLM call reads as roughly "3% overhead," a figure that hides what those 25 ms actually buy you (or cost you).
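The headline metric can be sketched in a few lines. Two things here are assumptions rather than part of the definition above: the nearest-rank percentile method, and taking the difference of percentiles (rather than the percentile of per-request differences). The function names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def algedonic_overhead(governed_ms, baseline_ms, percentiles=(50, 95, 99)):
    """Governed minus baseline end-to-end latency, in absolute milliseconds."""
    return {
        p: percentile(governed_ms, p) - percentile(baseline_ms, p)
        for p in percentiles
    }


if __name__ == "__main__":
    # The worked example from the text: 25 ms added to an 800 ms LLM call.
    baseline = [800.0] * 100
    governed = [825.0] * 100
    overhead = algedonic_overhead(governed, baseline)
    print(overhead)                                   # absolute milliseconds
    print(f"{overhead[50] / baseline[0]:.1%} of the request")
```

The same 25 ms would read as a much larger percentage against a fast upstream and a much smaller one against a slow upstream, which is why the absolute figure is the one that compares across runs.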
What AREB explicitly disallows
If AREB is going to function as a standard rather than another vendor benchmark, it has to define what disqualifies a result. AREB explicitly prohibits:
Disabling observability, logging, or audit emission to lower the headline figure. TensorZero's published number disables observability, and Kong's is measured in a proxy-only configuration with caching and API-key authentication off. They are within their rights to do so, but AREB does not permit it. Audit-on is the production case.
Reporting a NOP policy bundle as the headline result. NOP-policy runs are useful for isolating framework overhead, but they must be labelled and presented as a separate row, not as "the product number."
Co-locating the load generator and the gateway on the same host without disclosing it. Where co-location is unavoidable, it must be disclosed and the resulting proxy-hop component of the figure must be qualified as excluding network.
Reporting only median latency. P95 and P99 are required.
Reporting only percentage overhead. Absolute milliseconds at every percentile are required, because a percentage shrinks toward zero whenever LLM inference dominates the request time.
A run that fails any of these rules is not an AREB run. The integrity of cross-vendor comparison depends on this strictness.
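One way to picture these rules is as a checklist applied to a run's metadata. The sketch below is hypothetical: the manifest field names and the function `areb_violations` are invented for illustration and are not a published AREB schema.

```python
REQUIRED_PERCENTILES = {50, 95, 99}


def areb_violations(run):
    """Return the list of disqualifying conditions found in a run manifest.

    `run` is a plain dict; every key used here is a hypothetical field name.
    """
    violations = []
    if not (run.get("observability_enabled") and run.get("audit_enabled")):
        violations.append("observability or audit emission disabled")
    if run.get("headline_is_nop_policy"):
        violations.append("NOP policy bundle reported as headline")
    if run.get("loadgen_colocated") and not run.get("colocation_disclosed"):
        violations.append("undisclosed load-generator co-location")
    missing = REQUIRED_PERCENTILES - set(run.get("percentiles_reported", ()))
    if missing:
        violations.append(f"missing percentiles: {sorted(missing)}")
    if not run.get("absolute_ms_reported"):
        violations.append("overhead not reported in absolute milliseconds")
    return violations


if __name__ == "__main__":
    median_only = {
        "observability_enabled": True,
        "audit_enabled": True,
        "percentiles_reported": [50],
        "absolute_ms_reported": True,
    }
    print(areb_violations(median_only))
```

An empty list means the run passes every rule in this section; anything else means it is not an AREB run.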
The real-world data: Phase 2 production results
While Phase 1 validated the framework against a mock LLM, Phase 2 ran the same protocol against a real production endpoint under heavy load.
Test parameters
Environment: AWS EC2 instance in us-east-2 (same region as the OpenAI endpoint to minimize regional network variance).
Upstream model: OpenAI's gpt-4o-mini.
Traffic volume: 200,000 total requests swept across 8 policy bundles and 7 concurrency levels (1, 5, 10, 25, 50, 100, 250) over a 12-hour window.
Total OpenAI cost: $32.04.
System integrity: Full observability and audit emission active on every run — no configuration flags disabled.
Account tier: sufficient to avoid explicit 429s across the entire sweep. Lower-tier accounts running the same protocol would see significant throttling at concurrency ≥ 100, which the AREB loadgen handles via Retry-After backoff. The methodology supports it, but the numbers below were measured without it.
Zero 429 rate-limit failures. OpenAI's effective throughput ceiling sat somewhere between concurrency 100 and 250 — the endpoint queued rather than rejecting — but the harness never bottlenecked.
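For readers reproducing the sweep on a lower account tier, the backoff step might look like the sketch below. It only shows how a `Retry-After` header value can be turned into a wait time, using the two forms HTTP allows (a number of seconds or an HTTP date). The function name, the default, and the cap are assumptions, and this is not the AREB loadgen's actual code.

```python
import email.utils
from datetime import datetime, timezone


def retry_after_seconds(header_value, now=None, default=1.0, cap=60.0):
    """Convert a Retry-After header value into a bounded delay in seconds."""
    if header_value is None:
        return default  # no hint from the server: fall back to a fixed wait
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        delay = float(value)  # delay-seconds form, e.g. "7"
    else:
        try:
            when = email.utils.parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return default  # unparseable value: fall back
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return min(max(delay, 0.0), cap)


if __name__ == "__main__":
    print(retry_after_seconds("7"))
    print(retry_after_seconds(None))
```

Time spent waiting on a throttled upstream belongs to the upstream, not to the enforcement layer, so a harness that retries this way should keep it out of the PEP-A and PEP-B timings.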
The headline production table
| Bundle | PEP-A P95 (range across c=1..250) | PEP-B P95 (range across c=1..250) | Decisions |
|---|---|---|---|
| allow-all | 0.03 ms | 0.04 ms | 100% allow |
| rbac | 0.05–0.06 ms | 0.03 ms | 100% allow |
| tool-call-inspection | 0.04 ms | 0.04 ms | 100% allow |
| regex-pii | 0.58 ms (after warm-up) | 0.39–0.41 ms | 87.6% allow · 12.4% redact |
| opa-rego | 0.97–1.27 ms | 0.03 ms | 100% allow |
| intent-classification | 20.42–20.56 ms | 0.03 ms | 96.6% allow · 3.4% block |
| full-audit | 20.37–20.73 ms | 0.18–0.23 ms | 89.4% allow · 7.3% redact · 3.3% block |
| multi-signal-risk | 25.37–25.71 ms | 0.15–0.18 ms | 96.6% allow · 3.4% block |
Read the per-bundle range as the variation across the seven concurrency levels. The variation within each row is under one millisecond.
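The sub-millisecond claim is easy to check from the table itself. The short script below recomputes the spread (highest minus lowest P95 across concurrency levels) for each bundle using the PEP-A figures as published above; single-value cells are entered as a zero-width range. The variable and function names are just for this example.

```python
# PEP-A P95 (low, high) in milliseconds, copied from the table above.
PEP_A_P95_RANGE_MS = {
    "allow-all": (0.03, 0.03),
    "rbac": (0.05, 0.06),
    "tool-call-inspection": (0.04, 0.04),
    "regex-pii": (0.58, 0.58),
    "opa-rego": (0.97, 1.27),
    "intent-classification": (20.42, 20.56),
    "full-audit": (20.37, 20.73),
    "multi-signal-risk": (25.37, 25.71),
}


def spread_ms(ranges):
    """Width of each bundle's P95 range, rounded to the table's precision."""
    return {bundle: round(high - low, 2) for bundle, (low, high) in ranges.items()}


if __name__ == "__main__":
    spreads = spread_ms(PEP_A_P95_RANGE_MS)
    for bundle, width in sorted(spreads.items(), key=lambda kv: -kv[1]):
        print(f"{bundle:>22}: {width:.2f} ms")
    print("widest:", max(spreads.values()), "ms")
```

The widest PEP-A spread is full-audit, and it is well under one millisecond.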
A note on total latency before we go further
The careful reader will notice the elephant in the room: if AREB enforcement is sub-30 ms but total request time is multi-second, something else is responsible for those multi-second numbers.
total_latency_ms is dominated entirely by OpenAI inference time. P50 across the eight bundles ranged from 3.0 to 6.6 seconds; P95 stretched from 10.6 to 32.6 seconds. Those are higher than the published gpt-4o-mini reference figures (which typically land at 400–800 ms median) because the AREB prompt mix is heavier than typical single-prompt benchmark traffic — it includes 5% long-context prompts at ~1,000 input tokens each and elicits 70+ output tokens per response on average. That mix is part of the methodology and is what produces realistic decision distributions on the post-inference side.
Critically, the variance is essentially identical across every bundle, including allow-all, which adds no enforcement work. The tail behaviour of gpt-4o-mini under sustained load is gpt-4o-mini's tail behaviour. AREB's job is to measure the marginal cost of enforcement on top of that — which is what pep_a_ms and pep_b_ms give us.
Five things the data tells us
1. Enforcement overhead is essentially independent of concurrency
The per-bundle range across c=1 to c=250 is under one millisecond for every bundle. opa-rego, the bundle with the most concurrency-sensitive component (HTTP round-trips to a real OPA sidecar), varies from 0.97 ms at c=50 to 1.27 ms at c=250 — small enough to be operational noise. This is the key result, because it means AREB enforcement cost is essentially a property of what the policy does, not how loaded the system is. It stays bounded under load, which is the property a CISO is buying when they buy an enforcement layer.
2. The framework floor is sub-millisecond
allow-all, rbac, and tool-call-inspection all sit at sub-tenth-of-a-millisecond P95 on both stages. That is the cost of having an enforcement layer turned on, with audit emission active, doing minimal policy work. Whatever the more complex bundles cost beyond that floor is policy work, and it is exactly what shows up in the higher-latency rows.
3. Real OPA in the path costs about 1 millisecond
The areb-opa-rego bundle calls a real Open Policy Agent sidecar via HTTP on every PEP-A invocation. We measured 0.97–1.27 ms P95. That's faster than the published OPA-Envoy External Authorization figures (Solo.io reports P95 ≈ 2.5–3 ms; the OPA docs note that adding a NOP-policy authz sidecar at least doubles tail latency between p90 and p99). The difference is the deployment model — we call OPA over HTTP from inside the same process rather than via Envoy's ext_authz gRPC interface. Either way, the number gives us a real OPA-cost data point measured inside the AREB methodology.
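For orientation, a direct HTTP query to an OPA sidecar can be sketched with OPA's Data API, which accepts a POST to `/v1/data/<policy path>` with the input document under an `input` key and returns the decision under `result`. This is a sketch, not the bundle's actual code: the function names, the `areb/allow` policy path, and the timeout are assumptions.

```python
import json
import urllib.request


def build_opa_request(base_url, policy_path, input_doc):
    """Build (but do not send) a Data API query for one policy decision."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_opa(base_url, policy_path, input_doc, timeout_s=1.0):
    """Send the query and return OPA's result (None if the rule is undefined)."""
    request = build_opa_request(base_url, policy_path, input_doc)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response).get("result")


if __name__ == "__main__":
    # Hypothetical policy path and input; requires an OPA server on port 8181.
    req = build_opa_request(
        "http://localhost:8181", "areb/allow", {"role": "analyst", "tool": "search"}
    )
    print(req.get_method(), req.full_url)
```

A production caller would normally reuse a pooled connection rather than open a new one per decision, since connection setup can easily cost more than the policy evaluation itself.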
4. Classifier-bearing bundles isolate the framework floor — but flag the GIL
In this Phase-2 run, our intent-classification, full-audit, and multi-signal-risk bundles cluster around 20–26 ms PEP-A. To be entirely transparent, these specific loads rely on an async-yield simulation — calibrated to the 20–25 ms latency profile of lightweight production guardrails — to cleanly isolate our core framework's baseline overhead. The true framework residual here is remarkably lean, consuming only single-digit milliseconds.
However, there is an engineering nuance that systems purists will rightly call out: an async sleep timer consumes zero CPU cycles, meaning it completely bypasses CPython Global Interpreter Lock (GIL) contention under heavy concurrency. In an I/O-dominated regime where workers spend roughly 3 seconds waiting on an upstream OpenAI response, a single-process event loop handles concurrency flawlessly.
But once you introduce a real, heavy, CPU-bound on-host classifier model, request serialization at the GIL will manifest. Our internal closed-form mathematical modeling predicts that when localized policy CPU work pushes into the 5–10 ms range (the regime of complex regex catalogs or nested JSON schemas), a single-process asyncio architecture will degrade super-linearly under load.
To maintain a flat, linear latency floor in true high-concurrency production environments, moving to a multi-process worker pool (such as Gunicorn fronting multiple Uvicorn workers) becomes an architectural necessity. We are calling out this limitation transparently because the integrity of a benchmark standard depends on it. Phase-2 v2 will empirically map this exact boundary by swapping out the stubs for local model deployments like Llama Guard 3.
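The difference between a sleep-based stub and real CPU work can be seen in a few lines of asyncio. The sketch below is a toy, not our harness: `stub_classifier` and `cpu_classifier` are hypothetical stand-ins, and the busy loop simply burns wall-clock time. Strictly speaking, what it shows is a single event-loop thread being blocked; moving the same work onto threads would then run into the GIL.

```python
import asyncio
import time


async def stub_classifier(delay_s):
    """Async-yield stub: waits without using CPU, so other tasks run meanwhile."""
    await asyncio.sleep(delay_s)


async def cpu_classifier(work_s):
    """CPU-bound stand-in: never yields, so it blocks the event loop while it runs."""
    deadline = time.perf_counter() + work_s
    while time.perf_counter() < deadline:
        pass


async def run_concurrently(make_task, n):
    """Start n tasks at once and return the total wall-clock time in seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(make_task() for _ in range(n)))
    return time.perf_counter() - start


def compare(n=20, per_request_s=0.01):
    """Time n concurrent requests with the stub, then with CPU-bound work."""
    stub_total = asyncio.run(run_concurrently(lambda: stub_classifier(per_request_s), n))
    cpu_total = asyncio.run(run_concurrently(lambda: cpu_classifier(per_request_s), n))
    return stub_total, cpu_total


if __name__ == "__main__":
    stub_total, cpu_total = compare()
    print(f"sleep stub: {stub_total * 1000:.1f} ms for 20 concurrent requests")
    print(f"cpu-bound:  {cpu_total * 1000:.1f} ms for 20 concurrent requests")
```

With the stub, total time stays close to one request's delay regardless of concurrency. With CPU-bound work, requests run one after another, so total time grows with the number of concurrent requests, which is the serialization effect described above.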
5. Decision distributions validate the prompt mix
200,000 requests through the AREB prompt mix produced realistic, non-degenerate decision distributions: regex-pii triggered a 12.4% redact rate, intent-classification produced a 3.4% block rate, full-audit showed 7.3% redact and 3.3% block. The benchmark exercised the redact and block paths in measurable proportions, which means the per-stage timings reflect the cost of policy work that includes non-allow paths — not just the green-path cost. A benchmark that only ever returns "allow" tells you nothing about the cost of the redact path or the block path.
What this means for the comparator landscape
The first sections of this post laid out the three buckets of published benchmarks and argued that none of them measured what enterprises actually need for agent governance. Phase 2 gives us comparator numbers we can put alongside the published ones:
OPA-alone is measured in the low single-digit milliseconds at P95 when fronted by OPA-Envoy External Authorization. AREB's areb-opa-rego bundle, with the OPA sidecar embedded in an actual agent-traffic path, sits at ~1 ms P95. The difference is the deployment model: we call OPA over HTTP from inside the same process rather than via Envoy's ext_authz gRPC interface, so we don't pay the Envoy hop.
AI gateways report single-digit-millisecond figures for pure proxy hops with policy out of path. AREB's allow-all bundle, the closest analogue (single PASS rule, no real policy work), adds 0.07 ms P95 (PEP-A + PEP-B combined) at single-client concurrency. AREB-with-no-policy is comparable to AI gateway proxy figures, and AREB with policy adds bounded, predictable overhead on top.
Guardrails report single-classifier figures from Lakera's <50 ms marketing up to Cloudflare's ~500 ms with Llama Guard 3 8B in path, with NeMo's five-rail parallel orchestration around ~0.5 seconds. The AREB methodology is designed so a real classifier adds its inference latency plus single-digit milliseconds of framework. Phase 2 v2 (with real Llama Guard 3 1B) will measure this empirically and is expected to land near ~167 ms PEP-A (≈165 ms classifier + ≈2 ms framework, subject to the GIL caveat above). That's the contract this methodology will publish.
The category AREB occupies — paired pre-action and post-action enforcement on a realistic agent workload, with all costs accounted for on the critical path — now has its first published reference numbers. We invite reproductions.
What's next
Phase-3 — model coverage. Run the same sweep against Anthropic Claude, a self-hosted Llama-class model on vLLM, and gpt-4o (the larger sibling of gpt-4o-mini). The methodology is upstream-agnostic; the comparator numbers across models will be interesting.
Phase-4 — multi-step agent trajectories. The current sweep measures per-step overhead. The real production case is a multi-step agent making multiple LLM and tool calls per user request. We need a methodology extension for trajectory-level overhead, with caching and decision propagation across steps explicitly modelled.
Bundle catalogue contributions. The eight bundles in v0.1 cover the policy-complexity range we cared about for v1. The interesting future additions are real-classifier-backed bundles for specific industries — financial services PII, healthcare PHI, EU AI Act compliance. Each is a small, well-defined contribution.
Algedonic.ai is building the policy enforcement layer for autonomous AI agents in regulated environments. AREB is the benchmark we wished existed when we started.

