Agent Reliability Engineering: SRE for AI

Agent demos that work in slides routinely fall over in production. An SRE-style framework for AI agents: SLIs, evals, failure modes, and a maturity model.

Most agent failures look like ML problems. They are actually operations problems — and the discipline that solves them already exists.

A working demo of an AI agent is misleadingly easy to build. A reliable agent running in production for thousands of users, every day, against real data with real consequences, is one of the harder things in modern software engineering. The gap is not closing on its own. Capability improvements in frontier models have produced only modest reliability gains. Bigger models do not automatically make agents more reliable. Better engineering does.

The good news, if you like problems with well-trodden solutions, is that most of that better engineering is not new. It is Site Reliability Engineering, applied to a system whose failure modes are statistical and depend on prompts and tool outputs instead of CPU and disk. What is missing is the translation: which SRE concepts carry over, which need adjustment, and which agent-specific failures do not fit the old taxonomy.

Why agent reliability is a new problem

Traditional software fails deterministically. Same input, same output, same bug every time. ML systems fail probabilistically — same input, different output, drift over time. Agents fail in a third way that is worse than either.

A modern agent makes multiple LLM calls and tool calls per task, often in a loop, often with state that accumulates across calls. Each call is probabilistic. Each tool integration can fail independently. Errors compound multiplicatively along the trajectory: a ninety-five-percent per-step success rate over ten steps is a sixty-percent trajectory success rate. The failure surface is the cross product of model behavior, tool surface, user inputs, surrounding context, and the time-of-day randomness of every external API the agent touches.
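
The arithmetic behind that claim is worth making concrete. A two-line sketch in plain Python, with illustrative rates:

```python
# Per-step reliability compounds multiplicatively along a trajectory.
def trajectory_success(per_step_rate: float, steps: int) -> float:
    return per_step_rate ** steps

print(trajectory_success(0.95, 10))  # ~0.599: the 60% figure above
print(trajectory_success(0.99, 10))  # ~0.904: even 99% per step leaks ~10%
```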

Public examples of this gap are easy to find in the news: coding agents that have deleted production databases, shopping agents that have placed unauthorized orders, customer-service chatbots that have given legally incorrect advice for months before anyone caught it. In each case the system had passed internal evaluation. In each case the demo was fine. In each case production was where the agent met inputs the eval suite had never seen.

The point is not that agents are dangerous. The point is that the SRE-style framing — design for failure, instrument before you trust, define your service levels before someone else defines them for you — is the right framing, and most teams shipping agents in 2025 do not yet apply it.

The failure-mode taxonomy

The first useful step is to name the failure modes specific to agentic systems. The traditional LLM failure list (hallucination, refusal, prompt injection) is necessary but not sufficient. Agents fail in additional ways that only emerge across multi-step trajectories:

  • Tool misuse. The agent calls the right tool with wrong arguments, calls the wrong tool, or fails to handle a tool error and continues as if the call succeeded. This is the most common production failure mode, and the most insidious — a single bad argument at step two silently corrupts every step that depends on its output.
  • Hallucinated tools. The agent invents a tool name or function signature that does not exist, then enters a retry loop trying variants. Tool-routing hallucination is often more expensive than output hallucination, because it burns context and cost without ever calling a real system.
  • Context loss. In multi-turn workflows the agent loses track of facts, constraints, or decisions established earlier. Symptoms look reasonable in isolation — only wrong relative to a context the evaluator also has to see.
  • Retry loops. A tool returns an error; the agent retries identically; gets the same error; retries again, sometimes for dozens of iterations. Bills arrive.
  • Drift under pressure. When users push back, RLHF-tuned models tend to capitulate. A polite but firm “are you sure?” can flip the answer — a real production risk in agents that handle approvals or policy decisions.
  • Plan corruption. The agent forms a plan early, learns something later that should invalidate it, and continues executing the original plan anyway.

A reliability-focused engineering effort treats each of these as a failure class to be detected, not a quirk to be tolerated. The detection layer matters as much as the prevention layer.
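
To make "detected, not tolerated" concrete, here is a minimal sketch of two trajectory-level detectors, one for retry loops and one for ignored tool errors. The `Step` shape and the thresholds are assumptions for illustration, not a standard trace format:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict
    error: str | None = None  # None if the tool call succeeded

def has_retry_loop(trace: list[Step], max_identical: int = 3) -> bool:
    """Flag a run of identical failing calls: same tool, same args, still erroring."""
    run = 1
    for prev, cur in zip(trace, trace[1:]):
        identical_failure = (cur.error is not None and prev.error is not None
                             and cur.tool == prev.tool and cur.args == prev.args)
        run = run + 1 if identical_failure else 1
        if run >= max_identical:
            return True
    return False

def ignored_tool_error(trace: list[Step]) -> bool:
    """Crude heuristic: a call failed and the agent moved on to a different
    tool without ever retrying or handling the failure."""
    for prev, cur in zip(trace, trace[1:]):
        if prev.error is not None and cur.tool != prev.tool:
            return True
    return False
```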

Designing SLIs and SLOs for agents

The single most useful thing you can do for an agent project is to define service levels before the demo is greenlit. The translation from SRE looks like this.

Service Level Indicators. Pick a small set of measurable signals that map to “is the agent doing its job?” The list that survives contact with production usually includes:

  • Task success rate: the percentage of trajectories that achieve the intended outcome, scored by a deterministic check, a human, or an LLM judge with calibrated agreement.
  • Tool-call success rate, broken down by tool.
  • Trajectory length, with a hard cap and special attention to the tail (retry loops show up here).
  • Cost per task and time to resolution.
  • Escalation rate: what share of tasks hit a guardrail, hand off to a human, or trigger a fallback.
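
A sketch of what computing these SLIs over a batch of trajectory records can look like; the record shape is an assumption, so substitute whatever your tracing layer actually emits:

```python
from collections import defaultdict
from statistics import quantiles

def compute_slis(records: list[dict]) -> dict:
    """Aggregate SLIs from trajectory records.

    Assumed record shape, for illustration only:
    {"success": bool, "steps": int, "cost_usd": float, "escalated": bool,
     "tool_calls": [{"tool": str, "ok": bool}, ...]}
    """
    n = len(records)
    ok, total = defaultdict(int), defaultdict(int)
    for r in records:
        for call in r["tool_calls"]:
            total[call["tool"]] += 1
            ok[call["tool"]] += call["ok"]
    steps = sorted(r["steps"] for r in records)
    return {
        "task_success_rate": sum(r["success"] for r in records) / n,
        "tool_call_success_rate": {t: ok[t] / total[t] for t in total},
        "p95_trajectory_length": quantiles(steps, n=20)[-1],  # watch the tail
        "mean_cost_per_task": sum(r["cost_usd"] for r in records) / n,
        "escalation_rate": sum(r["escalated"] for r in records) / n,
    }
```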

Service Level Objectives. Turn each SLI into a target. The hardest part is being honest about the numbers. “Ninety-nine percent task success” sounds great in a slide and is unachievable in any open-domain agent shipping today. The pattern that works is to define SLOs per workflow, not per agent — a refund-processing agent might commit to ninety-eight-percent policy compliance; a research-summary agent might commit to ninety-percent factual accuracy with citations. The target matters less than the discipline of measuring against it weekly and treating violations as incidents rather than tuning opportunities.

Error budgets. The SRE translation works almost unchanged. If your SLO is ninety-five-percent task success and you are at ninety-two percent this week, you have spent your error budget. New features pause; reliability work takes priority. Without this rule, agent quality drifts downward indefinitely and nobody can say when it started.
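
The arithmetic is trivial; the leverage is in wiring it into the planning cadence. A sketch of the budget check, using the numbers above:

```python
def error_budget_remaining(slo: float, observed_success: float) -> float:
    """Fraction of the window's error budget left; negative means overspent."""
    budget = 1.0 - slo              # allowed failure rate for the window
    spent = 1.0 - observed_success  # observed failure rate
    return 1.0 - spent / budget

# The example above: a 95% task-success SLO with 92% observed this week.
print(error_budget_remaining(slo=0.95, observed_success=0.92))  # ~ -0.6: overspent
```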

The eval harness

An agent without a continuously running evaluation harness is not engineering. It is performance art. The harness has two layers that work together.

Offline evals. A curated test set of representative tasks, run on every prompt change, model upgrade, and tool addition. Public benchmarks like τ-bench for tool-agent-user interaction, SWE-bench for real-world coding tasks, and the Holistic Agent Leaderboard provide off-the-shelf starting points and useful comparability. Most production teams supplement these with a domain-specific eval set built from real (anonymized) production traffic. The discipline that matters: evaluations should run as automated gates in CI, not as occasional research exercises.
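
What an automated gate can look like in practice is nothing fancier than a test that runs the curated set and fails the build below a threshold. A sketch: `run_agent` and `passes` are placeholders for your own harness, and the threshold is illustrative:

```python
import json
import pathlib

THRESHOLD = 0.90  # illustrative; derive the real gate from your SLO

def run_agent(task_input):
    ...  # placeholder: your agent's entry point

def passes(output, expected) -> bool:
    ...  # placeholder: deterministic check, human label, or calibrated judge

def test_offline_eval_gate():
    """Runs in CI on every prompt change, model upgrade, and tool addition."""
    lines = pathlib.Path("evals/cases.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines]
    rate = sum(bool(passes(run_agent(c["input"]), c["expected"]))
               for c in cases) / len(cases)
    assert rate >= THRESHOLD, f"eval gate failed: {rate:.1%} < {THRESHOLD:.0%}"
```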

Online evals. Production traffic, scored continuously. A combination of deterministic checks (did the agent comply with policy X?), regression detection (did the success rate drop after the model upgrade?), and sampled LLM-as-judge evaluation on representative traces. The tooling has matured fast — LangSmith, Arize Phoenix, Langfuse, and several others have made the basic plumbing easy; the principles work the same regardless of vendor.
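
The shape of the online side is the same regardless of vendor: deterministic checks on every trace, judge scoring on a sample. A minimal sketch; the trace fields and the `llm_judge` stub are placeholders:

```python
import random

SAMPLE_RATE = 0.05  # judge scoring on a 5% sample; deterministic checks on everything

def llm_judge(trace: dict) -> float:
    ...  # placeholder: your sampled LLM-as-judge call

def score_trace(trace: dict) -> dict:
    """Score one production trace; the assumed trace fields are illustrative."""
    scores = {
        "within_step_cap": trace["steps"] <= 25,          # deterministic, 100% of traffic
        "policy_ok": not trace.get("policy_violations"),  # deterministic, 100% of traffic
    }
    if random.random() < SAMPLE_RATE:
        scores["judge_score"] = llm_judge(trace)          # expensive, sampled
    return scores
```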

Two principles tie offline and online together. First, anything that can be measured offline should be. Online evaluation is for things you cannot pre-collect. Second, whenever an online failure surfaces a new failure class, add a test case for it to the offline suite. This is how the harness compounds: every production incident becomes a permanent regression test, and the eval suite turns from a thin starting point into a living artifact that captures everything the team has learned the hard way.

Fallback patterns and graceful degradation

The third SRE concept that carries over almost unchanged is graceful degradation. Real agents in production should have multiple fallback layers, not a single “did it work or not” path.

A working pattern, ordered from full capability to last-resort safety, has four levels (sketched in code below):

  • Primary path: the full agent with all tools, full context, frontier model.
  • Bounded autonomy: kicks in when the agent exceeds a step budget or cost budget, or hits a low-confidence threshold, and restricts the agent to a smaller toolset or a constrained workflow.
  • Human-in-the-loop: activates for high-stakes decisions such as refunds above a threshold, destructive operations, and policy edge cases. Requiring explicit approval here is not weakness; it is correct engineering for the current state of the technology.
  • Static fallback: when everything else fails, hand off to a deterministic workflow or to an “I cannot help with this; here is how to reach a human” path. Never to silent failure.
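
Here is the ladder as a sketch at the orchestration layer; every helper it calls is a placeholder for your own implementation:

```python
STEP_BUDGET = 25  # illustrative step budget

def handle_task(task):
    """The four-level ladder at the orchestration layer.

    run_full_agent, run_constrained_agent, is_high_stakes, await_human_approval,
    and static_fallback are placeholders for your own implementations.
    """
    try:
        result = run_full_agent(task)                    # level 1: primary path
        if result.low_confidence or result.steps > STEP_BUDGET:
            result = run_constrained_agent(task)         # level 2: bounded autonomy
        if is_high_stakes(task, result):
            result = await_human_approval(task, result)  # level 3: human-in-the-loop
        return result.response
    except Exception:
        return static_fallback(task)                     # level 4: never silent failure
```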

A short, concrete checklist for the fallback layer, drawn from teams I have worked with:

  • Every tool invocation has a timeout.
  • Every tool error gets exponential backoff with a cap.
  • Retry loops have a hard maximum count, enforced at the orchestration layer rather than at the model.
  • Destructive operations require explicit confirmation.
  • Cost has a per-task cap, and the agent terminates rather than continuing past it.

None of these is sophisticated. All of them are usually missing the first time something goes badly wrong.
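
A minimal sketch of the first three items, wrapped around a generic tool callable; the `timeout` keyword and the constants are assumptions, so adapt them to your tool interface:

```python
import time

MAX_RETRIES = 4        # hard cap on attempts, enforced here rather than by the model
BACKOFF_CAP_S = 30.0   # ceiling on the exponential backoff
TIMEOUT_S = 15.0       # per-invocation timeout

def call_tool(tool, *args, **kwargs):
    """Invoke a tool with a timeout, capped exponential backoff, and a retry cap."""
    kwargs.setdefault("timeout", TIMEOUT_S)  # assumes the tool accepts a timeout kwarg
    delay = 1.0
    for attempt in range(MAX_RETRIES):
        try:
            return tool(*args, **kwargs)
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # surface the failure; never continue as if the call succeeded
            time.sleep(delay)
            delay = min(delay * 2, BACKOFF_CAP_S)
```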

A maturity model from demo to production

Most agent projects fail at one of three transitions, not at the final deploy. A useful maturity model has four stages, each with artifacts you can point to:

  • Stage 1 — Demo. Works on hand-picked inputs. No instrumentation, no fallbacks. Cost and latency are anecdotal. Fine — for a demo.
  • Stage 2 — Beta. Curated eval set exists. Basic tracing. A handful of internal users. Failure modes have names but not yet metrics. Most projects stop here, which is where “we have an agent in production” begins to mean dangerously different things in different rooms.
  • Stage 3 — Production. SLIs defined, SLOs published, error budget tracked. Offline eval suite runs in CI. Online evals on sampled traffic. Fallback layers wired in. On-call rotation that knows what an agent incident looks like. Postmortems for failures, with new test cases added to the eval suite as the closeout deliverable.
  • Stage 4 — Reliable production. All of the above, plus shadow-mode testing of changes against production traffic, automated rollback when SLIs regress, and a regular reliability review treated with the same seriousness as a security review. Few teams are here yet; the ones that are tend to have prior SRE depth they ported over deliberately.

The model is diagnostic, not aspirational. Honest application of it usually reveals that the agent leadership thought was “in production” is actually at Stage 2, and that the next investment should be the eval harness and the SLIs, not another tool or another fine-tune.

The teams that will be running reliable agents two years from now are not the ones with the best prompts. They are the ones who recognized early that an agent is a production system, that the operations playbook for production systems already exists and works, and that “the model isn’t quite there yet” is not an engineering excuse — it is the engineering problem itself. The discipline already has a name; we just need to apply it.