The AI Gateway: The New Infrastructure Layer Every Enterprise Will Need

Every architect is asking the same question — do we need one? A taxonomy of what an AI gateway actually does, and a decision framework for build versus buy

The infrastructure layer most teams will build twice — once wrong, then once with intention.

There is a recognizable arc in the technology industry. A new class of workload emerges. Teams build support for it ad hoc, scattered across application code, each service wiring up its own connections and policies. Two years later, the pattern repeats often enough that a new infrastructure layer appears with a name: load balancer, API gateway, service mesh, secrets manager. That arc is now repeating for LLM traffic.

The term that has stuck is “AI gateway,” though “LLM gateway” and “LLM proxy” are used roughly interchangeably. Whatever you call it, the function is the same: a piece of infrastructure between your applications and the model providers — OpenAI, Anthropic, Bedrock, self-hosted endpoints — handling cross-cutting concerns every team would otherwise reimplement. Cost control. Fallbacks. Rate limiting. PII redaction. Audit. Caching. Model routing.

Every engineering leader I have talked to in the past six months has asked some version of the same question: do we need one, and if so, should we build or buy? The answer depends on what you actually want the gateway to do, which is the part most posts on this topic skip. So let's start there.

Why AI gateways exist as a category

For the first wave of production LLM applications, the integration looked like this: import the OpenAI SDK, paste in an API key, ship. That worked for prototypes and surprisingly well for single-team applications. Then several pressures arrived at once.

Multiple providers became normal. Teams that started on OpenAI began calling Anthropic for some workloads and a self-hosted model for others; suddenly each application had three SDKs and three sets of error handling. Compliance turned into a real concern in regulated industries, where sending customer data to a third-party endpoint with no audit trail was no longer acceptable. Cost ran away — production LLM bills became one of the larger line items in many engineering budgets, and nobody could tell which team was responsible for which share. And providers started having outages. “What happens when the model API is down” stopped being theoretical the first time a major provider’s region went dark.

These pressures share a property: they are cross-cutting concerns that do not belong in application code. The same kind of pressure produced API gateways for microservices, authorization proxies for identity, and observability platforms for metrics. AI gateways are the same evolution applied to LLM traffic.

The category is real even though the vendor landscape is messy. LiteLLM is the open-source reference implementation most teams encounter first. Cloudflare, Kong, and AWS have produced AI gateway offerings that extend their existing API gateway products. Purpose-built entrants like Portkey, Helicone, and TrueFoundry have built standalone products. “What is an AI gateway?” gets a different answer from each, and that is part of the confusion.

What an AI gateway actually does

A useful taxonomy strips the vendor framing away and asks what functions an AI gateway provides. They fall into six categories.

Unified interface. A single API your application code talks to, regardless of which provider serves the request. The de facto standard is the OpenAI Chat Completions format — most gateways present that interface and translate to whichever provider the request is routed to. The smallest possible AI gateway, and the one most teams build first themselves.
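To make the unified interface concrete, here is a minimal sketch of what the application side looks like: the standard OpenAI client pointed at a gateway instead of the provider. The gateway URL, virtual key, and model alias are hypothetical placeholders.

```python
from openai import OpenAI

# Point the standard OpenAI client at the gateway instead of api.openai.com.
# The URL, key, and model alias below are hypothetical; substitute whatever
# your gateway exposes.
client = OpenAI(
    base_url="https://llm-gateway.internal.example.com/v1",
    api_key="gateway-virtual-key",  # issued by the gateway, not the provider
)

# The application speaks Chat Completions; the gateway decides whether this
# lands on OpenAI, Anthropic, Bedrock, or a self-hosted model.
response = client.chat.completions.create(
    model="default-chat",  # a gateway-side alias, not a provider model name
    messages=[{"role": "user", "content": "Summarize this ticket for the on-call engineer."}],
)
print(response.choices[0].message.content)
```

The application never imports a provider-specific SDK; swapping providers becomes a gateway configuration change rather than a code change.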

Routing. Which provider, model, and region serves a given request. Routing can be static (production goes to one model, staging to a smaller one) or dynamic (by task class, by user tier, by cost cap, by current provider health). This is where the gateway earns its keep — every team eventually needs this, and reimplementing it across services is wasteful.
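As a sketch of the kind of logic this involves, here is a small routing function combining a static task-to-model map with dynamic overrides for user tier and provider health. The model names, tiers, and health map are illustrative, not any particular product's API.

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    task: str        # e.g. "summarize", "code-review", "chat"
    user_tier: str   # e.g. "free", "enterprise"
    est_tokens: int

# Static defaults: task class -> model. Names are illustrative.
STATIC_ROUTES = {
    "summarize": "small-fast-model",
    "code-review": "large-reasoning-model",
    "chat": "default-chat-model",
}

def route(ctx: RequestContext, provider_healthy: dict[str, bool]) -> str:
    """Pick a model for this request: static default, then dynamic overrides."""
    model = STATIC_ROUTES.get(ctx.task, "default-chat-model")

    # Dynamic override by user tier: free traffic stays on the cheap model.
    if ctx.user_tier == "free":
        model = "small-fast-model"

    # Dynamic override by provider health: fail over if the primary is down.
    if not provider_healthy.get(model, True):
        model = "secondary-provider-model"

    return model
```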

Fallbacks and retries. When a provider returns a rate-limit error, timeout, or content-policy refusal, the gateway can retry on the same provider or fall back to a different model. The discipline that matters: separate fallbacks for transient failures (timeouts, rate limits) from fallbacks for policy failures (content refusal, context window exceeded). Each calls for a different fallback model. LiteLLM, for instance, distinguishes general fallbacks from context-window fallbacks from content-policy fallbacks — the shape most production setups end up needing.
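A sketch of that discipline in plain Python, with the error classes and model names as hypothetical stand-ins (LiteLLM and most managed gateways express the same idea declaratively in configuration):

```python
import time

# Separate fallback chains per failure class. Model names are illustrative.
TRANSIENT_FALLBACKS = ["primary-model", "same-tier-alternate"]   # timeouts, rate limits
CONTEXT_FALLBACKS = ["long-context-model"]                       # context window exceeded
POLICY_FALLBACKS = ["provider-with-different-policy"]            # content refusals

class Transient(Exception): pass
class ContextExceeded(Exception): pass
class PolicyRefusal(Exception): pass

def call_with_fallbacks(call, prompt, max_retries=2):
    """Retry transient errors in place; switch chains only for the matching failure class."""
    queue, tried = list(TRANSIENT_FALLBACKS), set()
    while queue:
        model = queue.pop(0)
        tried.add(model)
        for attempt in range(max_retries + 1):
            try:
                return call(model, prompt)
            except Transient:
                time.sleep(2 ** attempt)   # backoff, then retry the same model
            except ContextExceeded:
                queue = [m for m in CONTEXT_FALLBACKS if m not in tried]
                break                      # needs a bigger context window, not a retry
            except PolicyRefusal:
                queue = [m for m in POLICY_FALLBACKS if m not in tried]
                break                      # needs a different provider policy, not a retry
    raise RuntimeError("all fallbacks exhausted")
```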

Cost control. Per-team, per-application, per-user budgets and rate limits, with visibility into who is spending how much. Enforcement has to live in the gateway — application-level cost limits do not survive long because they rely on every team adding and maintaining the same code.
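A minimal sketch of what gateway-side enforcement looks like. Team names, budgets, and pricing are hypothetical, and a real gateway would keep the counters in a shared store rather than process memory.

```python
from collections import defaultdict

# Monthly budgets per team, in dollars. Values are illustrative.
TEAM_BUDGETS = {"search": 5000.0, "support-bot": 2000.0, "experiments": 500.0}
spend = defaultdict(float)  # in production: a shared store, not process memory

# Illustrative per-token pricing; a real gateway looks this up per model.
PRICE_PER_1K_TOKENS = {"small-fast-model": 0.0005, "large-reasoning-model": 0.01}

class BudgetExceeded(Exception): pass

def check_and_record(team: str, model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Reject the request if the team's budget is spent; otherwise record the cost."""
    cost = (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K_TOKENS[model]
    if spend[team] + cost > TEAM_BUDGETS.get(team, 0.0):
        raise BudgetExceeded(f"{team} has exhausted its monthly LLM budget")
    spend[team] += cost
    return cost
```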

Audit and observability. Every request and response logged with enough context to investigate later: which user, which team, which model, which version, latency, cost, errors. This is the layer regulated industries care about most, and the layer most build-it-yourself gateways underinvest in.
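In practice, “enough context to investigate later” usually means one structured record per request, along these lines (field names are illustrative):

```python
import json, time, uuid

def audit_record(team, user_id, model, prompt_tokens, completion_tokens,
                 latency_ms, cost_usd, status, prompt_version=None):
    """One structured log line per request; full payloads can be stored (or sampled) separately."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "team": team,
        "user_id": user_id,              # or a pseudonymous ID in regulated settings
        "model": model,
        "prompt_version": prompt_version,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "status": status,                # e.g. ok, rate_limited, refused, error
    })
```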

Redaction and guardrails. Detecting and redacting PII or other sensitive data before it leaves your network. Blocking prompts that match policy patterns. Enforcing that certain traffic classes never route to certain providers. The controls that turn an AI gateway into “the only way compliance signs off on production.”
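A deliberately simple sketch of the redaction step. The regex patterns are toys; production deployments use purpose-built PII detection. The point is where the check runs: in the gateway, before the prompt leaves your network.

```python
import re

# Toy patterns for illustration only; production systems use dedicated PII detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with placeholders and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}_REDACTED]", text)
    return text, found

# The gateway runs this on every outbound prompt (and often on responses too)
# before the request is allowed to leave the network.
```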

A useful test: if a team is implementing more than two of these functions in application code, they probably want a gateway. If they are implementing four or more, they almost certainly do.

Where it sits: the AI gateway and your API gateway

The most common confusion among architects is whether an AI gateway replaces, complements, or sits inside an existing API gateway. The honest answer is “complements, usually, but not always.”

API gateways are built around HTTP requests to backend services: authentication, request routing, rate limiting per route, request transformation. They are excellent at this, and most enterprises already have a mature one running.

LLM traffic has properties traditional API gateways do not handle natively. Responses are streamed token by token, not returned as a single payload — many older API gateways struggle with server-sent events and long-lived connections. Rate limiting needs to happen on tokens and dollars, not just request counts. Caching has to be semantic (similar prompts return similar responses), not URL-based. Content needs to be inspected for PII inside the request and response body.
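Semantic caching in particular is a different mechanism from URL-keyed caching: the cache key is the meaning of the prompt, not its bytes. A sketch, assuming some embedding function is supplied (the embed callable and the similarity threshold are placeholders):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Cache keyed on prompt similarity rather than exact URL or body match."""
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # placeholder: any text-embedding function
        self.threshold = threshold  # similarity above which a lookup counts as a hit
        self.entries = []           # (embedding, response); a vector store in practice

    def get(self, prompt):
        query = self.embed(prompt)
        for emb, response in self.entries:
            if cosine(query, emb) >= self.threshold:
                return response     # a sufficiently similar prompt was answered before
        return None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```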

The integration patterns that work in practice:

  • API gateway in front, AI gateway behind. External clients hit the API gateway for authentication, identity, and traditional rate limiting. The API gateway routes LLM-bound traffic to the AI gateway, which handles model routing, fallbacks, cost limits, and content inspection. The cleanest split, and the most common pattern in larger enterprises.
  • AI gateway only. For internal services calling LLMs, an AI gateway alone is often sufficient — the single egress point for model traffic. Internal identity is usually handled by your service mesh.
  • Unified gateway. Some API gateway products (Kong, Apache APISIX) have added LLM-specific plugins. If your API gateway already supports streaming responses and token-aware rate limiting, this is a reasonable consolidation — but only when you already operate the underlying gateway competently.

The wrong answer is “we will just add LLM handling to our existing API gateway” without checking whether the gateway handles streaming, token accounting, and content inspection at the level your workload needs. Many do not.

Build vs buy: a decision framework

For an AI gateway, the build-versus-buy question has a clearer answer than it does for most infrastructure. The decision rests on four properties of your situation.

How many of the six functions do you need? If you need only unified interface and basic routing, building is reasonable — a few hundred lines of Python wrapping the provider SDKs gets you most of the way. If you need fallbacks, cost control, audit, and redaction, buying or adopting an open-source gateway is almost always cheaper than building.

How much LLM traffic do you have? Below a meaningful spend threshold, the gateway is a nice optimization. Above it, the gateway becomes the layer that decides whether your AI bill is predictable. The break-even is where the savings on cost control alone exceed the cost of running it.
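A back-of-the-envelope version of that break-even, with every number hypothetical:

```python
# Hypothetical numbers; substitute your own.
monthly_inference_spend = 40_000   # USD
expected_savings_rate = 0.15       # from routing, caching, and budget enforcement
gateway_monthly_cost = 2_500       # licence or hosting plus a slice of platform-team time

savings = monthly_inference_spend * expected_savings_rate   # 6,000 USD at these numbers
worth_it = savings > gateway_monthly_cost                    # True at these numbers
```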

What is your compliance posture? Regulated industries with strict audit requirements should buy or adopt an open-source gateway with mature audit capabilities. The audit and redaction layers are the most expensive parts to build well, and the parts where the cost of bugs is highest.

Do you have a platform team? If yes, adopting an open-source gateway and operating it as part of your internal platform is the most common winning pattern. If no, a managed gateway is the answer — running infrastructure without a team to own it produces predictable outages.

The pattern that has emerged across most teams I see: start with LiteLLM in proxy mode as the reference architecture, prove out the functions you actually need, decide later whether to swap for a managed product or build something bespoke. The open-source option is good enough to learn on, and the migration cost away from it is low because every gateway presents the same OpenAI-compatible interface.

The pitfalls nobody warns you about

A few patterns that consistently bite teams that adopt a gateway without thinking through the second-order effects.

The gateway becomes a single point of failure. Every LLM call now depends on it. Plan for the gateway being down with the same care you would plan for your API gateway being down: redundancy, health checks, escape hatches. Some teams keep a “bypass the gateway” capability for narrowly scoped emergencies; others rely on multi-region gateway deployments.
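One shape the “bypass the gateway” escape hatch can take, as a sketch: route everything through the gateway, and only fall back to a direct provider call when the gateway is unreachable and an explicit incident flag is set. The endpoint, key handling, and model names are hypothetical.

```python
from openai import OpenAI, APIConnectionError

GATEWAY = OpenAI(
    base_url="https://llm-gateway.internal.example.com/v1",  # hypothetical gateway endpoint
    api_key="gateway-virtual-key",
)
DIRECT = OpenAI()  # uses the provider key from the environment

def chat(messages, model="default-chat", emergency_bypass=False):
    """Route through the gateway; go direct only when it is unreachable and the
    bypass has been explicitly enabled for the incident."""
    try:
        return GATEWAY.chat.completions.create(model=model, messages=messages)
    except APIConnectionError:
        if not emergency_bypass:
            raise  # gateway is down and no bypass has been approved
        # Direct calls skip cost limits, audit, and redaction; incident use only.
        return DIRECT.chat.completions.create(model="gpt-4o-mini", messages=messages)
```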

Latency adds up. The gateway adds tens of milliseconds in the best case, more if it does PII inspection or semantic caching synchronously. For latency-critical paths — autocomplete, voice agents — measure and verify; for batch and reasoning paths, the gateway’s latency cost is usually inconsequential next to the inference time itself.

Audit logs grow fast. Every LLM request, with full prompt and response, multiplied by every team using the gateway. The storage cost can rival the inference cost if you are not deliberate about retention and sampling. A common compromise is to log every request’s metadata but sample full payloads.
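That compromise is small to implement. A sketch, with the sampling rate as the obvious knob (the rate and field names are illustrative):

```python
import random

PAYLOAD_SAMPLE_RATE = 0.05   # keep full prompt/response for ~5% of requests (illustrative)

def log_request(metadata: dict, prompt: str, response: str, log_sink):
    """Always log metadata; attach full payloads only for a sampled fraction of requests."""
    record = dict(metadata)
    if random.random() < PAYLOAD_SAMPLE_RATE:
        record["prompt"] = prompt
        record["response"] = response
    log_sink(record)
```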

The unified API ages out. The OpenAI Chat Completions format is the de facto standard today; tool-calling and other features have already created provider-specific extensions that gateways translate inconsistently. The unified interface is useful but lossy. For workloads that depend on provider-specific features — extended thinking, specific tool-use formats, native multimodal — test that the gateway preserves them before assuming the abstraction is free.

The teams that get the most value from an AI gateway treat it like any other piece of platform infrastructure: a deliberate investment in cross-cutting concerns, owned by a platform team, evolved with the workloads it serves. The teams that get the least value are the ones that bought a gateway because the category sounded important, then never moved the responsibilities they were supposed to move into it. The gateway itself is not the value. The discipline of treating LLM traffic as production infrastructure is. The gateway is just the natural place for that discipline to live.