Beyond RAG: The Rise of Context Engineering

RAG didn't die — the field grew up around it. A working mental model for context engineering, the discipline replacing the prompt-is-a-string view

The shift from RAG to context engineering is not a fashion change — it is a recognition that the prompt is a system, not a string.

For about two years, every team building serious LLM applications has been doing some version of the same thing: chunk documents, embed them, do top-k cosine similarity at query time, stuff the chunks into a prompt, hope for the best. Retrieval-augmented generation was the dominant pattern, the default architecture, and — depending on which thread you read this week — either the bedrock of modern AI engineering or a dead concept that hot takes love to bury.
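
To make the baseline concrete, here it is in miniature. This is a sketch rather than any particular framework's API: embed and llm stand in for whatever embedding and completion calls a team already uses, and the chunk vectors are assumed to be unit-normalized.

```python
import numpy as np

def naive_rag(query, chunks, chunk_vecs, embed, llm, k=5):
    """Classic top-k RAG: embed the query, take the k nearest chunks, stuff, generate.

    chunk_vecs: (n, d) unit-normalized embeddings of `chunks`
    embed(texts) -> (len(texts), d) array; llm(prompt) -> str
    """
    q = embed([query])[0]
    scores = chunk_vecs @ q                           # cosine similarity on normalized vectors
    top = [chunks[i] for i in np.argsort(-scores)[:k]]
    prompt = ("Answer using only the context below.\n\n"
              + "\n\n".join(top) + f"\n\nQuestion: {query}")
    return llm(prompt)
```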

The reality is more interesting. RAG did not die. The field grew up around it. The discipline that has emerged — what Anthropic, the academic survey literature, and most serious practitioners now call context engineering — treats retrieval as one component among many, and treats the entire bundle of tokens flowing into the model as a system to be designed, budgeted, and measured. This post lays out what that discipline actually is, why naive RAG broke down, and a mental model engineers can use to design context pipelines that work in production.

The “RAG is dead” debate misses the actual shift

The hot-take version of the past year runs like this: long-context models with million-token windows arrived, you can just dump everything in, RAG is obsolete. The serious version is the opposite. Bigger windows made it more obvious that raw context capacity and useful context are different problems, and that solving the second is the engineering challenge.

The signal came from two places. First, Liu and colleagues’ 2023 paper Lost in the Middle showed empirically that LLMs attend better to information at the start and end of a long context than in the middle — a U-shaped attention curve that persisted across model families and did not go away simply by making the window larger. Second, Chroma’s 2025 Context Rot research tested 18 frontier models and found that performance degrades on every single one as input length grows, often well before the nominal context limit. Nor is the decline always gradual; some models hold steady and then drop sharply.

Both findings point to the same conclusion. The context window is a finite attention budget, not a bigger bucket. You can pour a million tokens into a 1M-context model, but the model still has to decide what to look at, and naive stuffing makes that decision worse, not better. That observation is what context engineering rests on.

What context engineering actually means

Anthropic’s engineering team published a reference guide on this in late September 2025, and Mei et al.’s survey paper gave it an academic skeleton in July. The definition that emerges across these sources runs roughly: context engineering is the discipline of optimizing the entire set of tokens given to an LLM at inference time — instructions, retrieved data, tool definitions, conversation history, memory — to maximize task performance while respecting context as a finite, costly resource.

A useful way to position it relative to neighboring terms:

  • Prompt engineering is about one string — how you phrase the instruction. Tactical, often manual, often disposable.
  • RAG is one specific technique inside context engineering: fetch external documents at query time, paste them into the prompt.
  • Context engineering is the system-level discipline that owns all five sources of tokens — system instructions, retrieved data, tool results, conversation history, and memory — and treats them as a budget to be allocated.

The reframe matters because the failure modes change once you accept it. A bad RAG result used to be diagnosed as “the retriever returned the wrong chunk.” With a context-engineering lens, the same incident gets diagnosed as some combination of: bad retrieval, bad ranking, missing query rewrite, stale memory, redundant tool output, or compaction loss. All five layers can fail independently. You cannot fix the system by tuning one of them in isolation.

Why bigger context windows did not fix it

A million-token context window does not let you skip context engineering. Three reasons, in order of practical importance.

First, attention is not uniform across the window. Lost-in-the-middle, context rot, the U-shaped curve — by whatever name, the empirical fact is that model accuracy degrades with input length, and degrades most for information placed away from the edges. Architectural choices like causal masking and rotary position embeddings create this bias structurally; it is not eliminated by scaling.

Second, the economics get worse fast. Input tokens cost real money. A million-token prompt can be an order of magnitude or more expensive per call than the same query handled with a tightly retrieved 20K-token context. For interactive applications with non-trivial query volume, the difference is the difference between a viable product and one that bleeds margin.
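
The arithmetic is blunt. At an illustrative rate of $3 per million input tokens (a placeholder, not any vendor's actual price list), a modest query volume makes the gap obvious:

```python
RATE_PER_MTOK = 3.00          # illustrative $/1M input tokens, not a real price

def daily_input_cost(prompt_tokens: int, calls_per_day: int) -> float:
    return prompt_tokens / 1_000_000 * RATE_PER_MTOK * calls_per_day

print(daily_input_cost(1_000_000, 10_000))   # 30000.0 -> full-window prompts
print(daily_input_cost(20_000, 10_000))      # 600.0   -> tight 20K retrieved context
```

Prompt caching and batching narrow the gap, but the ratio in raw input tokens is structural.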

Third, noise compounds in agentic loops. An agent running a multi-step task accumulates tool results, search hits, intermediate reasoning, and partial outputs in its working context. Anthropic’s post on long-running agents calls this the central engineering problem of long-horizon work: context fills with low-value material, signal-to-noise falls, the model starts compounding small errors, and accuracy drops. The fix is not a bigger window. The fix is active context management.

A working mental model: the context budget

The cleanest mental model I have seen — and the one closest to how teams that ship reliable agents actually think — is to treat the context window as a budget with five line items:

  1. System instructions. Identity, capabilities, tools, output format. Should be terse, durable, and free of contradiction. Anthropic’s guidance calls for the “right altitude”: specific enough to constrain, general enough to flex.
  2. Tool definitions. Schemas the model reads to decide what to call. Every tool definition costs tokens. Curate aggressively. Production agents tend to enforce hard limits because tool-selection accuracy degrades past a few dozen options.
  3. Retrieved data. Documents, code, knowledge-base hits — the traditional RAG payload. Should be reranked, deduplicated, and positioned near the start or end of context, never the middle.
  4. Memory. Anything that persists across sessions — user facts, project state, prior decisions. Distinct from retrieval because it is stable, identity-bound, and usually small.
  5. Conversation history. Prior turns, tool results, intermediate reasoning. The category that grows fastest in agentic loops and is the most aggressive consumer of budget.

The unifying principle, in Anthropic’s framing, is to find the smallest set of high-signal tokens that maximize the likelihood of the desired outcome. Every layer is a candidate for compression, deduplication, or replacement with a tool call that fetches on demand.
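
One way to make the budget tangible is to represent the five line items explicitly and enforce the ceiling at assembly time. A minimal sketch, with a deliberately crude token counter standing in for a real tokenizer and a purely illustrative limit:

```python
from dataclasses import dataclass

def count_tokens(text: str) -> int:
    return len(text) // 4                      # crude stand-in for a real tokenizer

@dataclass
class ContextBudget:
    system: str              # 1. durable instructions
    tools: str               # 2. curated tool schemas
    retrieved: list[str]     # 3. reranked, deduplicated RAG payload
    memory: str              # 4. small, stable, identity-bound facts
    history: list[str]       # 5. prior turns and tool results, oldest first
    max_tokens: int = 20_000

    def assemble(self, question: str) -> str:
        def parts():
            # Retrieved chunks sit after history, right before the question.
            return [self.system, self.tools, self.memory,
                    *self.history, *self.retrieved, question]
        # Drop the oldest turns until the prompt fits; a real system would
        # compact them into a summary rather than discard them outright.
        while sum(map(count_tokens, parts())) > self.max_tokens and self.history:
            self.history.pop(0)
        return "\n\n".join(parts())
```

Even a toy version encodes the important decision: when the budget runs out, something has to give, and the pipeline should choose what, deliberately, rather than letting silent truncation choose for it.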

Patterns that earn their keep

A short list of techniques that have moved from “interesting paper” to “expected by production users”:

  • Reranking on top of retrieval. Vector search is fast and lossy. A cross-encoder or LLM-based reranker over the top fifty hits, narrowed to the top five, consistently beats top-k retrieval alone. The improvement comes from the reranker actually reading the query and document together rather than relying on embedding similarity; a minimal sketch follows this list.
  • Query rewriting and decomposition. Real user queries are messy, ambiguous, and often multi-hop. Rewriting the query — or generating a hypothetical answer to embed, in the HyDE pattern — before retrieval routinely outperforms searching the raw query.
  • Just-in-time loading. Instead of dumping all potentially relevant context upfront, give the agent tools to fetch what it needs — and trust it to ask. This is the operating model behind Claude Code and most agentic coding tools, and it generalizes well to enterprise agents.
  • Compaction and summarization. For long-horizon tasks, periodically compress the conversation history into a structured summary. Done well, this preserves decisions and discarded options while shedding raw tool outputs. The Agentic Context Engineering paper by Zhang et al. formalizes one version of this as an explicit generate-reflect-curate loop; a minimal sketch appears below.
  • Hierarchical memory. A short-term working buffer, a session-level summary, and a long-term persistent store, each with different write and retrieval rules. The MemGPT pattern popularized this design and most production memory systems now borrow from it.
  • Position-aware ordering. Given lost-in-the-middle, place the most important retrieved context at the end of the prompt, just before the user’s question. The empirical lift is often larger than swapping retrievers.
  • Prompt caching. Stable parts of the prompt — system instructions, tool definitions, large reference documents — can be cached at the model layer. Both Anthropic and OpenAI support this. It is the cheapest meaningful optimization most teams have not enabled.
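
The reranking and ordering bullets translate almost directly into code. A sketch assuming the sentence-transformers CrossEncoder interface, with a commonly used public checkpoint as a stand-in; any cross-encoder will do:

```python
from sentence_transformers import CrossEncoder   # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # load once, reuse

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Rescore ~50 vector-search hits by reading query and document together."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:keep]
    best_first = [candidates[i] for i in order]
    return best_first[::-1]   # reversed so the strongest chunk sits last, nearest the question
```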

The right combination depends on the task, but the pattern is consistent: design retrieval, ranking, compression, memory, and tool use as one pipeline, not as five disconnected choices.
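
Compaction is the piece teams most often hand-roll, so here is one minimal shape for it. The llm and count_tokens callables are stand-ins for your own completion call and tokenizer, and the thresholds are illustrative:

```python
COMPACTION_PROMPT = (
    "Summarize the working history below into four sections: decisions made, "
    "options considered and rejected, open questions, and current state. "
    "Omit raw tool output.\n\n{history}"
)

def maybe_compact(history: list[str], llm, count_tokens,
                  budget: int = 8_000, keep_recent: int = 4) -> list[str]:
    """Fold older turns into one structured summary once history outgrows its budget."""
    if sum(count_tokens(turn) for turn in history) <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = llm(COMPACTION_PROMPT.format(history="\n\n".join(older)))
    return [f"[Summary of earlier turns]\n{summary}", *recent]
```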

When you should skip RAG entirely

A pragmatic counterweight to the discipline above: there are cases where RAG is overkill and a simpler design wins.

  • The full document fits in context and the task is one-shot. Loading a fifty-page contract and asking specific questions is often better done by passing the whole document. RAG adds latency, chunking artifacts, and an extra failure mode for nothing.
  • The model already knows the material. Standard documentation for popular libraries, common protocols, and well-known APIs is in training data. RAG’ing the React docs is usually worse than just asking the model.
  • The task is deterministic. If the answer is “look up record X and return field Y,” call the database directly. Embedding-based retrieval is a probabilistic last resort for unstructured corpora, not a replacement for keyed lookups.
  • The corpus is small and stable. Under a few hundred documents, you can often match or beat a vector store with BM25 or even keyword matching — at a fraction of the operational cost.
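
For that small-and-stable case, the entire retrieval layer can be a few lines of BM25. A sketch using the rank_bm25 package, with a placeholder corpus:

```python
from rank_bm25 import BM25Okapi        # pip install rank-bm25

corpus = ["..."]                        # a few hundred small, stable documents
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def search(query: str, k: int = 5) -> list[str]:
    return bm25.get_top_n(query.lower().split(), corpus, n=k)
```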

The point is not that RAG is bad. The point is that RAG earns its place when the corpus is large, dynamic, or unstructured, and quietly gets in the way when it is not.

The teams that ship reliable LLM applications in 2026 are not the ones with the cleverest retrievers or the largest context windows. They are the ones who treat the context window as a budget, instrument every layer, measure what each token is doing for them, and accept that the model’s behavior is downstream of what you put in front of it. RAG is a tool inside that discipline, not a synonym for it — and the engineering teams that internalize the difference will spend the next year shipping things their competitors cannot.