Memory Architecture for AI Agents: The Real Hard Problem
Context windows are not memory. A four-tier model from cognitive science, the storage choices that follow, and the security question every team is missing
The longer your agent runs, the more obvious it becomes that “context window” and “memory” are different things.
For most of 2024, the answer to “how does an AI agent remember things” was a shrug and “we just keep adding to the context window.” It worked for short interactions, masked the problem for medium ones, and broke completely the moment agents started operating across sessions, across users, or across time horizons longer than a model could reasonably hold in attention. The industry tried to solve it by making the context window larger. Frontier models now offer windows in the millions of tokens. The problem persisted. Bigger context windows are not memory in any useful sense — they are larger short-term buffers.
The interesting work in 2026 has moved in a more disciplined direction: borrowing the multi-tier memory model from cognitive science and applying it deliberately to agents. The canonical academic framing is the CoALA paper (Cognitive Architectures for Language Agents), which adapted decades of psychological research on memory types into a usable taxonomy for LLM-based systems. Most major agent-memory frameworks — Letta, Mem0, Zep, LangChain’s memory abstractions — now build on this taxonomy, with implementation variations that are mostly cosmetic.
This post is the architecture reference. The four tiers of memory, the storage choices that follow from them, the retrieval-versus-rehydration tradeoff that determines real-world latency, and the security model that almost nobody is designing yet.
Why “longer context window” is not the answer
The seductive idea is that if context windows keep getting larger, the memory problem dissolves. Just put everything in the window. The math does not work out for three reasons.
First, effective context is significantly smaller than nominal context. Models with multi-million-token windows do not actually attend evenly across all of those tokens. The “lost in the middle” problem is well documented; the practical limit is much smaller than the marketing number.
Second, cost scales with the token count of every call. An agent that maintains a long-running conversation by re-sending the whole history pays for that history again on every turn, so cumulative spend grows quadratically with conversation length. The cost trajectory of a multi-day agent task with naive context management is brutal in a way that fixed-cost queries do not telegraph; the sketch below makes the curve concrete.
Third — and this is the key one — context windows are per session. Agents that need to remember across sessions, across users, or across operational restarts cannot solve that with a context window alone. The information has to live somewhere outside the model.
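A back-of-the-envelope sketch of the cost point. The price and token counts below are invented for illustration; what matters is the shape: naive resend grows quadratically with turn count, retrieval-on-demand grows linearly.

```python
# Illustrative numbers only: price and per-turn token counts are assumptions.
PRICE_PER_1K_INPUT = 0.003  # dollars per 1K input tokens (assumed)
TOKENS_PER_TURN = 800       # average new tokens produced per turn (assumed)
SYSTEM_PROMPT = 2_000       # fixed tokens sent on every call

def naive_cost(turns: int) -> float:
    """Re-send the entire history on every turn: cumulative cost is quadratic."""
    total, history = 0, SYSTEM_PROMPT
    for _ in range(turns):
        total += history            # pay for everything accumulated so far
        history += TOKENS_PER_TURN  # and the history keeps growing
    return total / 1000 * PRICE_PER_1K_INPUT

def retrieval_cost(turns: int, retrieved: int = 1_500) -> float:
    """Send only the system prompt plus retrieved context: cost is linear."""
    return turns * (SYSTEM_PROMPT + retrieved + TOKENS_PER_TURN) / 1000 * PRICE_PER_1K_INPUT

for n in (10, 100, 1_000):
    print(f"{n:>5} turns: naive ${naive_cost(n):>9.2f}   retrieval ${retrieval_cost(n):>7.2f}")
```

At ten turns the two approaches cost roughly the same; at a thousand turns the naive approach costs about two orders of magnitude more.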
The reframe that has stuck: context window is working memory; everything else needs a different architecture. CoALA makes this point cleanly by mapping the cognitive-science distinction onto agents: working memory holds what is immediately needed for the current decision; long-term memory holds what persists across decisions, sessions, and lifetimes.
The four-tier memory model
The CoALA taxonomy distinguishes four memory types, and the distinctions are operationally meaningful rather than just academic. Each type has different access patterns, different storage requirements, and different failure modes.
Working memory. What is in the model’s context window right now: the current conversation, recent tool outputs, retrieved context, system prompt. Volatile, fast, constrained. Its limit is the effective context window. Most failures attributed to “the model isn’t smart enough” are actually working-memory failures: too much in the window, too little relevant, or the relevant piece buried in the middle.
Episodic memory. What happened, in what order, in what context. A log of past interactions, decisions, tool calls, and outcomes, indexed by time and by the situation that produced them. Episodic memory is what lets an agent answer “what did we decide about the diesel engine on Tuesday and why did we reject the alternative?” It is the agent’s experiential record, and it is the memory type most often confused with “RAG over chat history” — they are related but not the same. Episodic captures specific events with full context; RAG retrieves from a curated corpus.
Semantic memory. Facts about the world that are true independent of when or how the agent learned them. Customer profiles, product specs, organizational structure, domain rules. Semantic memory generalizes — what an agent “knows,” not what it “remembers experiencing.” A test for whether something belongs in semantic memory rather than episodic: would the fact still be true if the agent’s history had been different?
Procedural memory. How to do things. Skills, decision rules, workflows the agent has learned to follow. In practice, procedural memory for LLM agents is often encoded as system prompts, tool descriptions, agent skills, or learned routines — the “this is how we handle X in this organization” knowledge that is not a fact about the world but a fact about how the agent operates within it.
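One way to keep the tiers honest is to give each its own record type with the fields its access pattern demands. A minimal sketch, not any framework’s schema; every field name here is an assumption. Working memory gets no record type because it is just the prompt being assembled.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Episode:
    """Episodic: a specific event, indexed by time and the situation around it."""
    occurred: datetime
    event: str                  # what happened: message, tool call, decision
    context: dict               # the situation that produced it
    outcome: str | None = None  # filled in later, once the result is known

@dataclass
class Fact:
    """Semantic: true regardless of how or when the agent learned it."""
    subject: str
    predicate: str
    value: str
    source_episodes: list[str] = field(default_factory=list)  # provenance

@dataclass
class Procedure:
    """Procedural: how to do something, plus evidence that it works."""
    name: str
    steps: list[str]
    success_count: int = 0
```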
The four types interact. Episodic memory consolidates over time into semantic memory: enough specific events of “this user prefers concise responses” become the semantic fact. Procedural memory often distills from repeated episodic patterns: a procedure that has worked five times becomes a learned skill. The consolidation pathway is what distinguishes a memory architecture from a fancier log file.
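The consolidation pathway can start as nothing fancier than a counting rule over the record types sketched above. The threshold of three is an arbitrary assumption; a real system would carry provenance and confidence along with the promotion.

```python
from collections import Counter

PROMOTION_THRESHOLD = 3  # assumption: three consistent observations make a fact

def consolidate(episodes: list[Episode]) -> list[Fact]:
    """Promote recurring episodic patterns into semantic facts."""
    seen = Counter(
        (ep.context.get("user", "unknown"), ep.event) for ep in episodes
    )
    return [
        Fact(subject=user, predicate="exhibits_pattern", value=event)
        for (user, event), count in seen.items()
        if count >= PROMOTION_THRESHOLD
    ]
```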
Not every agent needs all four. A customer-service agent leans on episodic and semantic; a coding assistant on semantic and procedural. The right architecture maps memory tier to use case rather than implementing all four because the framework offers them.
Storage choices: vector, graph, SQL
The next architectural decision is what to store each memory tier in. Three storage primitives dominate, and the choice is not vendor-dependent — it is access-pattern-dependent.
Vector stores are the default for semantic search over unstructured content. Embeddings, similarity search, approximate nearest neighbors. Strong at “find me things that mean something like this.” Weak at structured queries, multi-hop reasoning, and questions that depend on relationships between entities. Use for: large bodies of unstructured episodic memory where the retrieval pattern is “find related past events”; semantic memory expressed as document chunks rather than structured facts.
Graph stores model entities and relationships explicitly. Strong at “who knows whom, what depends on what, what changed when.” Weak at fuzzy semantic search. Use for: semantic memory involving relationships between entities (customer to account to product, employee to team to project); episodic memory where the query pattern involves traversal. The operational cost is higher — graph databases are harder to run than vector stores — and entity extraction is imperfect, which means the graph contains its own noise.
Relational stores are the boring answer that turns out to be correct surprisingly often. SQL handles temporal filtering, structured facts, user metadata, audit logs, and joins between any of the above. Use for: anything where the access pattern is “give me the rows matching these criteria,” which describes more of agent memory than people expect. PostgreSQL with pgvector and a graph extension (AGE or similar) can serve all three patterns on one substrate; the operational savings are substantial for teams already running Postgres.
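A sketch of the single-substrate idea, covering the vector and relational legs; the graph leg (Apache AGE) is omitted for brevity. It assumes the pgvector extension and the psycopg driver, and the table layout, dimensions, and connection string are illustrative, not a recommended schema.

```python
# One Postgres instance answering both fuzzy similarity and structured filters.
import psycopg

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS episodes (
    id        bigserial PRIMARY KEY,
    tenant_id text NOT NULL,
    occurred  timestamptz NOT NULL DEFAULT now(),
    event     text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS episodes_embedding_idx
    ON episodes USING hnsw (embedding vector_cosine_ops)
"""

# Similarity search and a temporal filter in one query: the hybrid payoff.
SIMILAR = """
SELECT event, occurred FROM episodes
WHERE tenant_id = %s AND occurred > now() - interval '30 days'
ORDER BY embedding <=> %s::vector
LIMIT 5
"""

with psycopg.connect("dbname=agent_memory") as conn:  # connection string assumed
    for stmt in DDL.split(";"):
        if stmt.strip():
            conn.execute(stmt)
    embedding = "[" + ",".join("0.1" for _ in range(1536)) + "]"  # stand-in vector
    rows = conn.execute(SIMILAR, ("tenant-a", embedding)).fetchall()
```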
The pattern that has emerged across production deployments is hybrid. Most serious agent-memory systems run a combination — vector for fuzzy semantic candidates, graph for relationships, SQL for structured queries and audit. The orchestration layer routes each query to the right substrate, or queries multiple substrates in parallel and merges results. The framework you pick (Letta, Mem0, Zep, Cognee, custom) matters less than getting the routing layer right.
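The routing layer itself can stay small. A framework-independent sketch; the intent labels and the substrate interface are assumptions, and in production the intent would come from the calling tool, a classifier, or the shape of the query.

```python
from typing import Any, Protocol

class Substrate(Protocol):
    """Anything that can answer a memory query: vector store, graph, SQL."""
    def query(self, q: str, **kwargs: Any) -> list[Any]: ...

def route(q: str, intent: str, substrates: dict[str, Substrate]) -> list[Any]:
    if intent == "similar":   # fuzzy: "find past events like this one"
        return substrates["vector"].query(q, top_k=5)
    if intent == "related":   # traversal: "what touches this entity"
        return substrates["graph"].query(q, max_hops=2)
    if intent == "filter":    # structured: "rows matching these criteria"
        return substrates["sql"].query(q)
    # Unknown intent: fan out to everything and merge; dedup is the caller's job.
    return [r for s in substrates.values() for r in s.query(q)]
```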
Retrieval, rehydration, and forgetting
Three operational disciplines determine whether the architecture actually works in production.
Retrieval versus rehydration. The naive approach to long-running agents is to rehydrate full context at the start of every session — pull in the whole conversation history, the user profile, the recent state. Works at small scales, breaks fast. The better approach is retrieval on demand: keep active working memory small, and pull from long-term memory only what is needed for the current decision. Retrieval looks like a tool call — the agent decides what it needs to recall, calls the memory layer, gets back relevant context, and proceeds. Rehydration is for the narrow case where a known, fixed corpus needs to be loaded; retrieval is for everything else.
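What “retrieval looks like a tool call” can mean in practice: a JSON-schema-style tool definition plus a toy in-memory store standing in for the real memory layer. Every name here (recall, the tiers, the keyword matcher) is an assumption for illustration.

```python
from dataclasses import dataclass

RECALL_TOOL = {  # schema the model sees; shape follows common JSON-schema tool formats
    "name": "recall",
    "description": "Search long-term memory for context relevant to the current task.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "tier": {"type": "string", "enum": ["episodic", "semantic", "procedural"]},
            "limit": {"type": "integer", "default": 5},
        },
        "required": ["query", "tier"],
    },
}

@dataclass
class Snippet:
    text: str

class ToyStore:
    """Stand-in for a real backend; matches on keyword overlap, not embeddings."""
    def __init__(self, items: list[str]) -> None:
        self.items = [Snippet(t) for t in items]

    def search(self, query: str, limit: int) -> list[Snippet]:
        words = query.lower().split()
        return [s for s in self.items if any(w in s.text.lower() for w in words)][:limit]

MEMORY_TIERS = {
    "episodic": ToyStore(["Tuesday: chose the diesel engine; rejected the alternative on cost"]),
    "semantic": ToyStore(["This user prefers concise responses"]),
    "procedural": ToyStore(["Refunds over the approval limit get escalated to a human"]),
}

def handle_recall(query: str, tier: str, limit: int = 5) -> list[str]:
    """Runs when the model invokes the tool; results go back into working memory."""
    return [s.text for s in MEMORY_TIERS[tier].search(query, limit)]

print(handle_recall("diesel engine decision", tier="episodic"))
```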
Forgetting. The most under-designed part of most memory systems. Memory that grows without bound degrades every recall: retrieval slows, noise crowds out signal, costs accumulate. Effective forgetting is selective: utility scoring based on recency, access frequency, downstream outcome, or explicit pinning. The operational pattern is consolidation plus eviction: episodes that produce stable patterns get consolidated into semantic memory; one-off episodes get evicted on a forgetting schedule. Building this in from day one is the difference between a memory system that ages well and one that becomes a landfill.
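A sketch of utility scoring over those signals. The weights, the thirty-day decay, and the threshold are assumptions to tune against a real workload, not recommended values.

```python
import math
from datetime import datetime, timedelta

EVICTION_THRESHOLD = 0.15  # assumption: evict below this on the next sweep

def utility(last_access: datetime, access_count: int,
            helped_outcome: bool, pinned: bool) -> float:
    """Retention score for one memory. Weights and decay are assumptions."""
    if pinned:
        return float("inf")  # explicit pins never expire
    age_days = (datetime.now() - last_access).total_seconds() / 86_400
    recency = math.exp(-age_days / 30)    # decays to ~0.05 after three months
    frequency = math.log1p(access_count)  # diminishing returns on repeat access
    outcome = 1.0 if helped_outcome else 0.0
    return 0.5 * recency + 0.3 * frequency + 0.2 * outcome

stale = utility(datetime.now() - timedelta(days=120),
                access_count=0, helped_outcome=False, pinned=False)
print(stale < EVICTION_THRESHOLD)  # True: old, unused, no payoff; evict
```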
Write-time decisions. Every agent turn is a potential write. What gets stored, where, and at what fidelity is a decision per turn — and most failed memory systems make it once, badly. Storing every turn verbatim is expensive and noisy. Storing only summaries loses fidelity. The pattern that works: store the raw event in episodic memory cheaply, summarize asynchronously into semantic memory on a schedule, and let the consolidation layer decide what gets promoted to long-lived storage. The cheap append happens in the agent loop; the expensive judgment belongs to a background process.
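A minimal shape for that split: the agent loop does a cheap synchronous append, and a background worker batches, summarizes, and promotes on its own schedule. The batch size and the promote step are placeholders.

```python
import queue
import threading

raw_events: "queue.Queue[dict]" = queue.Queue()

def record(event: dict) -> None:
    """Called from the agent loop: cheap append, no judgment at write time."""
    raw_events.put(event)

def promote(batch: list[dict]) -> None:
    """Placeholder: a real system would summarize the batch (often with an LLM
    call) and write the distilled facts into the semantic tier."""
    print(f"promoting {len(batch)} events")

def consolidator(stop: threading.Event, batch_size: int = 20) -> None:
    """Background worker: the expensive write-time decisions live here."""
    batch: list[dict] = []
    while not stop.is_set() or not raw_events.empty():
        try:
            batch.append(raw_events.get(timeout=1.0))
        except queue.Empty:
            continue
        if len(batch) >= batch_size:
            promote(batch)
            batch.clear()
    if batch:
        promote(batch)  # flush the tail on shutdown
```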
The security model nobody designs
The architecture sections above are the engineering problem. Security is the problem most teams will not touch until something goes wrong.
A long-lived agent memory is, by definition, a persistent store of facts about users, sessions, decisions, and operational state. It carries the same data-protection obligations as any other production database — yet most agent-memory implementations were built as quick experiments, and those obligations are painful to layer in retroactively.
Four concrete decisions worth making now:
- Tenant isolation at the storage layer, not just the prompt layer. Two users’ memories must not be retrievable across the boundary. Relying on the agent’s prompt to “stay in the right user’s context” is the same anti-pattern as relying on application logic to enforce database authorization. Isolation belongs in storage, with the row-level security or per-tenant index discipline you would use for any multi-tenant data store (see the sketch after this list).
- Memory expiry tied to the underlying data lifecycle. If user data is deleted from primary systems, the agent’s memory of that data must be deleted too. GDPR right-to-erasure obligations apply to derived stores. Build the expiry path early — much harder to bolt on after the fact.
- Prompt-injection-resistant memory writes. A user who can influence what gets stored in long-term memory can influence the agent’s future behavior across sessions — including other users’ sessions if isolation is weak. Memory writes need to be sanitized for prompt-injection content the way any external input is, and content should ideally be reviewed before being elevated to higher tiers of memory.
- Audit-grade retrieval logs. Every memory retrieval should be logged with enough context to investigate later: which agent, which user, which query, which result. When something goes wrong, “the agent remembered something it should not have” is unanswerable without retrieval logs.
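A sketch of the first and last items on a Postgres substrate, reusing the episodes table from the storage section. Row-level security makes the database enforce the tenant boundary even when the application or the prompt gets it wrong. The setting name, the retrieval_log table, and the policy are illustrative assumptions.

```python
import psycopg

# Run once at migration time. FORCE applies the policy even to the table owner.
RLS_SETUP = """
ALTER TABLE episodes ENABLE ROW LEVEL SECURITY;
ALTER TABLE episodes FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON episodes
    USING (tenant_id = current_setting('app.tenant_id', true))
"""

def retrieve_for_tenant(conn: psycopg.Connection, tenant: str, agent_id: str,
                        sql: str, params: tuple = ()) -> list:
    """Every retrieval pins the tenant first and leaves an audit row behind."""
    # set_config with is_local=true scopes the setting to the current transaction.
    conn.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant,))
    rows = conn.execute(sql, params).fetchall()
    conn.execute(
        "INSERT INTO retrieval_log (agent_id, tenant_id, query, result_count) "
        "VALUES (%s, %s, %s, %s)",
        (agent_id, tenant, sql, len(rows)),
    )
    return rows
```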
The teams that will look smart on agent memory in eighteen months are not the ones who picked the most novel framework. They are the ones who recognized early that memory is a data system — with all the implications that brings — and built the infrastructure the frameworks themselves never quite provide: tier discipline, storage routing, forgetting policy, tenant isolation, audit logs. The four-tier model is the conceptual lens. The engineering work is what comes after.