
The Dreaming Pattern: When AI Agents Learn From Their Own Sessions

Anthropic's Dreaming lets agents learn from their own sessions. A taxonomy of session-replay learning patterns, where each one pays off, and the eval methodology that proves the gains are real.

The agents are working all day. The interesting question is what they should do at night.

On May 6, 2026, at the Code with Claude conference in San Francisco, Anthropic announced a feature called Dreaming as part of Claude Managed Agents. The mechanism is straightforward to describe and easy to underestimate. Between sessions, a scheduled background process reviews the agent’s past work — successes, failures, tool calls, retrieved memories, corrections — and curates the agent’s memory store. Patterns get extracted. Frequently-retrieved memories get reinforced. Stale or contradicted memories get pruned. Workflows that multiple agents independently converged on get promoted into shared learnings. The agent wakes up — to use the metaphor Anthropic deliberately invokes — slightly better at its job than it was the day before.

The early numbers Anthropic disclosed are large enough to be worth examining and vendor-sourced enough to warrant scrutiny. Harvey, the legal AI company, reportedly saw task completion rates rise roughly sixfold after implementing Dreaming. Wisedocs cut document-review time by half using Anthropic’s separately-launched Outcomes feature. These are vendor-reported numbers, not external benchmarks, but they are large enough that the underlying mechanism — whatever exactly it is doing — deserves serious attention from anyone building production agents.

The deeper question is whether Dreaming is genuinely a new technique or whether it is a productized version of patterns that have been circulating in the research literature for two years. The honest answer: both. The technique itself is part of a broader family of approaches that this post will taxonomize. Anthropic’s contribution is to package one specific approach as a first-class platform primitive and to demonstrate that it can be operated reliably at scale. The rest of the family — offline distillation, in-context retrieval, fine-tuning on traces, prompt-pattern extraction — has been studied in research papers but rarely deployed in production. The Dreaming launch is the moment session-replay learning becomes operationally mainstream. This post is the working engineer’s map.

What Dreaming actually is

Strip away the metaphor and the mechanism is precise. Dreaming is a scheduled process that operates on three inputs: the agent’s session transcripts, its accumulated memory store, and metadata about task outcomes (which calls succeeded, which were corrected, which tools failed, which memories were retrieved and used). The process runs between sessions — explicitly not during them — and produces curated updates to the memory store. The updates can land automatically or be gated by human review.

Three things make this different from conventional agent memory. First, it operates across sessions rather than within one. Most agent memory systems retain what happened in the current run; Dreaming consolidates patterns across many runs. Second, it operates across agents. Workflows that several specialist agents independently converged on get promoted into shared memory, available to every agent. Third, it actively restructures memory rather than just appending to it. Stale entries get pruned. Contradictions get resolved. High-signal entries get reinforced through repetition.
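
Anthropic has not published Dreaming’s internals, so the sketch below is a reconstruction from the description above, not the real implementation: a between-sessions job that consumes memory entries derived from transcripts plus outcome metadata, prunes contradicted or dead entries, reinforces frequently retrieved ones, and promotes workflows multiple agents converged on into shared memory. Every name in it (MemoryEntry, curate, the thresholds) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    text: str
    agent_id: str
    session_id: str
    retrieval_count: int = 0    # how often runtime retrieval surfaced this entry
    contradicted: bool = False  # flagged when a later session corrected it
    weight: float = 1.0         # retrieval priority

def review_gate(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    # Stand-in for the optional human-review gate; a real deployment
    # would queue these for approval before they land.
    return entries

def curate(memory: list[MemoryEntry],
           session_succeeded: dict[str, bool],
           min_agents_for_promotion: int = 3,
           gate_with_review: bool = True) -> list[MemoryEntry]:
    """Runs between sessions, never during them: prune, reinforce, promote."""
    kept: list[MemoryEntry] = []
    for entry in memory:
        # Prune contradicted entries and never-retrieved entries
        # from sessions that failed.
        if entry.contradicted:
            continue
        if not session_succeeded.get(entry.session_id, True) and entry.retrieval_count == 0:
            continue
        # Reinforce entries that runtime retrieval keeps reaching for.
        if entry.retrieval_count > 5:
            entry.weight *= 1.5
        kept.append(entry)

    # Promote workflows that several agents independently converged on
    # into shared memory, visible to the whole fleet.
    agents_by_text: dict[str, set[str]] = {}
    for entry in kept:
        agents_by_text.setdefault(entry.text, set()).add(entry.agent_id)
    for text, agents in agents_by_text.items():
        if len(agents) >= min_agents_for_promotion:
            kept.append(MemoryEntry(text=text, agent_id="shared",
                                    session_id="curated", weight=2.0))

    return review_gate(kept) if gate_with_review else kept
```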

The architecture maps onto a standard layered view of agent memory. In-context memory holds the current conversation. External memory holds persistent information retrieved at runtime. In-weights memory is what the model learned during training. Dreaming sits on top of external memory as a maintenance layer — a process that periodically reorganizes, updates, and improves what gets stored there. It does not touch the model weights.

This last point matters because it constrains what Dreaming can and cannot do. It can reliably accumulate domain-specific knowledge that retrieves into context (workflow patterns, error recovery strategies, project-specific conventions, tool quirks). It cannot give the model new fundamental capabilities. An agent that cannot reason well about a domain will not learn to reason well about it through Dreaming alone; an agent that already reasons competently and just lacks specific knowledge will improve quickly.

The taxonomy: four session-replay learning patterns

Dreaming is one of four broad approaches to using session data to make agents better. The patterns differ in where the learning is encoded, how expensive each update is, and how the improvement composes with other techniques.

| Pattern | Where learning lives | Update cost | Strength | Weakness |
| --- | --- | --- | --- | --- |
| Offline distillation | Model weights (student LLM trained on traces) | High (training run) | Permanent capability gain; cheap at inference | Static once trained; needs lots of clean traces |
| In-context retrieval | External memory; retrieved at runtime (ExpeL-style) | Low (just indexing) | Updates instantly; transparent | Context-window cost; quality varies with retrieval |
| Memory curation | External memory; restructured between sessions | Medium (scheduled job) | Cross-agent learnings; pruning of stale data | Quality depends on rubric; can drift over time |
| Prompt-pattern extraction | Prompt templates; extracted from failures | Low to medium | Most interpretable; auditable | Doesn’t capture nuanced reasoning |

The pattern Anthropic shipped as Dreaming is memory curation, with a specific implementation choice (scheduled background job, with human-review gating optional). The pattern Reflexion popularized in research is in-context retrieval (retrieve past failures and successes at runtime). The pattern in AgentArk and similar academic work is offline distillation (train a student model on filtered teacher trajectories). The pattern Anthropic’s separate Outcomes feature implements is closer to prompt-pattern extraction (a grader agent reviews each output against a rubric and feeds corrections into the next run).

Most production systems will want at least two of these in combination. The choice of which two depends on the workflow.

Where each pattern pays off

The four patterns have different fit zones. Choosing the wrong one for the workload is the most common reason teams try session-replay learning and walk away unimpressed.

Offline distillation pays off with high volume, stable workflows, and tight latency or cost constraints. The fine-tuned student model is permanently better at the specific task and cheap to run. The cost: retraining is operationally heavy — you cannot iterate in hours, only days or weeks — and you need thousands of high-quality trajectories. The classic example is a customer-support agent on a stable product surface.
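
As a concrete illustration of the data-preparation half of distillation, here is a minimal sketch of trace filtering: only successful, uncorrected trajectories become chat-format JSONL for a fine-tuning run. The Trajectory fields and the filtering rule are illustrative, not drawn from any published pipeline.

```python
import json
from dataclasses import dataclass

@dataclass
class Trajectory:
    messages: list[dict]   # [{"role": ..., "content": ...}, ...]
    succeeded: bool
    was_corrected: bool    # a human had to fix the output

def export_distillation_set(traces: list[Trajectory], path: str) -> int:
    kept = 0
    with open(path, "w") as f:
        for t in traces:
            # Distillation quality lives and dies on trace filtering:
            # only clean, successful teacher runs become training data.
            if t.succeeded and not t.was_corrected:
                f.write(json.dumps({"messages": t.messages}) + "\n")
                kept += 1
    return kept
```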

In-context retrieval pays off when the task surface is diverse and the team needs to iterate fast. Add a new pattern to the retrieval index and the agent uses it immediately on the next call. The cost is context-window tokens (each retrieved trace consumes input tokens) and retrieval quality (irrelevant traces actively hurt). The classic example is a coding agent on a varied codebase where relevant prior experience changes per task.
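
A minimal sketch of the runtime half, assuming a plain string store and a toy lexical-overlap scorer standing in for real embeddings. The score floor exists because, as noted, irrelevant traces actively hurt; the names and thresholds are illustrative.

```python
def score(task: str, trace: str) -> float:
    # Placeholder relevance score; production systems would use embeddings.
    task_words, trace_words = set(task.lower().split()), set(trace.lower().split())
    return len(task_words & trace_words) / (len(task_words) or 1)

def build_prompt(task: str, trace_store: list[str], k: int = 3,
                 min_score: float = 0.2) -> str:
    # Filter aggressively: a score floor matters as much as the top-k cutoff.
    ranked = sorted(trace_store, key=lambda tr: score(task, tr), reverse=True)
    relevant = [tr for tr in ranked[:k] if score(task, tr) >= min_score]
    experience = "\n\n".join(f"Past experience:\n{tr}" for tr in relevant)
    return f"{experience}\n\nTask:\n{task}" if experience else f"Task:\n{task}"
```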

Memory curation — the Dreaming pattern — pays off when agents run continuously, learnings should be shared across multiple agents in a fleet, and the tasks evolve gradually rather than abruptly. The cost is that the curation logic is meaningful engineering and curated memory can drift if the rubric is wrong. The disclosed examples (Harvey on legal drafting, the drone-landing demonstration) share characteristics of long-running, repeated execution where the agent encounters the same kinds of edge cases over and over.

Prompt-pattern extraction pays off when you need interpretability and auditability more than peak performance. Each rule the agent follows is explicit, human-readable, and reviewable. The cost: explicit rules are coarser than learned representations; they handle common cases well and miss nuanced ones. The fit zone is regulated industries — compliance, healthcare, finance — where every rule the agent follows needs to be defensible.
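
A sketch of the extraction loop, with extract_rule stubbed where a real system would use an LLM pass plus a human reviewer; every name here is hypothetical. The point is the shape: each rule is a string a compliance reviewer can read, approve, and later audit.

```python
from dataclasses import dataclass

@dataclass
class Failure:
    what_happened: str
    correction: str

def extract_rule(failure: Failure) -> str:
    # In practice an LLM would summarize the failure into an imperative
    # rule; this stub just templates it so the sketch runs.
    return f"When you see '{failure.what_happened}', do this instead: {failure.correction}"

def updated_system_prompt(base_prompt: str, failures: list[Failure],
                          approved: set[str]) -> str:
    # Only human-approved rules land: this is where the auditability
    # advantage over learned representations comes from.
    rules = [r for f in failures if (r := extract_rule(f)) in approved]
    if not rules:
        return base_prompt
    return (base_prompt + "\n\nRules learned from past failures:\n"
            + "\n".join(f"- {r}" for r in rules))
```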

The combinations that work in practice: memory curation plus in-context retrieval (Dreaming maintains a high-quality store, retrieval serves it at runtime — essentially Anthropic’s full stack); prompt-pattern extraction plus offline distillation (extract rules from failures, bake the highest-leverage rules into a fine-tune); memory curation plus prompt-pattern extraction (let memory accumulate patterns, manually promote the best ones into explicit rules). The combinations that do not work as well: stacking all four (each layer dilutes the others); offline distillation without retrieval (the model gets locked in and stale).

Eval methodology: proving it isn’t snake oil

The “agent improved by 6x” claim has roughly the same epistemic status as a startup pitch deck unless it is paired with a real evaluation framework. Anthropic has not published a benchmark for Dreaming, and Harvey’s number is reported by one customer about one workload. The honest evaluation methodology for any session-replay learning system has four components.

First, separate held-out tasks. Tasks the agent has not seen and that will not be added to memory during evaluation. Without this, you are measuring memorization, not learning. The cleanest pattern is to designate a portion of incoming work as evaluation-only — the agent runs it, memory is not updated from it, and the results are measured against ground truth.
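
A sketch of that routing, assuming each task carries a stable ID; hash-based assignment keeps the split deterministic across reruns, unlike random sampling. The 10% fraction and all function names are illustrative.

```python
import hashlib

def is_eval_task(task_id: str, eval_fraction: float = 0.10) -> bool:
    # Stable hash-based split: the same task always lands in the same arm.
    digest = int(hashlib.sha256(task_id.encode()).hexdigest(), 16)
    return (digest % 100) < eval_fraction * 100

def handle(task_id: str, task: str, run_agent, write_memory, record_eval):
    result = run_agent(task)
    if is_eval_task(task_id):
        record_eval(task_id, result)   # scored against ground truth, never stored
    else:
        write_memory(result)           # only non-eval work updates memory
    return result
```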

Second, baseline against the same model without curation. If the underlying model improved during the evaluation window (Claude Opus 4.7 replacing 4.6, for instance), the improvement attributed to Dreaming is contaminated. The defensible comparison is the same model, same prompt, same tool set, with curated memory versus without. Anything else conflates Dreaming with capability drift.
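
In code, the comparison is small; the discipline is in holding everything else fixed. Here run_task is a hypothetical callable that executes one held-out task and returns True on success, with a flag (illustrative) toggling curated memory.

```python
def curation_lift(tasks: list[str], run_task) -> dict[str, float]:
    # Same model, same prompt, same tools; only the memory arm differs.
    with_mem = sum(run_task(t, use_curated_memory=True) for t in tasks)
    without = sum(run_task(t, use_curated_memory=False) for t in tasks)
    n = len(tasks)
    return {
        "success_with_memory": with_mem / n,
        "success_without_memory": without / n,
        "lift": (with_mem - without) / n,  # the only delta safely attributable
    }
```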

Third, track drift over time. Memory curation is a process; processes have failure modes. Memory entries can accumulate that are wrong, get reinforced because they are retrieved frequently, and gradually degrade performance. The metric to watch is task success on a stable held-out set, plotted over weeks. If it climbs, then plateaus, then dips, the curation has drifted and the memory needs auditing. If it climbs and stays climbed, the curation is working.
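
A sketch of that tracking, with a deliberately naive climb-then-dip check; the week granularity and dip threshold are illustrative choices, not a published heuristic.

```python
def weekly_success_rates(results: list[tuple[int, bool]]) -> list[float]:
    # results: (week_index, task_succeeded) pairs from the stable held-out set
    by_week: dict[int, list[bool]] = {}
    for week, ok in results:
        by_week.setdefault(week, []).append(ok)
    return [sum(v) / len(v) for _, v in sorted(by_week.items())]

def has_drifted(rates: list[float], dip_threshold: float = 0.05) -> bool:
    # Drift signature: the curve climbed to a peak and has since dipped
    # below that peak by more than the threshold.
    if len(rates) < 3:
        return False
    peak = max(rates[:-1])
    return rates[-1] < peak - dip_threshold
```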

Fourth, break the task success metric into cost and quality. A more capable agent that consumes ten times the tokens is not necessarily a better agent. Track tokens per successful task, latency per successful task, and tool-call count per successful task alongside the success rate. Dreaming should improve at least one of cost or quality; if it improves neither, the system is theatre.
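
A sketch of cost-normalized metrics, with everything amortized over successful tasks so a memory layer that inflates token spend shows up immediately; the RunStats fields are illustrative.

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    succeeded: bool
    tokens: int
    latency_s: float
    tool_calls: int

def per_success_metrics(runs: list[RunStats]) -> dict[str, float]:
    wins = [r for r in runs if r.succeeded]
    if not wins:
        return {"success_rate": 0.0}
    return {
        "success_rate": len(wins) / len(runs),
        # Total spend, including failed runs, amortized over successes:
        # failures cost tokens and latency too.
        "tokens_per_success": sum(r.tokens for r in runs) / len(wins),
        "latency_per_success": sum(r.latency_s for r in runs) / len(wins),
        "tool_calls_per_success": sum(r.tool_calls for r in runs) / len(wins),
    }
```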

Used together with Dreaming, the Outcomes feature Anthropic shipped alongside it functions partly as a built-in evaluation harness. Outcomes scores each agent run against a rubric; that score becomes the signal Dreaming uses to decide which memories to reinforce. Teams that deploy Dreaming without an Outcomes-style rubric are flying blind on both signal and validation. Teams that deploy both get a continuous loop where the grader catches drift in the memory and the memory accumulates the patterns the grader rewards.
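
A sketch of that loop in miniature, with the rubric reduced to a lexical checklist. A real grader is itself a model pass scoring the output against each criterion; every name and threshold here is illustrative.

```python
def grade(output: str, rubric: list[str]) -> float:
    # Placeholder lexical check standing in for an LLM grader pass.
    hits = sum(1 for criterion in rubric if criterion.lower() in output.lower())
    return hits / max(len(rubric), 1)

def close_the_loop(session_outputs: dict[str, str], rubric: list[str],
                   pass_threshold: float = 0.8) -> dict[str, bool]:
    # session_id -> reinforcement signal the curation job consumes:
    # True means "reinforce the memories this run retrieved".
    return {sid: grade(out, rubric) >= pass_threshold
            for sid, out in session_outputs.items()}
```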

Privacy and security gotchas

A self-improving agent that learns from production traces is a data-flow architecture, and data-flow architectures have specific failure modes that the marketing language tends to gloss over.

The first is the cross-customer leakage risk. If an agent fleet serves multiple customers and Dreaming pulls patterns across sessions, the patterns derived from Customer A’s data are now in the memory the agent uses when serving Customer B. The mitigation is strict tenant scoping at the memory layer — Dreaming runs per-tenant, memory stores are per-tenant, the cross-agent learning is across the same tenant’s agents, not across all tenants’ agents. Most platforms get this right when asked specifically; few get it right by default.
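
The scoping property is easy to state in code. A minimal sketch, assuming an in-memory store and a curator callable; the class and method names are illustrative, and the invariant is the absence of any cross-tenant read path.

```python
class TenantScopedMemory:
    def __init__(self) -> None:
        self._stores: dict[str, list[str]] = {}   # tenant_id -> entries

    def write(self, tenant_id: str, entry: str) -> None:
        self._stores.setdefault(tenant_id, []).append(entry)

    def read(self, tenant_id: str) -> list[str]:
        # No cross-tenant fallback, ever: an unknown tenant gets an
        # empty store, not a shared default.
        return list(self._stores.get(tenant_id, []))

    def curate(self, tenant_id: str, curator) -> None:
        # Dreaming runs per tenant; cross-agent promotion happens only
        # inside one tenant's slice of the fleet.
        self._stores[tenant_id] = curator(self.read(tenant_id))
```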

The second is sensitive-data accumulation. The traces from production contain whatever production data the agent processed. Names, addresses, financial details, medical information. The memory store now contains those. The retention, encryption, deletion, and access-control posture for that memory store needs to match the posture for the primary data store. Treat the memory as production data. It is.

The third is adversarial memory poisoning. If an attacker can put text in front of the agent, that text can end up in the memory if it survives Dreaming’s curation. The mitigation overlaps with the prompt-injection defense playbook from the broader agentic-security literature — memories sourced from untrusted content need provenance metadata, and the curation step needs to weight memories from trusted sources higher.
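
A sketch of provenance-aware weighting, where every memory carries source metadata and entries derived from untrusted content are down-weighted before they can ever be reinforced. The trust tiers and field names are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical trust tiers; a real deployment would define its own.
TRUST = {"operator": 1.0, "internal_doc": 0.8, "web_content": 0.3, "user_upload": 0.3}

@dataclass
class ProvenancedMemory:
    text: str
    source: str       # where the content originally entered the agent's context
    weight: float = 1.0

def apply_provenance_weighting(entries: list[ProvenancedMemory],
                               floor: float = 0.0) -> list[ProvenancedMemory]:
    weighted = []
    for e in entries:
        e.weight *= TRUST.get(e.source, 0.1)   # unknown sources get least trust
        if e.weight > floor:
            weighted.append(e)
    return weighted
```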

The fourth is audit trail completeness. A regulated organization needs to be able to answer “why did the agent do that” for any agent action. Dreaming-curated memory complicates this because the memory was constructed by a process, not by an identifiable human. The mitigation is to retain the unredacted session transcripts that the curation operated on, along with the curation decisions, so the audit trail can reconstruct why a particular memory exists.
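
One shape that retention can take, sketched with illustrative fields: each curation decision keeps pointers to the transcripts it operated on and its stated rationale, so "why does this memory exist" is always answerable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CurationAuditRecord:
    memory_id: str
    action: str                      # "created" | "reinforced" | "pruned" | "promoted"
    source_session_ids: list[str]    # transcripts the curation pass operated on
    rationale: str                   # the curation job's stated reason
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Retention rule: audit records and the transcripts they reference must
# outlive the memories they explain, or the trail breaks when it is needed.
```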

Whether Dreaming becomes a foundational pattern or a clever feature that competitors clone within six months will depend less on the underlying technique and more on whether teams build the eval and governance scaffolding around it. The technique itself is real and useful. The risk is that teams adopt it for the productivity story without the rigor — running an agent fleet that “learns from experience” with no way to audit what it learned, no way to detect drift, no way to prove the improvement is real. That is the version of session-replay learning that produces a 6x number on one customer’s slide deck and a quiet regression on everyone else’s production traffic six months later. The teams that get this right will treat Dreaming as the start of a workflow, not the end of one. Memory curation runs. Outcomes grades the curated agent. Held-out evaluation catches drift. Audit logs make the agent’s behavior reconstructible. Sensitive data stays scoped. The agent gets better. The system you operate is one you can still defend a year from now. That is the actual win the new pattern offers, and it is bigger than any vendor-reported multiplier.