Multi-Agent Orchestration Patterns: The Architecture Reference You're Missing
A vendor-neutral catalog of multi-agent orchestration patterns. When each applies, where each breaks, and the identity question most architects skip
TL;DR
Most teams build their second multi-agent system from scratch because they did not name the first one.
The “single agent does everything” architecture survived 2024. By the second half of 2025 it had run out of road for any task requiring reasoning across more than a few thousand effective tokens, parallel exploration of multiple hypotheses, or careful separation of generation from verification. Anthropic published a detailed account of how it built Claude’s Research feature using an orchestrator-worker pattern, with internal evaluations showing multi-agent systems substantially outperforming single-agent ones for breadth-heavy tasks. The era of agentic systems being “one model with tools” was over the moment a meaningful share of production traffic stopped fitting in a single context window.
Multi-agent systems answer the scaling problem. They also introduce a new one: coordination. Every system architect who has built one has confronted the question of how the agents should be arranged, and the field has spent the past year converging on a small set of recognizable patterns. The patterns are not new in computer science — they are recycled from operating systems, distributed databases, and classical AI literature — but the way they apply to LLM-based agents is novel enough that teams keep reinventing them under different names.
This post is the missing reference. The four canonical patterns, when each applies, the failure modes for each, and the identity-and-auth question that almost every architectural discussion currently skips.
Why the patterns are converging now
Three forces are pushing the multi-agent design space toward a shared vocabulary.
The first is that frontier model context windows are not infinite, and effective context (the part of the window the model can actually use well) is significantly smaller than the nominal limit. For tasks that need to consider more material than fits in effective context, multi-agent decomposition — spawning subagents with their own context windows — is the only way to scale. This is the load-bearing observation in Anthropic’s research-system post: the system used the parallel context of multiple subagents to think across a wider information surface than any single context could hold.
The second is that production reliability needs the discipline of separating generation from verification. A single agent that generates and self-checks is significantly worse than two agents doing those jobs independently. This is the structural argument for the coordinator/implementor/verifier pattern, which now appears in most well-built systems whether the team calls it that or not.
The third is that the open protocols have matured. The Agent-to-Agent (A2A) protocol, now under Linux Foundation governance, is the de facto wire format for cross-vendor agent communication. The Model Context Protocol (MCP) is the de facto interface for tool calls. Both are stable enough that architectural decisions can be made in terms of patterns rather than vendor-specific implementations.
The result is that the language is converging even where implementations are not. The four patterns below are recognizable across LangGraph, AutoGen, CrewAI, Bedrock AgentCore, and Kiro’s agent system, even though each framework names them slightly differently.
The four core patterns
Supervisor-Worker (orchestrator-worker)
A single supervisor agent receives the goal, decomposes it into subtasks, delegates each subtask to a worker, and synthesizes results. Workers do not communicate with each other; all coordination flows through the supervisor. This is the pattern Anthropic uses for Claude’s Research feature, and the default starting point for most multi-agent systems.
When to use. Tasks that decompose cleanly into parallel sub-problems with low coupling. Breadth-heavy research, multi-source extraction, anything where the natural shape is “fan out, do the work, fan in.” The supervisor can be a stronger model coordinating cheaper workers — the cost-quality story is usually favorable.
Failure modes. The supervisor is a bottleneck; every interaction routes through it. Token cost is high because the supervisor’s context grows as synthesis progresses. Workers can drift from the supervisor’s intent without realizing it. Subagents spawned for trivial subtasks waste tokens — a well-tuned supervisor needs explicit “do not spawn below this complexity” guidance.
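A minimal sketch of the shape, not Anthropic's implementation: `call_model` is a hypothetical stand-in for whatever LLM client you use, and the prompts are illustrative. Note that each worker receives the overall goal alongside its subtask, which matters for the drift problem discussed later.

```python
import concurrent.futures

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for one LLM call via your client of choice."""
    raise NotImplementedError

def supervisor(goal: str, max_workers: int = 4) -> str:
    # 1. Decompose: the supervisor plans the subtasks, one per line.
    plan = call_model(
        f"Goal: {goal}\n"
        f"Break this into at most {max_workers} independent subtasks, one per line. "
        "Do not create a subtask for anything trivial enough to answer directly."
    )
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Fan out: each worker runs in its own context window. Workers never
    #    talk to each other; all coordination flows back through the supervisor.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(
            lambda task: call_model(f"Overall goal: {goal}\nYour subtask: {task}"),
            subtasks,
        ))

    # 3. Fan in: synthesis happens in the supervisor's context, which is why
    #    its token cost grows as the task progresses.
    report = "\n\n".join(f"Subtask: {t}\nResult: {r}" for t, r in zip(subtasks, results))
    return call_model(f"Goal: {goal}\nWorker results:\n{report}\nSynthesize a final answer.")
```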
Coordinator/Implementor/Verifier (CIV)
Three roles, sometimes filled by three agents and sometimes by three passes of the same one: the coordinator plans, the implementor produces, and the verifier checks against an explicit specification. The output is accepted only when the verifier passes it; otherwise it returns to the implementor with the verifier’s reasoning attached.
When to use. High-stakes outputs where errors are costly and detection is hard: code with security implications, legal-adjacent text, anything that will be acted on automatically. Also a strong default for systems without good evals yet — the verifier substitutes for an eval suite while you build one.
Failure modes. Verifier and implementor can collude when they are the same model; the verifier sees the implementor’s reasoning and tends to rationalize agreement. Using a different model for verification is the cheap fix. Verifier loops can run unbounded; a maximum iteration count is mandatory. The pattern roughly doubles latency and cost; reserve it for calls where the cost is justified.
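A minimal sketch of the loop, reusing the same hypothetical `call_model` stand-in; the `model` parameter is an assumption here, standing in for routing verification to a different model than implementation.

```python
def call_model(prompt: str, model: str = "implementor") -> str:
    """Hypothetical stand-in for one LLM call; `model` selects which model serves it."""
    raise NotImplementedError

def civ(task: str, spec: str, max_iterations: int = 3) -> str:
    # Coordinator: plan against an explicit specification.
    plan = call_model(f"Task: {task}\nSpec: {spec}\nProduce a step-by-step plan.")
    feedback = ""
    for _ in range(max_iterations):  # mandatory bound: verifier loops must not run unbounded
        # Implementor: produce, with the verifier's prior reasoning attached.
        draft = call_model(
            f"Plan: {plan}\nSpec: {spec}\n"
            f"Prior verifier feedback: {feedback or 'none'}\nProduce the output."
        )
        # Verifier: a *different* model, checking the output against the spec
        # without access to the implementor's reasoning.
        verdict = call_model(
            f"Spec: {spec}\nCandidate output: {draft}\n"
            "Reply PASS if the output meets the spec, otherwise FAIL plus your reasoning.",
            model="verifier",
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
        feedback = verdict
    raise RuntimeError(f"verifier rejected all {max_iterations} attempts: {feedback}")
```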
Blackboard
A classical AI pattern, revived. A shared workspace (the “blackboard”) holds the current state of the problem. Multiple specialist agents read from it, contribute their piece, and write back. Coordination is implicit — agents act when they detect a state that fits their specialty.
When to use. Problems where decomposition is not known in advance, different specialists are useful at different points, and the order of operations depends on intermediate findings. Investigation-style tasks (security triage, debugging across systems, complex troubleshooting) fit well.
Failure modes. Coordination is implicit and hard to debug. Without strict rules about who can write what, the blackboard becomes a contested resource. Race conditions and write-conflict patterns from distributed systems reappear, often without the team realizing it. Most production failures of this pattern are observability failures — nobody can answer “why did the system do that?” after the fact.
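A minimal sketch of the control loop, assuming each specialist exposes a trigger predicate ("does the current state fit my specialty?") and an action. The names are illustrative and the conflict-resolution policy is deliberately naive; the single write path through `post` is the observability hook the paragraph above calls for.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Blackboard:
    """Shared problem state. All writes go through post() so every change is logged."""
    facts: dict[str, str] = field(default_factory=dict)
    log: list[tuple[str, str, str]] = field(default_factory=list)  # (agent, key, value)

    def post(self, agent: str, key: str, value: str) -> None:
        self.facts[key] = value
        self.log.append((agent, key, value))  # the trail that answers "why did it do that?"

@dataclass
class Specialist:
    name: str
    triggers: Callable[[Blackboard], bool]        # does the current state fit my specialty?
    act: Callable[[Blackboard], tuple[str, str]]  # returns a (key, value) to post

def run(board: Blackboard, specialists: list[Specialist], max_steps: int = 20) -> Blackboard:
    for _ in range(max_steps):
        ready = [s for s in specialists if s.triggers(board)]
        if not ready:
            break  # no specialist recognizes the state: the problem is done, or stuck
        agent = ready[0]  # a real system needs an explicit conflict-resolution policy here
        key, value = agent.act(board)
        board.post(agent.name, key, value)
    return board
```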
Pipeline
A fixed sequence of agents, each consuming the output of the previous and producing input for the next. The structure is known in advance; routing is deterministic. A UNIX pipeline of LLM calls.
When to use. Well-understood workflows where the shape of the work is stable: ETL-like extraction and transformation, content generation with predictable stages, structured pipelines that benefit from specialization at each step.
Failure modes. Inflexible. When work does not fit the predetermined shape, the pipeline fails badly rather than gracefully. Errors compound; a small misinterpretation at step two corrupts every subsequent step. The right test: do the vast majority of your tasks have the same shape? If not, prefer supervisor-worker.
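The whole pattern fits in a few lines. The stages below are plain string functions standing in for what would each be a specialized LLM call in practice.

```python
from typing import Callable

Stage = Callable[[str], str]

def pipeline(stages: list[Stage], document: str) -> str:
    """A UNIX pipeline of LLM calls: each stage consumes the previous stage's
    output. The structure is fixed before any call is made; routing is deterministic."""
    for stage in stages:
        document = stage(document)
    return document

# Stand-ins for extract -> normalize -> summarize agents, always in that order.
stages = [
    lambda text: text.strip(),
    lambda text: text.lower(),
    lambda text: text[:200],
]
print(pipeline(stages, "  Raw Input Document  "))  # -> "raw input document"
```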
Choosing: a decision lens
The pattern catalog matters less than the lens used to pick among them. The lens that has held up (a rough code sketch follows the list):
- Coupling between subtasks. Low coupling, parallel decomposition possible → supervisor-worker. High coupling, sequential dependencies → pipeline. Coupling discovered at runtime → blackboard.
- Stakes of the output. High-stakes, hard-to-detect errors → CIV layered on whatever else you use. Low-stakes, easy-to-detect errors → skip the verifier; let the system run faster.
- Predictability of the workflow. Known in advance → pipeline or supervisor-worker. Discovered during execution → blackboard or supervisor-worker with a flexible plan.
- Latency budget. Tight → pipeline (deterministic, parallelizable). Loose → CIV, or supervisor-worker with multiple iterations.
- Cost ceiling. Strict → pipeline is cheapest, supervisor-worker is middle, CIV is most expensive. Multi-agent systems typically consume substantially more tokens than single-agent equivalents — Anthropic noted their research system used roughly an order of magnitude more tokens than a chat interaction, which is the right ballpark to plan around.
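A rough encoding of the lens as a candidate-list helper. The argument names and buckets are illustrative, and a real decision weighs the dimensions together rather than looking them up one at a time.

```python
def pick_patterns(*, coupling: str, high_stakes: bool, workflow_known: bool) -> list[str]:
    """First-cut candidates. coupling is 'low', 'high', or 'runtime' (discovered late)."""
    base = {"low": "supervisor-worker", "high": "pipeline", "runtime": "blackboard"}[coupling]
    candidates = [base]
    if base == "pipeline" and not workflow_known:
        # sequential dependencies but an unpredictable shape: pipelines fail badly here
        candidates = ["supervisor-worker with a flexible plan"]
    if high_stakes:
        candidates.append("CIV layered on the high-stakes steps")
    return candidates

print(pick_patterns(coupling="low", high_stakes=True, workflow_known=False))
# ['supervisor-worker', 'CIV layered on the high-stakes steps']
```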
Most production systems compose patterns rather than pick one. A common shape: supervisor-worker for the overall structure, CIV applied to the high-stakes steps within it, a pipeline embedded in the worker layer for routine steps, and a small blackboard for the supervisor’s working memory across iterations. Composition is the norm; pure patterns are the exception.
Failure modes specific to multi-agent systems
A few problems appear across patterns and are worth naming because they are easy to miss until they cause an incident.
Context inconsistency. Different agents working in parallel can develop divergent views of the same underlying state. The supervisor thinks an API call returned X; one worker saw it return Y because of a transient error; another saw Z because of a different retry. Reconciliation logic is mandatory. A single shared “ground truth” memory store, written through one path, prevents most variants of this bug.
Spawn amplification. A small change to the supervisor’s prompt can cause it to spawn ten subagents instead of three for the same task. Cost explosions follow. The mitigation is an explicit token-and-spawn budget at the orchestration layer, enforced regardless of what the supervisor decides.
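One way to sketch that enforcement, with hypothetical limits. The point is that `charge` runs in orchestration code, before every spawn, and the supervisor cannot talk its way around it.

```python
class SpawnBudget:
    """Hard limits enforced at the orchestration layer, outside the supervisor's control.
    The supervisor can *ask* for a subagent; the budget decides whether it gets one."""

    def __init__(self, max_spawns: int, max_tokens: int):
        self.max_spawns, self.max_tokens = max_spawns, max_tokens
        self.spawns = self.tokens = 0

    def charge(self, tokens_estimate: int) -> None:
        if self.spawns + 1 > self.max_spawns:
            raise RuntimeError(f"spawn budget exhausted ({self.max_spawns} subagents)")
        if self.tokens + tokens_estimate > self.max_tokens:
            raise RuntimeError(f"token budget exhausted ({self.max_tokens} tokens)")
        self.spawns += 1
        self.tokens += tokens_estimate

budget = SpawnBudget(max_spawns=5, max_tokens=500_000)
budget.charge(tokens_estimate=40_000)  # called before every spawn, whatever the prompt says
```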
Cascading hallucination. Worker A misinterprets the supervisor’s instructions slightly. Worker B receives A’s output as input and amplifies the misinterpretation. By the time the supervisor reconciles, the original deviation is invisible. The fix is structural: workers should receive the supervisor’s original goal alongside their subtask, so they can detect when their work is drifting.
Tool-call inversion. A worker that is supposed to consult its supervisor before a destructive action instead takes the action and then reports it. Coordination patterns must explicitly mark which tool calls require approval and route them through the supervisor — and agents must be evaluated against the pattern, not just the outcome.
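A sketch of enforcing that approval in code rather than requesting it in the prompt; `ToolRouter` and its approval callback are illustrative names, not any framework's API.

```python
from typing import Callable

class ToolRouter:
    """Routes destructive tool calls through the supervisor instead of trusting
    the worker to ask first. Approval is enforced in code, not in the prompt."""

    def __init__(self, approve: Callable[[str, dict], bool]):
        self._tools: dict[str, tuple[Callable, bool]] = {}
        self._approve = approve  # supervisor-side approval callback

    def register(self, name: str, fn: Callable, requires_approval: bool) -> None:
        self._tools[name] = (fn, requires_approval)

    def call(self, name: str, args: dict):
        fn, gated = self._tools[name]
        if gated and not self._approve(name, args):
            raise PermissionError(f"supervisor denied {name}({args})")
        return fn(**args)

router = ToolRouter(approve=lambda name, args: False)  # deny-all placeholder policy
router.register("delete_table", fn=lambda table: f"dropped {table}", requires_approval=True)
# router.call("delete_table", {"table": "users"})  -> PermissionError, by construction
```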
Debug opacity. A bug that surfaces once across five subagents, at indeterminate timing and in indeterminate order, is an order of magnitude harder to investigate than the equivalent bug in a single-agent system. The infrastructure that makes this tractable is trace-based observability: every spawn, every tool call, every message between agents recorded with enough context to replay later. Without it, multi-agent systems become unreviewable.
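A minimal sketch of the recording half as JSON-lines events. The field names are illustrative, and a production system would feed a real tracing backend rather than a local file.

```python
import json
import time
import uuid

class Trace:
    """Append-only record of every spawn, tool call, and inter-agent message,
    with enough context to replay the run later."""

    def __init__(self, path: str = "run_trace.jsonl"):
        self.path, self.run_id = path, str(uuid.uuid4())

    def record(self, event: str, agent: str, **payload) -> None:
        entry = {"run": self.run_id, "ts": time.time(),
                 "event": event, "agent": agent, **payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

trace = Trace()
trace.record("spawn", agent="supervisor", child="worker-1", subtask="extract pricing data")
trace.record("tool_call", agent="worker-1", tool="web_search", args={"q": "vendor pricing"})
trace.record("message", agent="worker-1", to="supervisor", content="subtask complete")
```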
The identity question nobody plans for
The pattern catalogs in most articles stop at the previous section. The most important question for production deployment is the one that does not appear: when agent A delegates to agent B, who is acting on whose behalf, what is B authorized to do, and what record of that delegation exists for audit?
The protocols have not caught up to the patterns. MCP added OAuth as an optional authorization layer, but OAuth tokens do not carry delegation chains — they authenticate the current call, not the chain of authority that led to it. A2A defines how agents discover and call each other but treats identity as self-declared in the agent card. In a system where an orchestrator delegates to a specialist that calls an MCP tool, the audit trail of who authorized what is currently fragmented or missing.
The architectural decisions worth making now, even when you cannot fully solve the underlying problem:
- Treat every agent-to-agent call as a security boundary. An agent that calls another should pass through some form of attenuated capability — the receiving agent should be able to do less than the caller, not more.
- Maintain an explicit delegation log at the orchestration layer. Even if you cannot enforce cryptographically that agent B was authorized by agent A, record the chain; a minimal sketch follows this list. Audits will need it long before regulators ask.
- Plan for the day the protocols mature. The A2A and MCP specs are evolving toward verifiable delegation; designs that already separate the call from the authorization to make the call will integrate new mechanisms more easily than designs that conflate them.
- Beware tools that “just work” because no auth is required. If a subagent can call an MCP tool with no authentication, so can anything else that learns the endpoint. The current state of MCP authentication adoption is poor enough that this is a real exposure, not a theoretical one.
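A minimal sketch of the first two items together: the attenuated capability and the delegation chain. The scope strings and names are illustrative, and nothing here is cryptographically enforced yet, which is exactly the gap the maturing protocols should fill.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Delegation:
    """One hop in the chain of authority: who delegated what to whom, with what scope."""
    principal: str          # who the work is ultimately for (the human or root agent)
    delegator: str
    delegate: str
    scopes: frozenset[str]  # capabilities granted: must be a subset of the delegator's

def attenuate(parent_scopes: frozenset[str], requested: frozenset[str]) -> frozenset[str]:
    """The delegate can do less than the caller, never more."""
    return parent_scopes & requested

chain: list[Delegation] = []
root = frozenset({"read:docs", "write:report", "call:search"})

worker_scopes = attenuate(root, frozenset({"read:docs", "call:search", "write:prod_db"}))
chain.append(Delegation("alice", "orchestrator", "research-worker", worker_scopes))
# write:prod_db was silently dropped: it was never in the orchestrator's own scopes.
# The chain is what an audit replays, even before the protocols can verify it.
```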
The teams that will look smart in two years are not the ones who picked the most fashionable pattern. They are the ones who picked deliberately, named the pattern in their architecture documents, and built the boring scaffolding — observability, identity, cost limits, the explicit delegation log — that the pattern itself did not provide. Multi-agent orchestration is now a recognized architectural category with shared vocabulary and shared failure modes. The vocabulary is the leverage. Once a team can name what it is building, the engineering problem becomes tractable, even when the patterns themselves are still settling.