Multi-Agent Systems
How networks of autonomous agents coordinate, fail, and occasionally outperform everything else
Lead Summary
A multi-agent system (MAS) is a computational architecture in which multiple autonomous software agents interact within a shared environment to solve problems that would be intractable, inefficient, or architecturally awkward for a single agent. Each agent perceives its environment, maintains internal state, and takes actions—independently or in coordination with others. The field draws from distributed computing, game theory, cognitive science, and organizational theory.
Multi-agent systems occupy a contested position in contemporary AI. They promise parallelism, specialization, and the capacity to handle tasks exceeding any single context window. They also introduce a new category of failure: not the failure of individual agents, but the failure of coordination itself. Empirical studies find 41%–86.7% failure rates across state-of-the-art open-source systems—yet the same architecture, properly designed, can outperform single-agent baselines by more than 90% on research synthesis tasks.
Understanding multi-agent systems means understanding when that gap closes, and why it so often doesn't.
Historical Development
The intellectual foundations of multi-agent systems predate modern AI by decades.
Carl Hewitt formalized the actor model in 1973 as a model of concurrent computation. In this model, actors are the fundamental unit of computation: upon receiving a message, an actor can make local decisions, create new actors, send messages to other actors, and determine how to respond to the next message. Crucially, actors only affect each other through asynchronous message passing—there is no shared mutable state, no need for lock-based synchronization. This model, extended in systems like Erlang/OTP, introduced hierarchical supervision trees where parent actors manage child failure through restart-or-escalate strategies.
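The message-passing discipline described above can be sketched in a few lines. This is a minimal illustration using Python's asyncio, not Hewitt's formalism or any particular actor framework; the `CounterActor` class and its message names are hypothetical.

```python
import asyncio


class CounterActor:
    """Hypothetical actor: private state, changed only via mailbox messages."""

    def __init__(self) -> None:
        self.mailbox: asyncio.Queue = asyncio.Queue()
        self._count = 0  # local state, never shared directly

    async def run(self) -> None:
        # Messages are handled one at a time, so no locks are needed.
        while True:
            message, reply_to = await self.mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.set_result(self._count)


async def demo() -> int:
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    for _ in range(3):
        await actor.mailbox.put(("increment", None))
    # Ask for the current count and wait for the reply message.
    reply = asyncio.get_running_loop().create_future()
    await actor.mailbox.put(("get", reply))
    count = await reply
    await actor.mailbox.put(("stop", None))
    await task
    return count


result = asyncio.run(demo())
```

Because the mailbox is a FIFO queue, the three increments are processed before the `get` request, so the caller observes a consistent count without touching the actor's state.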
Through the 1980s and 1990s, the field developed formal interaction protocols. Reid G. Smith introduced the Contract Net Protocol in 1980: a manager agent broadcasts a task, contractor agents bid, and the manager awards the contract. FIPA (Foundation for Intelligent Physical Agents) later standardized this and other protocols, [gaining IEEE acceptance in 2005](https://en.wikipedia.org/wiki/Foundation_for_Intelligent_Physical_Agents). The concurrent development of the Belief-Desire-Intention (BDI) architecture by Rao and Georgeff gave agents a formal cognitive model: Beliefs (what the agent knows), Desires (what it wants), and Intentions (what it has committed to pursue). BDI systems were deployed in real-world applications, including space shuttle fault diagnosis.
These classical architectures solved the question of how agents could coordinate without central control. What they could not anticipate was the arrival of large language models, which changed the nature of the agent itself.
By 2024–2025, LLM-based agentic systems had adopted the actor model as a natural runtime architecture: frameworks like Akka provide supervision, backpressure, and message-driven isolation suited to long-running agentic workflows that integrate tool use and I/O operations. FIPA standards, meanwhile, largely gave way to new protocols oriented around natural language interaction. In 2025, Ecma International published the Natural Language Interaction Protocol (NLIP), reflecting the shift toward language-centric agent communication. The classical problems of coordination, consensus, and fault tolerance had not disappeared; they had moved to a new substrate.
Core Concepts
Agents and Their Internal Architecture
An agent in the classical sense is an entity that perceives its environment and acts upon it. The BDI model provides one formalization: the agent maintains beliefs about the world (including inference rules), desires representing motivational goals, and intentions—the subset of desires to which the agent commits and actively pursues. This structure implements aspects of philosopher Michael Bratman's theory of human practical reasoning.
In the LLM era, these components manifest differently. The "belief state" is the context window; "desires" are encoded in system prompts and task specifications; "intentions" emerge through the reasoning trace and tool calls. The formalism is less explicit, but the underlying challenge—how an agent should reason about incomplete information while committing to particular action sequences—remains the same.
Coordination Mechanisms
Agents can coordinate through multiple mechanisms, which can be combined:
Direct communication — Agents exchange messages explicitly. The Contract Net Protocol is a canonical example: manager proposes, contractors bid, manager awards.
Shared environment / Blackboard — The blackboard model uses a shared data structure as the sole communication medium. Agents only interact indirectly by reading from and writing to the blackboard, without direct communication. Recent LLM-based implementations (bMAS, Stigmergic Blackboard Protocol) adapt this classical model, allowing agents to discover work and coordinate through environment-based signals rather than rigid orchestration. Blackboard-based architectures can match or exceed traditional orchestrator-worker performance while maintaining higher token efficiency, because agents pull only what they need rather than receiving everything through a central bottleneck.
Stigmergy — Indirect coordination mediated through the environment. Agents leave traces (analogous to pheromones) that stimulate subsequent actions by other agents, without direct communication. Termites build nests this way. In computer science, stigmergy underpins ant colony optimization and swarm robotics, enabling emergent coordination without global control.
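As one illustration of the direct-communication mechanism, the Contract Net exchange (broadcast, bid, award) can be sketched as follows. The `Contractor` class, the cost table, and the lowest-cost award rule are assumptions made for the example; the protocol itself does not prescribe how bids are scored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bid:
    contractor: str
    cost: float


class Contractor:
    """Hypothetical contractor agent with a fixed cost table."""

    def __init__(self, name: str, costs: dict) -> None:
        self.name = name
        self.costs = costs  # task type -> estimated cost (illustrative values)

    def bid(self, task: str) -> Optional[Bid]:
        # Decline (return None) when the task is outside this agent's skills.
        cost = self.costs.get(task)
        return None if cost is None else Bid(self.name, cost)


def award(task: str, contractors: list) -> Optional[str]:
    """Manager role: broadcast the task, collect bids, award to the cheapest."""
    bids = [bid for c in contractors if (bid := c.bid(task)) is not None]
    if not bids:
        return None  # nobody can take the task
    return min(bids, key=lambda bid: bid.cost).contractor


contractors = [
    Contractor("alpha", {"translate": 5.0}),
    Contractor("beta", {"translate": 3.0, "summarize": 2.0}),
]
winner = award("translate", contractors)
```

Here both contractors bid on "translate" and the manager awards it to the cheaper one; a task nobody bids on returns `None`, which the manager would need to handle by re-announcing or escalating.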
Game Theory and Strategic Interaction
When agents have potentially conflicting objectives, game theory provides the analytical framework. A Nash equilibrium describes a state where no agent can improve its outcome by unilaterally changing strategy while others hold theirs. Multi-agent reinforcement learning extends this through stochastic games—combining the temporal dynamics of Markov Decision Processes with the strategic interdependence of normal-form games. Interaction objectives fall into three archetypes: collaborative (fully cooperative), adversarial (zero-sum), and mixed-motive (general-sum).
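The Nash condition can be checked mechanically for small games. The sketch below enumerates pure-strategy equilibria of a two-player normal-form game; the payoff matrix is the textbook Prisoner's Dilemma, chosen here only as an illustration of a mixed-motive game.

```python
from itertools import product


def pure_nash_equilibria(payoffs):
    """Return all (row, col) cells where neither player gains by deviating.

    payoffs[i][j] is a pair: (row player's payoff, column player's payoff).
    """
    rows, cols = len(payoffs), len(payoffs[0])
    equilibria = []
    for i, j in product(range(rows), range(cols)):
        # Row player cannot do better by switching rows while column stays at j.
        row_best = all(payoffs[i][j][0] >= payoffs[k][j][0] for k in range(rows))
        # Column player cannot do better by switching columns while row stays at i.
        col_best = all(payoffs[i][j][1] >= payoffs[i][k][1] for k in range(cols))
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria


# Strategy 0 = cooperate, strategy 1 = defect (illustrative payoffs).
prisoners_dilemma = [
    [(-1, -1), (-3, 0)],
    [(0, -3), (-2, -2)],
]
equilibria = pure_nash_equilibria(prisoners_dilemma)
```

The only equilibrium is mutual defection at cell (1, 1), even though mutual cooperation would leave both players better off, which is the tension mixed-motive settings have to manage.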
In collaborative MARL, emergent coordination is observed when decentralized agents learn to cooperate without explicit signals. Research identifies distinct phases in MARL dynamics: a coordinated and stable phase, a fragile transition region, and a disordered phase—separated by an "Instability Ridge" driven by kernel drift from other agents' policy updates.
Fault Tolerance and Consensus
Distributed multi-agent systems face fundamental challenges around agreement under failure. The Practical Byzantine Fault Tolerance (PBFT) algorithm guarantees safety as long as at most f of 3f+1 agents are faulty, adding less than a millisecond of latency while processing thousands of requests per second.
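The 3f+1 bound translates into simple sizing arithmetic. The helper names below are hypothetical; the 2f+1 quorum size is the standard figure for a PBFT deployment of exactly 3f+1 replicas.

```python
def min_replicas(f: int) -> int:
    """Smallest cluster that tolerates f Byzantine agents: 3f + 1."""
    return 3 * f + 1


def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    if n < 1:
        raise ValueError("cluster must have at least one agent")
    return (n - 1) // 3


def quorum(f: int) -> int:
    """Matching messages needed per phase when n = 3f + 1: 2f + 1."""
    return 2 * f + 1


sizes = {f: (min_replicas(f), quorum(f)) for f in (1, 2, 3)}
```

Note that growing a cluster from 4 to 6 agents buys no extra tolerance: `max_faulty` stays at 1 until the seventh agent is added.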
Raft, by contrast, is designed for crash-fault tolerance and is more understandable than Paxos—but Raft explicitly does not handle Byzantine faults. It assumes agents either follow the protocol or crash; it cannot handle agents that lie, omit information, or act maliciously.
Orchestration Architectures
The Orchestrator-Worker Pattern
The orchestrator-worker pattern is the most-deployed multi-agent architecture in production as of 2024–2025. A single orchestrator agent receives incoming tasks, classifies and decomposes them into subtasks, routes each subtask to specialized worker agents, aggregates outputs, and synthesizes a final response. Workers do not communicate directly with each other—all coordination flows through the orchestrator.
Anthropic's multi-agent research system exemplifies this in production: a LeadResearcher (orchestrator) decomposes queries into subtasks, spawns parallel subagents to search for information, synthesizes results, and decides whether additional research cycles are needed. Once sufficient information is gathered, outputs pass to a CitationAgent for attribution. In internal evaluations, Claude Opus 4 as lead agent with Claude Sonnet 4 as supporting subagents outperformed a single-agent setup by more than 90 percent.
The orchestrator's primary responsibilities are task classification, decomposition, and worker assignment. Effective decomposition requires providing each worker with: a clear objective, explicit output format specification, guidance on which tools and sources to use, and clear task boundaries. Vague objectives, unbounded scope, and missing output format specifications are the direct precursors of worker failure.
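One way to make those four requirements hard to skip is to encode them in the task object the orchestrator hands to each worker. This is a sketch only: the `WorkerTask` type, its field names, and the five-word vagueness heuristic are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkerTask:
    """Hypothetical subtask specification passed from orchestrator to worker."""

    objective: str
    output_format: str
    allowed_tools: List[str] = field(default_factory=list)
    boundaries: str = ""

    def problems(self) -> List[str]:
        """List the specification gaps that tend to precede worker failure."""
        issues = []
        if len(self.objective.split()) < 5:  # arbitrary illustrative threshold
            issues.append("objective is too vague")
        if not self.output_format.strip():
            issues.append("missing output format")
        if not self.allowed_tools:
            issues.append("no tool or source guidance")
        if not self.boundaries.strip():
            issues.append("unbounded scope")
        return issues


good = WorkerTask(
    objective="Summarize 2024 papers on Raft leader election",
    output_format="JSON list of {title, finding}",
    allowed_tools=["paper_search"],
    boundaries="Raft only; skip Paxos variants",
)
bad = WorkerTask(objective="Research AI", output_format="")
```

An orchestrator could refuse to dispatch any task whose `problems()` list is non-empty, turning a prompt-writing guideline into a checked precondition.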
Centralized orchestrators create throughput ceilings. At roughly 3 seconds per routing decision with 20 agents waiting, maximum decomposition throughput is about 6.7 tasks per second (20 / 3). At 100+ concurrent requests, the orchestrator becomes the rate limiter. The orchestrator also faces a context window bottleneck: for tasks producing more than 50 intermediate results, synthesizing all worker outputs can exceed 128K-token context limits.
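The two ceilings can be reproduced with back-of-the-envelope arithmetic. The sketch assumes the 6.7 figure comes from dividing 20 waiting agents by 3 seconds per decision, and the per-result token counts are made-up values used only to show where the 128K limit bites.

```python
def dispatch_ceiling(agents: int, seconds_per_decision: float) -> float:
    """Upper bound on tasks per second under the assumed agents / latency model."""
    return agents / seconds_per_decision


def synthesis_fits(results: int, tokens_per_result: int,
                   context_limit: int = 128_000) -> bool:
    """True if all worker outputs fit in the orchestrator's context at once."""
    return results * tokens_per_result <= context_limit


ceiling = round(dispatch_ceiling(20, 3.0), 1)
fits_small = synthesis_fits(results=50, tokens_per_result=2_000)   # 100K tokens
fits_large = synthesis_fits(results=50, tokens_per_result=3_000)   # 150K tokens
```

With 50 results, the synthesis step overflows as soon as the average worker output exceeds 2,560 tokens, which is why orchestrators often ask workers for condensed summaries rather than raw findings.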
Context Scoping
In hierarchical orchestrator-worker systems, context scoping is a mechanism where the orchestrator passes only limited, task-specific context to sub-agents rather than the full problem context. This prevents sub-agents from being overwhelmed with irrelevant information, keeps token costs low, and allows workers to specialize on focused domains. In Claude Code's subagent architecture specifically, subagents operate as stateless workers in independent context windows—the main agent is the sole "State Awareness" owner. Output files serve as the explicit handoff mechanism between stages.
Supervisor versus Swarm
The supervisor (orchestrator-worker) pattern trades latency and token cost for routing accuracy and observability. Supervisor patterns require extra LLM routing calls (2 calls per domain interaction); swarm patterns allow direct agent handoffs (1 call per domain after initial routing). For a three-agent workflow, supervisor costs can run 4–6x a single-agent baseline. But supervisor routing accuracy is higher because routing is a dedicated, focused LLM task, and all routing decisions are centralized and visible in traces.
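The routing overhead can be counted directly. The sketch below takes the per-domain figures from the paragraph above and assumes the swarm's initial routing costs exactly one LLM call; token cost per call is not modeled.

```python
def supervisor_calls(domains: int) -> int:
    """Two LLM calls per domain interaction: route, then execute."""
    return 2 * domains


def swarm_calls(domains: int, initial_routing_calls: int = 1) -> int:
    """One call per domain after the (assumed single) initial routing call."""
    return initial_routing_calls + domains


three_agent = (supervisor_calls(3), swarm_calls(3))
```

For a three-agent workflow this gives 6 calls versus 4; the gap widens linearly with the number of domains touched.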
Suitability depends on task interdependency: orchestrator-worker fits workflows requiring dynamic routing, ordered execution, and conflict resolution; swarm fits independent or loosely-coupled workloads where routing logic is embedded in the task itself.
Deterministic Orchestration
Deterministic code-based orchestration—where workflow logic is predefined rather than LLM-driven—provides superior speed, cost, and predictability. The LLM is invoked only for bounded, well-defined subtasks rather than making orchestration decisions. This blueprint-first approach codifies operational procedures into machine-readable execution blueprints with explicit sequences, conditional logic, and decision points. In incident response tasks, multi-agent LLM orchestration with this pattern achieves 100% actionable recommendation rates versus 1.7% for single-agent approaches.
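A blueprint-first workflow might look like the sketch below: ordinary code owns the sequence and branching, and the model is called once for a bounded classification. The step names, the `fake_llm` stub, and the fail-safe default are hypothetical and stand in for whatever a real runbook and model client would provide.

```python
from typing import Callable, Dict, List


def classify_severity(alert: Dict[str, str], llm: Callable[[str], str]) -> str:
    """Bounded LLM subtask: the reply must be one of a fixed set of labels."""
    reply = llm(f"Classify severity (low/high): {alert['message']}")
    label = reply.strip().lower()
    # Anything outside the allowed labels falls back to the safer branch.
    return label if label in {"low", "high"} else "high"


def run_blueprint(alert: Dict[str, str], llm: Callable[[str], str]) -> List[str]:
    """Fixed sequence with one conditional branch; the LLM never picks steps."""
    steps = ["collect_logs"]
    if classify_severity(alert, llm) == "high":
        steps += ["page_on_call", "open_incident"]
    else:
        steps.append("create_ticket")
    steps.append("write_summary")
    return steps


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "HIGH" if "outage" in prompt else "low"


high_path = run_blueprint({"message": "database outage"}, fake_llm)
low_path = run_blueprint({"message": "disk at 70%"}, fake_llm)
```

Because the control flow is plain code, every possible execution path can be enumerated and tested ahead of time, which is where the predictability comes from.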
Frameworks
The orchestrator-worker pattern is the canonical abstraction across leading frameworks:
- LangGraph implements it via graph-based workflows with typed state channels and a central supervisor node
- OpenAI Agents SDK (March 2025) replaced Swarm with explicit handoff abstractions carrying conversation context
- CrewAI provides role-based agent teams with built-in delegation and memory
- Microsoft AutoGen supports multi-agent conversational patterns with human-in-the-loop
Standardization across frameworks suggests the pattern is architecturally sound. The tradeoff is vendor lock-in around each framework's routing and state management implementations—the frameworks have converged on the pattern while remaining operationally incompatible.
Protocols
Model Context Protocol (MCP)
MCP standardizes how agents interact with tools, resources, and external systems—an agent-to-tool relationship with implicit hierarchy. MCP is built on JSON-RPC 2.0 with schema enforcement, providing explicit handoff semantics and format guarantees that eliminate silent parsing failures. As of early 2026, over 500 public MCP servers are available, supported by Anthropic, OpenAI, and Google DeepMind. In December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation.
MCP lacks standardized discovery; tool discovery currently happens through host application configuration files, though an official MCP registry is planned.
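Since MCP rides on JSON-RPC 2.0, a tool invocation is an ordinary request envelope. The sketch below builds a `tools/call` request and parses a response using only the standard library; the tool name `search_docs`, its arguments, and the sample result are hypothetical, and a real client would normally use an MCP SDK and transport rather than hand-built messages.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requires an id to match replies to requests


def tool_call_request(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 envelope for an MCP tool invocation."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def parse_response(raw: str, expected_id: int) -> dict:
    """Return the result payload, or raise if the reply is malformed or an error."""
    message = json.loads(raw)
    if message.get("jsonrpc") != "2.0" or message.get("id") != expected_id:
        raise ValueError("not a matching JSON-RPC 2.0 response")
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


request = tool_call_request("search_docs", {"query": "raft"})
# Illustrative server reply; the result shape here is simplified.
raw_reply = json.dumps({
    "jsonrpc": "2.0",
    "id": request["id"],
    "result": {"content": [{"type": "text", "text": "3 documents found"}]},
})
result = parse_response(raw_reply, request["id"])
```

The explicit `id` and the separate `error` member are what give the caller a definite success-or-failure signal instead of free text to interpret.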
Agent2Agent Protocol (A2A)
A2A standardizes peer-to-peer communication between independent, potentially opaque agents—preserving agent autonomy and enabling collaborative interaction patterns rather than tool consumption. A2A uses Agent Cards—JSON metadata documents published at well-known URIs describing capabilities, skills, and authentication requirements—enabling both online discovery and offline registry-based discovery.
As of early 2026, A2A has been adopted by 150+ organizations in production environments, with Microsoft, AWS, Salesforce, SAP, and ServiceNow routing real tasks between heterogeneous agents. Native A2A support is built into Google's Agent Development Kit, LangGraph, CrewAI, LlamaIndex Agents, Semantic Kernel, and AutoGen.
MCP is designed for standardizing how agents interact with tools. A2A is designed for standardizing how agents interact with each other.
Failure Modes
The MAST Taxonomy
The MAST (Multi-Agent System Failure Taxonomy) identifies 14 unique failure modes in multi-agent LLM systems, derived from analysis of 150+ execution traces across 7 popular frameworks (including AutoGen, CrewAI, and LangGraph) and validated through 1,600+ annotated traces. The modes fall into three structural categories:
System design issues — Disobey Task Specification (15.2%, the largest single category), Disobey Role Specification, Step Repetition, Loss of Conversation History, Unaware of Termination Conditions, Fail to Ask for Clarification.
Inter-agent misalignment — Conversation Reset, Task Derailment, Information Withholding, Ignored Other Agent's Input, Reasoning-Action Mismatch. This cluster accounts for 31–36.9% of all failures.
Task verification failures — Premature Termination, No or Incomplete Verification, Incorrect Verification (13.6%).
The taxonomy's central finding: these are not capability failures. They are structural coordination failures—specification problems, information flow breakdowns, and organizational design errors that would cause human teams to fail in exactly the same ways.
Context Isolation
Context isolation is the most frequent root cause of multi-agent system failures in production. When agents operate from different representations or definitions of the same data, conflicting but individually coherent answers emerge with no mechanism for conflict resolution. Cognition's "Flappy Bird" example is illustrative: one subagent builds a Super Mario background while another builds a bird that does not look like a game asset. Both outputs are locally coherent and globally useless, because neither subagent had access to the shared design context.
Error Cascading and Conformity Bias
Two distinct failure dynamics compound the problem:
Error cascading occurs when a single agent's initial mistake propagates through downstream agents without detection. Each subsequent agent builds upon the faulty foundation, amplifying the original error.
Conformity bias drives agents to reinforce each other's errors rather than independently evaluating them. As an error cascades, the consensus accepting it grows stronger at each step—"consensus inertia" that makes false agreements increasingly difficult to correct. Injecting purely objective context can actually accelerate polarization: a "trigger vulnerability" where acknowledging the basis of a false consensus strengthens it.
Communication topology shapes propagation. Moderately sparse topologies suppress error propagation while preserving beneficial information diffusion; dense topologies increase consensus amplification; fully sparse topologies may fail to share corrective information.
Adversarial Persuasion
A single strategically designed adversarial agent can reduce system accuracy by 10–40% while increasing consensus on incorrect answers by 30% or more. It does so through coherent, confident, misleading arguments, not classical prompt injection. This vulnerability persists even with inference-time enhancement techniques (Best-of-N, RAG), which may amplify attacks by increasing the perceived credibility of flawed arguments. Adding more agents or more debate rounds does not reliably mitigate the attack.
Context Rot
Context rot, the measurable degradation of accuracy as token count grows, typically sets in around 100K tokens. The mechanism involves lost-in-the-middle attention gaps (models attend well to the beginning and end of the context but poorly to middle positions, with a >30% accuracy drop for middle-position content), attention dilution, and distractor interference. Multi-agent systems consume approximately 15x as many tokens as single-agent baselines for comparable tasks, which sharply increases their exposure to context rot.
Format and Protocol Misalignment
Format mismatches at handoff boundaries include extra text wrapped around JSON, new unrequested keys, strings instead of arrays, missing required fields, and enum violations. YAML outperforms JSON for LLM comprehension (up to 54% more correct answers), but its whitespace sensitivity leads to silent parsing errors and it lacks native schema enforcement. JSON is preferable for agent-to-agent pipelines where strict format consistency matters.
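Each of the mismatch types listed above can be caught at the boundary with a small validator. The schema below (a `status` enum plus a `findings` array) is a hypothetical handoff contract, and the brace-slicing step is a naive heuristic that will fail if the surrounding chatter itself contains braces.

```python
import json

# Hypothetical handoff contract: field name -> required Python type.
REQUIRED_FIELDS = {"status": str, "findings": list}
ALLOWED_STATUS = {"ok", "needs_review"}


def parse_handoff(raw: str) -> dict:
    """Extract and validate a JSON handoff, raising ValueError on any mismatch."""
    # Strip extra text wrapped around the JSON object (naive heuristic).
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    data = json.loads(raw[start:end + 1])

    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            errors.append(f"missing field: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"{key} should be {expected.__name__}")
    extra = sorted(set(data) - set(REQUIRED_FIELDS))
    if extra:
        errors.append(f"unexpected keys: {extra}")
    status = data.get("status")
    if isinstance(status, str) and status not in ALLOWED_STATUS:
        errors.append(f"status not in enum: {status}")
    if errors:
        raise ValueError("; ".join(errors))
    return data


wrapped = 'Sure! Here is the result:\n{"status": "ok", "findings": ["a"]}\nHope that helps.'
clean = parse_handoff(wrapped)
```

Failing loudly here is the point: a rejected handoff can be retried or escalated, whereas a silently coerced one becomes the upstream error that downstream agents build on.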
Mitigations
Structured Outputs
Structured output schemas with JSON Schema validation reduce handoff failure rates. OpenAI's GPT-4o with strict JSON Schema enforcement achieved 100% schema compliance in evaluations, compared to under 40% without enforcement. The OpenAI Agents SDK exposes handoff schemas as tool parameters: input_type defines the handoff schema in strict mode, the SDK validates the returned JSON locally, and the parsed value is passed to the handler.
Verification Agents
The VeriMAP framework decomposes tasks into directed acyclic graphs where each subtask specifies verification functions alongside execution. Executor outputs are verified before handoff; agents can self-refine when outputs violate contracts; smaller models can participate effectively because explicit requirements enable validation. Verification-aware design reduces cascading failures from upstream errors.
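The general idea of pairing each subtask with a verification function can be sketched as below. This is not VeriMAP's implementation or API; the `run_verified` function, the retry policy, and the toy tasks are assumptions used to show verification gating each handoff in a DAG.

```python
from graphlib import TopologicalSorter


def run_verified(tasks: dict, deps: dict, max_retries: int = 1) -> dict:
    """Run subtasks in dependency order, verifying each output before handoff.

    tasks: name -> (execute, verify); execute(inputs, attempt) returns an output.
    deps:  name -> set of upstream task names.
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        execute, verify = tasks[name]
        inputs = {dep: results[dep] for dep in deps.get(name, ())}
        for attempt in range(max_retries + 1):
            output = execute(inputs, attempt)
            if verify(output):
                results[name] = output  # only verified outputs flow downstream
                break
        else:
            raise RuntimeError(f"{name} failed verification")
    return results


# Toy pipeline: the sorter "forgets" to sort on its first attempt.
tasks = {
    "extract": (lambda inputs, attempt: [3, 1, 2],
                lambda out: len(out) == 3),
    "sort": (lambda inputs, attempt: sorted(inputs["extract"]) if attempt else inputs["extract"],
             lambda out: out == sorted(out)),
}
deps = {"extract": set(), "sort": {"extract"}}
results = run_verified(tasks, deps)
```

The first `sort` attempt is rejected by its verifier and the retry succeeds, so the faulty output never reaches a downstream task.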
Heterogeneous Agent Teams
Deploying agents based on different foundation models substantially improves accuracy. Benchmarks show heterogeneous combinations achieve 91% accuracy on reasoning tasks versus 82% for homogeneous teams. Architectural diversity enables teacher-student dynamics and reduces consensus amplification by forcing agents to reason from different computational perspectives. Moderate, deliberate disagreement—structured through heterogeneous design—corrects errors without polarizing outputs.
Observability
Agent observability requires distributed tracing because agentic systems span model APIs, vector databases, external tools, and sub-agents. Without a common schema and tracing infrastructure, you cannot reconstruct which tools were called, in what order, or where reasoning diverged from the intended plan. Each step should be recorded as a nested span capturing the decision to choose the next action, execution order, plan divergence, branching and sub-agent handoffs with correlation IDs, and stuck states.
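A minimal version of nested spans with a shared correlation ID can be written with the standard library. The `Tracer` class and span names are hypothetical, and the sketch is neither thread-safe nor async-aware; a production system would typically rely on an established tracing library instead.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Hypothetical in-memory tracer recording nested spans for one request."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex  # correlation ID shared by all spans
        self.spans = []
        self._stack = []  # span IDs of currently open spans

    @contextmanager
    def span(self, name: str, **attributes):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:8],
            "parent_id": self._stack[-1] if self._stack else None,
            "name": name,
            "attributes": attributes,
            "start": time.monotonic(),
        }
        self.spans.append(record)
        self._stack.append(record["span_id"])
        try:
            yield record
        finally:
            # Record duration even if the step raised, then close the span.
            record["duration"] = time.monotonic() - record["start"]
            self._stack.pop()


tracer = Tracer()
with tracer.span("orchestrator.plan", decision="delegate"):
    with tracer.span("subagent.search", tool="web_search"):
        pass
```

Following `parent_id` links from any span back to the root reconstructs which decision led to which tool call, in order.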
When Multi-Agent Helps (and When It Doesn't)
The field has developed a genuine empirical debate about when multi-agent architectures provide value.
Where multi-agent wins: Anthropic's research system outperformed single-agent baselines by 90.2% on internal research evaluation—tasks requiring sustained, parallel information synthesis across multiple interconnected domains. Code Broker's hierarchical two-layer architecture with parallel specialists achieves state-of-the-art pass@1 scores on code generation (93.9% HumanEval, 83.1% MBPP). Specialized agents achieve 20% higher accuracy and 40% faster results on domain-specific tasks versus general-purpose single agents. Industry benchmarks report 3x faster task completion and 60% better accuracy on complex workflows. Databricks reported a 327% increase in multi-agent workflow usage in four months (June–October 2025).
Where single-agent is competitive: A Stanford study demonstrates that single-agent systems can match or outperform multi-agent systems on multi-hop reasoning benchmarks (FRAMES and MuSiQue) under equal reasoning-token budgets, grounded in the Data Processing Inequality. Reported multi-agent advantages on popular benchmarks are often artifacts of unaccounted computation rather than architectural benefits.
The debate between Cognition ("Don't Build Multi-Agents") and Anthropic (whose post, published one day later, reported 90.2% gains) shows how task-dependent the answer is: Cognition's case applies to conversational agents and coding tasks (short, coherent, context-sensitive), while Anthropic's case applies to research synthesis requiring comprehensive multi-domain analysis. Neither is universally right.
The benchmark definition crisis: reported failure rates range from 41% to 86.7% across 7 state-of-the-art open-source multi-agent systems. Most published multi-agent work is evaluated on non-agentic tasks (static benchmarks like HumanEval), providing misleading guidance about where coordination provides value. Existing evaluations compare architectures using different prompts, tools, or computational budgets, conflating architectural effects with implementation choices. Several 2024–2025 benchmarking frameworks (MultiAgentBench, Gaia2, AgentArch) attempt to address this by measuring task completion, collaboration quality, and robustness across interactive scenarios.
Organizational Design
Conway's Law, formulated by Melvin Conway in 1967, states: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations." System components must communicate with each other to ensure compatibility, so technical architecture reflects the social boundaries of the organization.
The same principle applies to multi-agent systems: the coordination structure of an agent system reflects and is constrained by the information flow it was designed around. An orchestrator that must route through a single coordinator will bottleneck; a swarm with no shared context will fragment. Coordination failures in MAST arise from organizational structure flaws—not individual agent limitations—just as human organizations fail when information flow breaks down.
The inverse Conway maneuver for agent systems would mean designing information architecture before agent boundaries: deciding what each agent must know, what it must not know, and what it communicates outward—before deciding how to implement those boundaries technically.
Current Status
Enterprise adoption is high but operational success is low. 72% of enterprise AI projects now involve multi-agent architectures, up from 23% in 2024. Yet fewer than 10% of enterprises successfully scale AI agents to production, which suggests the pattern is conceptually understood and widely deployed while operational challenges (monitoring, failure recovery, resource management) remain largely unsolved.
The commercial frontier is moving fast. In early 2026, Anthropic ran an internal commercial pilot ("Project Deal") where AI agents completed 186 real transactions across 500+ listed items totaling just over $4,000 in one week—validating autonomous multi-agent negotiation and execution at small scale. The A2A protocol has reached 150+ production deployments with major cloud providers.
The field's central unsolved problem remains the benchmark definition crisis: without reproducible, agentic-task-oriented evaluation frameworks, it is difficult to know when a multi-agent architecture genuinely earns its coordination overhead, and when it merely adds cost and failure surface.
Key Takeaways
- Multi-agent systems promise parallelism and specialization but introduce coordination failures as a new failure category. Empirical studies find 41%–86.7% failure rates across state-of-the-art open-source systems, yet the same architecture can outperform single-agent baselines by over 90% on research synthesis tasks.
- The orchestrator-worker pattern dominates production deployments but creates throughput and context window bottlenecks. Effective decomposition requires providing each worker with clear objectives, explicit output format specifications, guidance on tools and sources, and clear task boundaries.
- Coordination failures in multi-agent systems are structural, not capability failures. The MAST taxonomy identifies 14 unique failure modes across system design issues, inter-agent misalignment, and task verification failures—specification problems and information flow breakdowns.
- Context isolation is the most frequent root cause of multi-agent system failures in production. When agents operate from different representations of the same data, conflicting but individually coherent answers emerge with no mechanism for conflict resolution.
- Multi-agent wins on parallel information synthesis and complex workflows but loses on short, coherent, context-sensitive tasks. Task-dependence is the key variable. Research synthesis sees gains above 90% in Anthropic's internal evaluation, and hierarchical code generation reaches state-of-the-art pass@1 scores; conversational agents and multi-hop reasoning under equal token budgets see competitive performance from single agents.
- MCP standardizes agent-to-tool interaction; A2A standardizes agent-to-agent peer communication. MCP has 500+ public servers; A2A has been adopted by 150+ organizations in production, including Microsoft, AWS, Salesforce, SAP, and ServiceNow.
Further Exploration
Foundational Concepts
- Actor Model — Carl Hewitt's 1973 formalization of concurrent computation
- Contract Net Protocol — Classical protocol for manager-contractor task allocation
- Belief-Desire-Intention Architecture — Formal cognitive model for agent reasoning
- Stigmergy — Indirect coordination through environmental signals
- Nash Equilibrium — Game theory framework for strategic interaction
Architectures and Patterns
- Anthropic Multi-Agent Research System — Production orchestrator-worker implementation with 90.2% improvement over single-agent
- Blackboard Model and bMAS — Shared environment-based coordination for LLM agents
- Supervisor vs Swarm Tradeoffs — Cost and latency comparison of two orchestration strategies
- Building Effective AI Agents — Guidelines for effective agent decomposition and worker assignment
- Claude Code Context Scoping — Stateless worker architecture with output file handoffs
- Deterministic Code-Based Orchestration — Blueprint-first approach achieving 100% actionable recommendations in incident response
Failure Analysis and Mitigation
- MAST: Multi-Agent System Failure Taxonomy — 14 failure modes across 150+ traces in 7 frameworks
- Adversarial Persuasion in Multi-Agent Systems — How a single adversarial agent reduces accuracy by 10–40%
- Error Cascading in Multi-Agent Systems — How initial mistakes propagate through downstream agents
- Context Rot in Large Language Models — Accuracy degradation at 100K tokens
- VeriMAP: Verification-Aware Multi-Agent Planning — Framework reducing cascading failures through subtask-level verification
- Heterogeneous Agent Teams — 91% accuracy with heterogeneous teams vs 82% for homogeneous
- Agent Observability: Distributed Tracing — Spanning model APIs, databases, tools, and sub-agents
- Gradient Institute: Multi-Agent Risk Analysis — Bias amplification, conformity bias, and governance
Protocols and Standards
- Model Context Protocol (MCP) — JSON-RPC 2.0 standard for agent-to-tool interaction
- Agent2Agent Protocol (A2A) — Peer-to-peer agent communication with Agent Cards
- A2A and MCP Comparison — Complementary roles in agent communication
- Natural Language Interaction Protocol (NLIP) — Ecma International protocol for language-centric agent communication
- Foundation for Intelligent Physical Agents (FIPA) — IEEE-accepted standards from the 1990s-2000s
Frameworks and Implementation
- Best Multi-Agent Frameworks 2026 — Overview of LangGraph, OpenAI Agents SDK, CrewAI, and AutoGen
- Multi-Agent Orchestration Best Practices — Structured outputs and schema validation
- Vellum: Multi-Agent Context Engineering — Context scoping strategies for sub-agents
- Code Broker: Hierarchical Two-Layer Architecture — 93.9% HumanEval, 83.1% MBPP on code generation
Benchmarking and Evaluation
- Single-Agent vs Multi-Agent on Multi-Hop Reasoning — Stanford study showing single-agent competitiveness under equal token budgets
- MultiAgentBench: Evaluating Collaboration and Competition — First comprehensive benchmark for LLM-based multi-agent systems
- Emergent Coordination in MARL — Phase transitions and Instability Ridge in agent learning dynamics
- When Multi-Agent Benchmarking is Misleading — Most published work evaluated on non-agentic static benchmarks
Commercial and Organizational Applications
- Anthropic Project Deal: Agent Commerce Pilot — 186 transactions across 500+ items in one week
- A2A at 150+ Production Deployments — Adoption by Microsoft, AWS, Salesforce, SAP, ServiceNow
- Conway's Law and Multi-Agent Design — How organizational structure constrains system architecture
- Enterprise AI Adoption and Scaling Challenges — 72% adoption rate but <10% production success
- Specialized Agents Outperform Generalists — 20% higher accuracy and 40% faster results on domain-specific tasks
- Multi-Agent Framework Usage Growth — 327% increase in Databricks usage (June–October 2025)
Consensus and Fault Tolerance
- Practical Byzantine Fault Tolerance (PBFT) — Guarantees safety with at most f faulty agents out of 3f+1
- Raft Consensus Algorithm — Crash-fault tolerance; does not handle Byzantine faults