FEB 21, 2026 LEARNING 8 MIN READ

How We Built Our First Multi-Agent System (and What Went Wrong)

An honest account of the failure modes, debugging nightmares, and hard-won patterns from building our first production multi-agent system for pharmaceutical quality investigation.

By Wenable Labs

Multi-agent systems sound elegant in architecture diagrams. Multiple specialized agents, each handling a narrow domain, coordinated by a supervisor that synthesizes their outputs into a coherent result. The diagrams are clean. The arrows are tidy. The separation of concerns is textbook.

In practice, building our first multi-agent system was humbling. We shipped it, and it works well in production today. But the path from “this should work” to “this actually works” was considerably longer than we projected. The system did not fail in the ways we anticipated. It failed in ways that only emerge when multiple autonomous components interact under real-world conditions. This is an honest account of what went wrong and what we learned.

The Initial Excitement

The appeal of multi-agent architecture is intuitive: decompose a complex problem into specialized sub-problems, assign each to a purpose-built agent, and let a supervisor coordinate the whole. The theoretical benefits (modularity, specialization, parallel execution) are well documented in the literature.

Our pharmaceutical quality investigation system was the first candidate. The use case demanded analysis of manufacturing deviations across six dimensions, following the established 6M framework: Man, Machine, Material, Method, Measurement, and Mother Nature (Environment). Each dimension requires distinct domain expertise. A human factors analysis draws on different knowledge and reasoning patterns than a materials science investigation or an equipment calibration review.

Six agents, each an expert in one dimension, coordinated by a supervisor that synthesizes their individual findings into a comprehensive investigation report. The architecture diagram looked clean. We estimated four weeks to production. It took closer to twelve.
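The initial decomposition can be sketched in a few lines. This is an illustrative skeleton, not the production code: the agent internals are stubbed out, and the names (`Finding`, `run_specialist`, `supervise`) are assumptions for the sake of the example.

```python
# Hypothetical sketch of the initial design: one specialist per 6M
# dimension, fanned out by a supervisor. Agent internals are stubbed.
from dataclasses import dataclass

DIMENSIONS = ["Man", "Machine", "Material", "Method", "Measurement", "Environment"]

@dataclass
class Finding:
    dimension: str
    summary: str

def run_specialist(dimension: str, record: str) -> Finding:
    # In the real system this step would call an LLM with a
    # dimension-specific prompt; stubbed here to stay self-contained.
    return Finding(dimension, f"Analysis of {dimension} for: {record[:40]}")

def supervise(record: str) -> list:
    # Fan out to all six specialists, then hand findings to synthesis.
    return [run_specialist(d, record) for d in DIMENSIONS]

findings = supervise("Batch 4711 deviation: out-of-spec assay result")
print(len(findings))  # 6
```

The clean fan-out is exactly what the architecture diagram promised; every failure mode described below lives in the parts this sketch leaves out.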

The decomposition was sound. The problems were all in the interactions.

Failure Mode 1: Cascading Hallucinations

The first serious failure emerged during validation testing. One of the specialist agents (the Material agent, analyzing raw material inputs for a batch deviation) hallucinated a plausible-sounding root cause. It cited a supplier certificate number that did not exist and attributed a contamination pathway to a material lot that had already passed quality control. The claim was specific, detailed, and entirely fabricated.

That would have been manageable in isolation. A single agent hallucination is a known risk with established mitigations. The compounding problem was what happened next. The other agents received the Material agent’s output as contextual input for their own analyses. The Method agent incorporated the fabricated contamination pathway into its process flow analysis. The Measurement agent referenced the nonexistent certificate number when evaluating analytical results. The supervisor agent, receiving five downstream analyses that all corroborated the Material agent’s fiction, synthesized a confident and thoroughly wrong investigation report.

The failure was cascading hallucination: one agent’s fabrication became five agents’ shared assumption.

The fix required structural changes. Every agent output now includes a confidence score and a citation chain that traces each claim back to specific source documents. The supervisor validates sources before incorporating any claim into the synthesis. If an agent cannot cite specific evidence from the investigation record for a factual claim, that claim is flagged as unsupported and excluded from the final report. We also introduced cross-validation checks where agents with overlapping domain knowledge independently verify each other’s critical findings.
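The "validate before synthesize" rule can be sketched as a simple gate: a claim enters the synthesis only if every cited document ID actually exists in the investigation record and the agent's confidence clears a floor. The schema here (`Claim`, the 0.7 threshold, the document IDs) is illustrative, not the production contract.

```python
# Hedged sketch of citation-chain validation before synthesis.
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    confidence: float          # agent-reported confidence, 0.0-1.0
    citations: list = field(default_factory=list)  # source document IDs

def validate_claims(claims, known_doc_ids, min_confidence=0.7):
    supported, flagged = [], []
    for c in claims:
        # A factual claim must cite at least one real source document
        # and clear the confidence floor; otherwise it is flagged as
        # unsupported and excluded from the final report.
        if (c.citations
                and all(d in known_doc_ids for d in c.citations)
                and c.confidence >= min_confidence):
            supported.append(c)
        else:
            flagged.append(c)
    return supported, flagged

docs = {"COA-2031", "BR-7741"}  # IDs present in the investigation record
claims = [
    Claim("Lot 88 passed incoming QC", 0.9, ["COA-2031"]),
    Claim("Certificate X-999 shows contamination", 0.8, ["X-999"]),  # fabricated ID
]
supported, flagged = validate_claims(claims, docs)
print(len(supported), len(flagged))  # 1 1
```

The point of the gate is placement: it sits between the specialists and the supervisor, so a fabricated certificate number is stopped before it can become five agents' shared assumption.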

Failure Mode 2: The Debugging Nightmare

When the final investigation report contained an error (and in the early weeks, it frequently did) we could not determine which agent introduced it. A wrong conclusion in the supervisor’s synthesis might have originated in any of the six specialist agents, or in the supervisor’s reasoning over their combined outputs, or in the interaction between two agents’ partial conclusions.

Our initial logging was insufficient. We logged inputs and outputs at the system boundary but treated the inter-agent communication as an internal detail. This was a fundamental mistake. Debugging a multi-agent system without visibility into every agent interaction is like debugging a distributed microservices architecture without distributed tracing. The analogy is not casual; it is exact. A multi-agent system is a distributed system, and it demands the same observability infrastructure.

The fix was comprehensive. Every agent action now receives a trace ID that propagates across all downstream calls. We implemented structured logging with full input-output pairs at each agent step, including the prompts sent to language models and the raw completions received. Most critically, we built a replay capability that allows us to re-run any single agent in isolation using the exact inputs it received during a production run. When a report contains an error, we can now trace it to the specific agent, the specific input that triggered it, and the specific reasoning step where the logic went wrong.

This observability infrastructure took three weeks to build. We should have built it before we built the second agent.

Failure Mode 3: Cost Explosion

Six specialist agents, each making multiple LLM calls with full context windows, each retrieving documents from the knowledge base independently. Our initial cost per investigation was roughly 10x what we had projected. A single pharmaceutical deviation investigation was consuming token volumes that made the unit economics untenable.

The root cause was straightforward: we had optimized for capability without considering cost architecture. Every agent loaded the full investigation record into its context window. Every agent made independent RAG calls to retrieve relevant SOPs, batch records, and historical deviation data. The same documents were being retrieved and processed six times.

The fix involved three changes. First, model routing: routine analytical steps run on fine-tuned smaller language models, while complex reasoning tasks that require nuanced judgment route to larger models. Not every agent step demands the same model capability. Second, aggressive context pruning: each agent receives only the subset of the investigation record relevant to its dimension, not the full file. Third, shared knowledge retrieval: a single RAG pipeline retrieves all relevant documents once, and the results are distributed to the specialist agents based on relevance scoring. One retrieval call serves all six agents instead of six independent calls duplicating work.

These changes reduced cost per investigation by approximately 8x, bringing unit economics within the original projections.

Failure Mode 4: Coordination Overhead

The supervisor agent was supposed to be a lightweight coordinator. In practice, it became the bottleneck. It spent more tokens managing agent interactions (formatting requests, parsing responses, resolving ambiguities, handling edge cases in agent outputs) than the specialist agents spent on actual investigation work. The coordination layer consumed more compute than the analytical layer.

Agent handoffs were a particular source of friction. When the supervisor passed context from one agent’s output to another agent’s input, information was lost or distorted. The agents did not share a common schema for their outputs, so the supervisor was constantly translating between formats. Parallel execution introduced additional problems: two agents occasionally attempted to access the same data source simultaneously, producing race conditions that led to incomplete or inconsistent retrievals.

The fix centered on clearly defined agent contracts. Each specialist agent now has a typed input schema and a typed output schema that specifies exactly what it receives and what it produces. The supervisor no longer translates between formats; it validates that each agent’s output conforms to its contract and routes structured data directly. We moved to asynchronous execution with explicit merge points, replacing ad hoc parallelism with a defined execution graph. A shared state store manages concurrent access to data sources, preventing the race conditions that had produced inconsistent results.
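A typed agent contract can be as simple as a pair of dataclasses plus a conformance check at the handoff. The field names here are hypothetical, and the production system may well use a schema library instead of plain dataclasses; the sketch only shows where the validation sits.

```python
# Sketch of a typed agent contract, assuming plain dataclasses.
from dataclasses import dataclass

@dataclass
class MaterialInput:
    batch_id: str
    material_lots: list   # lot IDs relevant to this dimension only

@dataclass
class MaterialOutput:
    batch_id: str
    findings: list        # structured findings, not free text
    confidence: float

def validate_output(obj):
    # The supervisor's contract check: a malformed handoff fails loudly
    # here instead of silently corrupting downstream context.
    if not isinstance(obj, MaterialOutput):
        raise TypeError(f"agent output violates contract: {type(obj).__name__}")
    return obj

out = MaterialOutput("B-4711", ["lot 88 within spec"], 0.85)
validate_output(out)                   # conforming output passes through
try:
    validate_output({"raw": "free text"})  # unstructured output is rejected
except TypeError as e:
    print("rejected:", e)
```

Failing at the contract boundary is the whole design choice: a bad handoff surfaces as one loud error at a known location, rather than as a subtle distortion three agents downstream.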

The Patterns That Survived

After twelve weeks of iteration, the system that reached production retained the multi-agent architecture but looked substantially different from the initial design. The patterns that survived the gauntlet of production failures are the ones we now apply to every multi-agent system we build:

  • Specialist agents with narrow, well-defined domains. Broad agents that try to cover multiple domains produce mediocre results across all of them. Narrow agents with clear boundaries produce reliable results within them.
  • A supervisor that validates before it synthesizes. The supervisor is not a passive aggregator. It actively verifies citations, checks for contradictions between agent outputs, and flags unsupported claims.
  • Distributed tracing across all agent interactions. Every agent action is traceable from the final output back to the originating input. No exceptions.
  • Model routing to control costs. Match model capability to task complexity. Not every step requires the largest available model.
  • Agent contracts with typed inputs and outputs. Informal, unstructured communication between agents is a source of bugs. Typed contracts eliminate an entire category of failure.
  • Human-in-the-loop at the final synthesis step. For high-stakes domains like pharmaceutical quality, the system produces a draft investigation report. A human quality specialist reviews, adjusts, and approves before the report is finalized.

Start With One Agent

Multi-agent systems are powerful, but they are not simple. The gap between an architecture diagram and a working production system is filled with failure modes that only appear when multiple autonomous components interact at scale. Cascading hallucinations, opaque debugging, cost explosions, and coordination overhead are not theoretical risks; they are the default experience for teams building their first multi-agent system.

Our advice, based on building through these failures: start with a single agent that does one thing well. Add a second agent only when the problem complexity genuinely demands it. Invest in observability infrastructure before adding more agents, not after. And budget twice the time you think you need, because the interactions between agents will surprise you in ways that the architecture diagram never predicted.