
The Orchestrator Trap: Why Your Multi-Agent System Keeps Falling Apart at the Seams

Most teams I talk to have the same problem. They built a beautiful demo — four agents, clean handoffs, a central orchestrator routing tasks like clockwork. Then they hit production and the whole thing starts fraying. Agents stall waiting for context they were never given. The orchestrator retries a failed subtask five times and burns a week's token budget in an afternoon. A downstream agent confidently acts on a hallucinated output from upstream and nobody catches it until a customer does.

The agents aren't the problem. The orchestration layer is.

What Orchestration Actually Means in a Multi-Agent System

Orchestration isn't just "send task to agent, collect result." In a real system, it covers:

  • Decomposition — breaking a goal into subtasks that agents can actually execute
  • Dependency resolution — knowing which subtasks must complete before others start
  • State propagation — making sure each agent gets exactly the context it needs, no more
  • Failure handling — deciding what to retry, what to escalate, what to abandon
  • Output validation — verifying that agent responses are usable before passing them forward

Most frameworks give you primitives for the first one and leave the rest as an exercise. That's where teams end up writing orchestration logic in ad-hoc glue code scattered across five files, and then wondering why the system behaves unpredictably.
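Dependency resolution is the easiest of these to pull out of glue code. Here is a minimal sketch, assuming the plan can be expressed as a mapping from each subtask to the subtasks it depends on. The subtask names and the `execution_waves` helper are hypothetical; `graphlib` is part of the Python standard library (3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plan: each key maps to the set of subtasks it depends on.
dependencies = {
    "search_flights": set(),
    "search_hotels": set(),
    "price_itinerary": {"search_flights", "search_hotels"},
    "send_summary": {"price_itinerary"},
}


def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into waves; every task in a wave has its dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the plan depends on itself
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to keep output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents unlock
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Tasks in the same wave are independent of each other, so the orchestrator can run them in parallel, and a circular plan fails loudly before any agent is called.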

The State Propagation Problem Is Worse Than It Looks

Here's the failure mode I see constantly. A planning agent produces a structured JSON plan. The orchestrator passes the whole thing to an execution agent. The execution agent reads only the fields it cares about, ignores the rest, and produces output that's technically correct for its subtask but misaligned with the broader plan — because some of that "ignored" context was load-bearing.

The fix isn't to dump more context into every agent call. That degrades badly: latency climbs, costs climb, and models start losing coherence over very long contexts even with large windows. The fix is explicit context contracts — each agent gets a defined input schema, produces a defined output schema, and the orchestrator is responsible for transforming between them.

typescript
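// Explicit contract between orchestrator and execution agent.
// DateRange, PricingConstraints, and LineItem are illustrative supporting
// types; their fields are assumptions, so adapt them to your domain.
interface DateRange { start: string; end: string } // ISO 8601 dates
interface PricingConstraints { maxTotal?: number; currency: string }
interface LineItem { label: string; amount: number }

interface PricingAgentInput {
  destination: string;
  travelDates: DateRange;
  passengerCount: number;
  constraints: PricingConstraints; // only what this agent needs
}

interface PricingAgentOutput {
  priceBreakdown: LineItem[];
  confidence: number; // 0–1
  warnings: string[];
}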

This isn't novel software engineering — it's the interface discipline we've always applied to microservices, just applied to agent boundaries. The problem is that LLM-native frameworks often encourage you to skip it because "the model will figure it out." It won't. Not reliably.
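The other half of a context contract is the transformation step the orchestrator owns. A rough sketch of that step, assuming the planner emits a nested dictionary: the plan shape, the `PricingInput` fields, and `to_pricing_input` are all made up for illustration and are not tied to any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingInput:
    """Hypothetical input contract for a pricing agent."""
    destination: str
    start_date: str
    end_date: str
    passenger_count: int
    max_total: float


def to_pricing_input(plan: dict) -> PricingInput:
    """Project the planner's full output onto the pricing agent's contract.

    A missing field raises KeyError here, at the handoff, instead of
    surfacing later as odd behavior inside the agent.
    """
    trip = plan["trip"]
    return PricingInput(
        destination=trip["destination"],
        start_date=trip["dates"]["start"],
        end_date=trip["dates"]["end"],
        passenger_count=len(plan["travelers"]),
        max_total=plan["budget"]["max_total"],
    )


# Hypothetical planner output. "notes" is deliberately outside the contract.
plan = {
    "trip": {"destination": "LIS", "dates": {"start": "2025-03-01", "end": "2025-03-08"}},
    "travelers": [{"name": "A"}, {"name": "B"}],
    "budget": {"max_total": 2400.0, "currency": "EUR"},
    "notes": "prefers morning departures",
}
pricing_input = to_pricing_input(plan)
```

The point of doing the projection in code is that "which context is load-bearing for this agent" becomes a reviewable decision in one function rather than something the model is left to infer.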

Retry Logic Will Destroy You If You Get It Wrong

Retry-on-failure is table stakes. But naive retries in an agentic system are dangerous in ways that don't exist in a normal API context.

If an agent fails partway through a multi-step task, retrying from the top can duplicate side effects — a booking gets created twice, an email goes out twice, an API call mutates state you thought was safe. You need idempotency at the action level, not just the network level.
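One way to get action-level idempotency is to derive a key from the task and the action's parameters, and to record the result the first time the action runs. This is a sketch under simplifying assumptions: `IdempotentExecutor` and `create_booking` are hypothetical, and the in-memory dictionary stands in for what would need to be a durable store.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentExecutor:
    """Runs each side-effecting action at most once per idempotency key."""

    def __init__(self) -> None:
        # In-memory for illustration only; this must survive restarts in practice.
        self._results: dict[str, Any] = {}

    @staticmethod
    def key_for(task_id: str, action: str, params: dict) -> str:
        payload = json.dumps(
            {"task": task_id, "action": action, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, task_id: str, action: str, params: dict,
            fn: Callable[[dict], Any]) -> Any:
        key = self.key_for(task_id, action, params)
        if key in self._results:
            # A retry: return the recorded result and skip the side effect.
            return self._results[key]
        result = fn(params)
        self._results[key] = result
        return result


bookings: list[dict] = []


def create_booking(params: dict) -> str:
    bookings.append(params)  # stands in for the real write
    return f"booking-{len(bookings)}"


executor = IdempotentExecutor()
first = executor.run("task-42", "create_booking", {"flight": "TP123"}, create_booking)
retry = executor.run("task-42", "create_booking", {"flight": "TP123"}, create_booking)
print(first, retry, len(bookings))
```

This does not cover a crash between the write and the record. Closing that gap usually means passing the key to the downstream API so it can deduplicate on its side.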

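The other failure mode is cascading retries. The orchestrator retries agent A. Agent A calls agent B. Agent B times out and retries internally. Now the retry counts multiply: three attempts at each of two layers means up to nine calls to B, and the fan-out grows exponentially with every layer you add. Meanwhile your system is hammering an external API with requests while the user stares at a spinner.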

The principle I apply: retry loops must be bounded and they must not cross side-effect boundaries without explicit idempotency guards. If an agent performs a write operation, the retry logic needs to know that before it retries.
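To make that principle concrete, here is a sketch of one possible approach: a single retry budget shared by every layer working on a task, plus a flag that tells the retry loop whether the call writes anything. `RetryBudget`, `call_with_budget`, and the flaky agent are hypothetical names, and only `TimeoutError` is treated as retryable here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the shared attempt budget for a task is used up."""


class RetryBudget:
    """One attempt counter shared by every layer working on the same task."""

    def __init__(self, max_attempts: int) -> None:
        self.remaining = max_attempts

    def spend(self) -> None:
        if self.remaining <= 0:
            raise RetryBudgetExceeded("retry budget exhausted")
        self.remaining -= 1


def call_with_budget(fn: Callable[[], T], budget: RetryBudget,
                     has_side_effects: bool = False,
                     idempotent: bool = False) -> T:
    """Retry fn on timeout until it succeeds or the shared budget runs out."""
    while True:
        budget.spend()  # every attempt, at any layer, draws from the same pool
        try:
            return fn()
        except TimeoutError:
            if has_side_effects and not idempotent:
                raise  # a write may have partially happened; do not retry blindly


calls = {"b": 0}


def flaky_agent_b() -> str:
    calls["b"] += 1
    raise TimeoutError("agent B timed out")


budget = RetryBudget(max_attempts=4)


def agent_a() -> str:
    # Agent A retries B internally, but against the orchestrator's budget.
    return call_with_budget(flaky_agent_b, budget)


try:
    call_with_budget(agent_a, budget)
    outcome = "ok"
except RetryBudgetExceeded:
    outcome = "escalated"
print(outcome, calls["b"])
```

Because the budget is shared, nesting retry loops cannot multiply the number of calls: the total across all layers is capped at `max_attempts`.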

Validation Between Agents Is Non-Negotiable

I've written before about building for stability before moving fast, and the same thinking applies here at the agent boundary level. Passing an unvalidated LLM output from one agent to the next is the agentic equivalent of concatenating unsanitized user input into a SQL query. You're trusting that the upstream agent produced something structurally sound, and that trust will be violated at the worst possible moment.

Validation doesn't have to be expensive. A Zod or Pydantic parse at each handoff point catches malformed output before it propagates. A lightweight confidence check — asking the agent to score its own output on a 0–1 scale and flagging anything below threshold — adds a cheap signal you can act on.

python
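from pydantic import BaseModel, Field, ValidationError

class FlightOption(BaseModel):
    # Illustrative placeholder; these fields are assumptions
    carrier: str
    price: float

class FlightSearchResult(BaseModel):
    flights: list[FlightOption]
    search_confidence: float = Field(ge=0.0, le=1.0)  # agent self-reported, 0–1
    fallback_triggered: bool

# raw_agent_output, orchestrator, and task_id come from the surrounding
# orchestration code and are not defined in this snippet
try:
    result = FlightSearchResult.model_validate(raw_agent_output)
except ValidationError as e:
    # Malformed output stops here instead of reaching the next agent
    orchestrator.escalate(task_id, reason=str(e))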

This is boring infrastructure work. It also keeps your system from silently doing the wrong thing at scale.
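The confidence check can be just as small. A sketch of a threshold gate on the self-reported score follows; the `Handoff` outcomes and both thresholds are assumptions I picked for illustration, not recommended values.

```python
from enum import Enum


class Handoff(Enum):
    PASS = "pass"      # forward to the next agent
    REVIEW = "review"  # flag for a human or a checker agent
    REJECT = "reject"  # do not propagate


def gate_on_confidence(confidence: float, pass_at: float = 0.8,
                       reject_below: float = 0.4) -> Handoff:
    """Map a self-reported 0 to 1 confidence score onto a handoff decision."""
    if not 0.0 <= confidence <= 1.0:
        # An out-of-range (or NaN) score is itself a malformed output.
        return Handoff.REJECT
    if confidence >= pass_at:
        return Handoff.PASS
    if confidence < reject_below:
        return Handoff.REJECT
    return Handoff.REVIEW


for score in (0.93, 0.55, 0.1, 1.7):
    print(score, gate_on_confidence(score).value)
```

Self-reported scores are a weak signal, so the middle band routes to review rather than forcing a pass-or-fail call.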

The Orchestrator Should Be Thin, Not Smart

The temptation is to make the orchestrator itself an LLM — have it reason about task dependencies, handle failures dynamically, adapt the plan on the fly. Sometimes that's the right call. More often it's a source of nondeterminism in the one place where you need determinism most.

I prefer a tiered model: the orchestrator is deterministic code that follows explicit rules. LLM reasoning happens inside agents, at the leaf nodes. If the orchestrator needs to make a judgment call — say, an ambiguous user goal needs clarification before decomposition — that's a defined escalation path, not a free-form reasoning loop.

This is also the architecture that survives debugging. When something breaks, you want to be able to read a log and know exactly what the orchestrator decided and why. "The LLM decided" is not a satisfying answer at 2am.
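Here is roughly what "deterministic code that follows explicit rules" can look like for failure handling. This is a sketch, not a prescription: the failure kinds, the `POLICY` table, and the `decide` function are hypothetical. Every decision carries the rule that produced it, which is what makes the log readable later.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    ABANDON = "abandon"


@dataclass(frozen=True)
class Decision:
    task_id: str
    failure: str
    attempt: int
    action: Action
    rule: str  # why the orchestrator chose this action


# Hypothetical policy table: failure kind -> (action, max retries).
POLICY = {
    "timeout": (Action.RETRY, 2),
    "validation_error": (Action.ESCALATE, 0),
    "ambiguous_goal": (Action.ESCALATE, 0),
    "upstream_cancelled": (Action.ABANDON, 0),
}


def decide(task_id: str, failure: str, attempt: int) -> Decision:
    """Pick an action for a failed attempt; attempt is 1-based."""
    if failure not in POLICY:
        # Unknown failures never get improvised handling.
        return Decision(task_id, failure, attempt, Action.ESCALATE,
                        "unknown failure kind")
    action, max_retries = POLICY[failure]
    if action is Action.RETRY and attempt > max_retries:
        return Decision(task_id, failure, attempt, Action.ESCALATE,
                        f"{failure}: retries exhausted ({max_retries})")
    return Decision(task_id, failure, attempt, action, f"{failure}: policy table")


for attempt in (1, 2, 3):
    decision = decide("task-7", "timeout", attempt)
    print(decision.attempt, decision.action.value, "|", decision.rule)
```

The same inputs always produce the same decision, so a production incident can be replayed from the log instead of guessed at.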

The Deployment Reality

Building a well-orchestrated multi-agent system isn't slow work — the engineering itself ships in days to weeks with modern tooling. What slows teams down is the absence of clear contracts between agents from the start, which means every production incident becomes a reverse-engineering exercise. The data schemas, the failure policies, the validation rules: those decisions need to happen before you wire up your first agent-to-agent call, not after your first production outage.

If you're choosing which framework to build on, the lock-in question matters too — I'd think carefully about platform dependencies before committing your orchestration logic to any single vendor's abstraction.

The agents are the exciting part. The orchestration is the job.
