The hidden bottlenecks in multi-agent AI systems
Everyone loves the demo where a planner agent hands work to a researcher, who hands work to a critic, who hands work to a summarizer, and it all looks intelligent. Then you ship it to real traffic and your p95 jumps to 40 seconds, tool costs double, and you are paged at midnight because agents decided to negotiate with each other forever. I have seen more multi-agent systems killed by orchestration tax than by model quality. The model is not your only bottleneck. Often it is not even the main one.
Where these problems show up
- Research assistants that browse, extract, and write
- Code generation flows with planner, implementer, tester, fixer
- Customer support triage with multiple specialist agents
- Sales intel or market mapping with parallel web tooling
Why it happens and what most teams miss
- Every agent hop adds hidden overhead: tokenization, JSON validation, schema coercion, context packing, and network.
- Tools dominate latency and cost, not the LLM. Web search, scraping, database calls, and serverless cold starts are the killers.
- Long-tail behavior. Your mean looks fine but p95 explodes due to retries, backoffs, or a single slow tool that blocks the rest.
- Context bloat. Each hop re-sends a giant transcript. You pay multiple times for the same megabytes of tokens.
- Unstable loops. Two or three agents can ping-pong with confident nonsense unless you force convergence.
- Orchestrator hot spots. Python event loops, GIL contention, and overzealous tracing can be more expensive than inference.
Technical deep dive
Architecture-level reality, not the diagram on the slide
A typical multi-agent stack:
- Entry API or queue
- Orchestrator or graph runtime (LangGraph, AutoGen, CrewAI, custom)
- Memory and state store (Postgres, Redis, KV)
- Vector DB for retrieval
- Tooling layer: HTTP clients, browser automation, SQL, internal APIs, serverless functions
- LLM providers and small local models
- Tracing and metrics
1) Orchestration tax per hop
- JSON schema validation and Pydantic style coercion add 5 to 50 ms each step.
- Tokenization and context assembly can be 50 to 250 ms on non-trivial prompts, especially with Python tokenizers.
- Serialization, logging, and span export add 10 to 100 ms depending on your observability setup.
- If your agent chain has 6 hops, a conservative 150 ms per hop is already 900 ms of overhead without a single model token generated.
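The per-hop overhead is easy to underestimate because each item looks small in isolation. A minimal sketch of the arithmetic, using midpoints of the (illustrative) ranges above:

```python
# Back-of-envelope per-hop orchestration tax. The per-item costs below
# are assumed midpoints of the ranges listed above, not measurements.
PER_HOP_MS = {
    "schema_validation": 25,   # JSON / Pydantic-style coercion
    "tokenization": 100,       # context assembly and tokenization
    "serialization": 30,       # logging and span export
}

def orchestration_tax_ms(hops: int) -> int:
    """Total non-model overhead for a chain of `hops` steps."""
    return hops * sum(PER_HOP_MS.values())

# Six hops at ~155 ms each is over 900 ms before the model emits a token.
print(orchestration_tax_ms(6))
```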
2) Tool latency dwarfs LLM latency
- Web search APIs: 300 to 800 ms each, often with rate limits
- Scraping via Playwright: 1.5 to 6 seconds with cold starts
- Internal microservices: 30 to 200 ms, but fanout across 5 services stacks to seconds
- Vector DB: 20 to 200 ms per query, worse under high QPS or if you do cross-namespace fanout
3) Rate limits and retry storms
- Vendor 429 leads to synchronized exponential backoff across agents.
- Per-agent limiters cause head-of-line blocking even when global capacity exists.
- Missing idempotency keys make retried tool calls do duplicate work.
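A sketch of content-hash idempotency keys for tool calls, so a retried call hits the cache instead of doing duplicate work. The in-process dict is illustrative; production would back this with Redis or similar.

```python
import hashlib
import json

# Illustrative in-process cache; swap for Redis or another shared store.
_cache: dict = {}

def idempotency_key(tool: str, args: dict) -> str:
    """Deterministic key from the tool name and its arguments."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def call_tool(tool: str, args: dict, fn):
    """Run `fn(**args)` at most once per unique (tool, args) pair."""
    key = idempotency_key(tool, args)
    if key in _cache:          # a retry lands here and does no new work
        return _cache[key]
    result = fn(**args)
    _cache[key] = result
    return result
```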
4) Context bloat and transcript reuse
- Many frameworks resend the full conversation to each agent. Per-hop context then grows linearly with the number of hops, which makes total token cost quadratic in hops once you add critics and reviewers.
- I have seen 30 to 50 percent of tokens spent on re-sending irrelevant history.
5) Loop instability
- Supervisor patterns that rely on open-ended critique can oscillate.
- Without a typed state and termination conditions, agents disagree and keep talking.
- A tiny percent of tasks loop forever, but they torch your p99 and your budget.
6) Infrastructure friction
- Serverless tools cold start often. 300 ms for Python, 1 to 2 s for browsers.
- HTTP connection churn from naive clients: no keep-alive, no HTTP/2 multiplexing.
- Vector DB partitioning that scatters reads across many shards for small queries.
- Local model serving with KV cache thrash when multiple agents interleave.
7) Observability overhead
- Per-token spans or verbose trace exports over the wire kill throughput.
- Oversampled tracing adds 5 to 10 percent latency on hot paths. It looks small until you have 10 hops.
Failure modes I keep seeing
- Deadlocks when a tool depends on a state write that never commits due to a cancelled span.
- Message explosion via fanout to N agents where most do duplicate work.
- State drift: different agents read stale copies of working memory, overwrite each other, and argue.
- Subtle content security failures when scraping tools follow redirects to login walls and stall.
Practical solutions that actually work
These are the patterns I recommend when teams ask me to make their multi-agent system not slow and not expensive.
1) Replace chatty agents with a typed state machine
- Model your workflow as a graph with explicit nodes and transitions, not a chat between personas.
- One executor LLM node with tool use is often enough. Add small local routers or critics sparingly.
- Use structured state, not transcripts. Pass only what the next node needs. Keep the rest in a store.
- Hard-cap hops and set termination conditions. If you cannot write a clear end state, your design is not ready.
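A minimal sketch of what "graph with typed state, hard-capped hops, and a forced end state" can look like. The node names and fields are made up for illustration, not taken from any framework.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Node(Enum):
    PLAN = auto()
    EXECUTE = auto()
    REVIEW = auto()
    DONE = auto()

@dataclass
class TaskState:
    """Typed working state passed between nodes, not a chat transcript."""
    query: str
    findings: list = field(default_factory=list)
    hops: int = 0
    node: Node = Node.PLAN

MAX_HOPS = 6  # hard cap: the graph cannot run past this

def step(state: TaskState) -> TaskState:
    state.hops += 1
    if state.hops >= MAX_HOPS:
        state.node = Node.DONE          # forced finalize path
        return state
    transitions = {Node.PLAN: Node.EXECUTE,
                   Node.EXECUTE: Node.REVIEW,
                   Node.REVIEW: Node.DONE}
    state.node = transitions[state.node]
    return state
```

The point is that every transition is explicit and every run reaches `DONE`; there is no open-ended critique loop for two personas to get stuck in.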
2) Collapse hops and parallelize tools
- Merge planning and execution when the plan is short. Spend tokens on better tools, not more agents.
- Fire independent tool calls in parallel. Speculate cheap calls ahead of time when likely needed.
- Use early exit. If top 3 docs are enough, stop retrieval. Do not let a critic force more hits by default.
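Parallel tool calls with early exit can be sketched with plain asyncio. The `fetch_doc` stand-in simulates a variable-latency retrieval tool; the "top 3 is enough" threshold is the illustrative early-exit rule from above.

```python
import asyncio

async def fetch_doc(i: int) -> str:
    """Stand-in for a retrieval tool with variable latency."""
    await asyncio.sleep(0.01 * i)
    return f"doc-{i}"

async def retrieve(needed: int = 3, candidates: int = 8) -> list:
    # Fire all independent calls at once instead of sequentially.
    tasks = [asyncio.create_task(fetch_doc(i)) for i in range(candidates)]
    docs = []
    for fut in asyncio.as_completed(tasks):
        docs.append(await fut)
        if len(docs) >= needed:   # early exit: top-N is enough
            break
    for t in tasks:               # cancel the stragglers
        t.cancel()
    return docs

docs = asyncio.run(retrieve())
```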
3) Control context explicitly
- Maintain a working memory object with typed fields. Do not re-send history.
- Delta prompts: send only the relevant slice per node.
- Trim thought tokens. Cap reasoning depth and use short chain-of-thought strategies. Hidden prompts are still tokens and time.
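One way to make "send only the relevant slice per node" concrete: a typed working-memory object plus a per-node field whitelist. Node names and fields here are hypothetical.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class WorkingMemory:
    """Typed working memory held in a store, never resent wholesale."""
    goal: str
    evidence: list = field(default_factory=list)
    draft: str = ""
    critique_notes: list = field(default_factory=list)

# Illustrative whitelists: each node sees only the fields it needs.
NODE_VIEWS = {
    "writer": ("goal", "evidence"),
    "critic": ("goal", "draft"),
}

def delta_prompt(memory: WorkingMemory, node: str) -> dict:
    """Build the per-node context slice instead of the full transcript."""
    full = asdict(memory)
    return {k: full[k] for k in NODE_VIEWS[node]}
```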
4) Rate limiting and backpressure at the right place
- Global token-bucket across all agents for a task, not per-agent. That prevents a reviewer from starving the executor.
- Idempotency keys for tool calls. Cache by content hash for deterministic work like retrieval and scraping.
- Circuit breakers around slow tools. Fallback to cached or approximate results.
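A sketch of the global token bucket: one bucket per task, shared by every agent, so a chatty reviewer cannot starve the executor. Rate and capacity are illustrative.

```python
import time

class TokenBucket:
    """One bucket shared across all agents working on a task."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens replenished per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False                  # caller queues or sheds, not spins

# Illustrative numbers: 10 calls/s sustained, bursts of 5.
bucket = TokenBucket(rate=10, capacity=5)
```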
5) Fix the infrastructure tax
- Use HTTP keep-alive and HTTP/2 multiplexing. Reuse clients. Warm serverless functions with min instances if you rely on them.
- Reduce Python overhead. Avoid heavy validation in hot loops. Use dataclasses or pydantic compiled modes. Consider a small Go or Rust sidecar for tokenization and context assembly if that is hot.
- Batch vector DB queries and filter first using metadata. Keep an in-memory working set store for the current task.
- If serving local models, pin concurrency to avoid KV cache churn. Group turns for the same session.
6) Observability that pays for itself
- Per-step traces with these fields: step_type, tokens_in, tokens_out, tool_calls, tool_latency_sum, retry_count, loop_depth, context_bytes.
- Sample aggressively at the span level, not the token level. Keep 100 percent sampling for errors and long tails.
- Correlate user task id through all agents and tool calls. If you cannot reconstruct a single task path, you cannot fix p95.
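The trace record and the sampling rule can be sketched together. Field names follow the list above; the thresholds in `should_sample` are assumed, and the important property is that errors and long tails are always kept.

```python
from dataclasses import dataclass

@dataclass
class StepTrace:
    """Per-step trace record with the task_id correlated throughout."""
    task_id: str
    step_type: str
    tokens_in: int
    tokens_out: int
    tool_calls: int
    tool_latency_sum_ms: float
    retry_count: int
    loop_depth: int
    context_bytes: int

def should_sample(trace: StepTrace, base_rate: float = 0.05) -> bool:
    # Keep 100 percent of errors and long tails (thresholds illustrative),
    # sample everything else at the base rate.
    if trace.retry_count > 0 or trace.tool_latency_sum_ms > 2000:
        return True
    return hash(trace.task_id) % 100 < base_rate * 100
```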
7) Hard budgets and kill switches
- Time budget per task and per node. Cancel downstream work when the budget is gone.
- Token budget per task. If exceeded, degrade gracefully: fewer docs, smaller models, cached results.
- Loop guardrails: max transitions, max critique rounds, forced finalize path.
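A minimal sketch of a per-task budget object combining the three guardrails above. Limits are illustrative; the orchestrator checks `exhausted()` before each transition and takes the forced-finalize path when it trips.

```python
import time

class TaskBudget:
    """Hard time, token, and transition budgets for one task."""

    def __init__(self, max_seconds: float, max_tokens: int,
                 max_transitions: int):
        self.deadline = time.monotonic() + max_seconds
        self.tokens_left = max_tokens
        self.transitions_left = max_transitions

    def charge(self, tokens: int) -> None:
        """Record one transition and the tokens it consumed."""
        self.tokens_left -= tokens
        self.transitions_left -= 1

    def exhausted(self) -> bool:
        return (time.monotonic() > self.deadline
                or self.tokens_left <= 0
                or self.transitions_left <= 0)

# Illustrative limits for a customer-facing task.
budget = TaskBudget(max_seconds=30, max_tokens=20_000, max_transitions=8)
```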
8) Test for the long tail, not the happy path
- Soak tests with 1 to 2 percent injected 429s and 100 to 500 ms random jitter on tools.
- Monte Carlo plans. If a plan needs more than 8 steps in 20 percent of runs, simplify the graph.
- Measure tasks per dollar, not just accuracy.
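The fault injection above can be a thin wrapper around any tool function. The exception class and rates here are illustrative; the point is that soak tests exercise retries and jitter on purpose.

```python
import random
import time

class Injected429(Exception):
    """Stand-in for a vendor rate-limit response."""

def chaos_tool(fn, *, p_429: float = 0.02,
               jitter_ms: tuple = (100, 500), rng=None):
    """Wrap a tool call with injected 429s and random latency jitter."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        time.sleep(rng.uniform(*jitter_ms) / 1000.0)  # injected jitter
        if rng.random() < p_429:
            raise Injected429("injected rate limit")
        return fn(*args, **kwargs)

    return wrapped
```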
Rough math to sanity-check a design
Suppose you have 6 hops. Each hop has:
- 200 ms orchestration tax
- 1.2 s model latency
- 400 ms of tool time for half the hops on average
Naively that is 1.2 s of overhead, 7.2 s of model time, and 1.2 s of tool time: roughly 9.6 s at p50. Now collapse the chain to 3 hops:
- Overhead: 0.6 s
- Model: 3.6 s
- Tools: same work, but more parallel and cached, ~0.6 to 0.8 s
- Total p50 ~ 4.8 to 5.0 s. That is the difference between usable and abandoned.
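The same arithmetic as a sketch you can plug your own (measured, not assumed) numbers into:

```python
def chain_p50_s(hops: int, orchestration_ms: float = 200,
                model_s: float = 1.2, tool_s: float = 0.4,
                tool_fraction: float = 0.5) -> float:
    """Rough p50 for a sequential agent chain. Defaults are the
    illustrative numbers from the text, not measurements."""
    overhead = hops * orchestration_ms / 1000.0
    model = hops * model_s
    tools = hops * tool_fraction * tool_s
    return overhead + model + tools

naive = chain_p50_s(6)      # ~9.6 s for the 6-hop chain
collapsed = chain_p50_s(3)  # ~4.8 s after collapsing to 3 hops
```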
Business impact you can take to a roadmap meeting
- Cost: Multi-agent transcripts inflate tokens. I often see 1.5 to 2.3x token spend per task vs a single-agent skill graph. At scale, that is real money.
- Latency: Every additional hop increases abandonment. If you are customer facing, a 6 s to 12 s swing kills conversion and NPS.
- Reliability: Loop escapes and retry storms lead to midnight incidents. On-call cost is not just money. It is team morale and velocity.
- Scaling risk: Provider rate limits hit earlier because you split one task into many concurrent calls. You also fan out to more external services, widening the blast radius.
- Complexity tax: More agents means more prompts to maintain and more interactions to test. Your iteration speed drops.
Key takeaways
- Most multi-agent slowdowns are self-inflicted by orchestration, not the LLM.
- Replace chatty personas with a typed state machine and strict termination.
- Optimize tool usage first. Parallelize, cache, and circuit break.
- Control context. Stop re-sending transcripts. Keep structured state.
- Apply global budgets and backpressure. Kill work you cannot afford.
- Trace per step with real metrics. Fix p95, not just p50.

