The recurring smell
Most AI outages I get called into are not model problems. They are architecture problems disguised as model issues. Latency spikes, random failures, wrong answers, costs drifting 3x in a quarter. Everyone blames the LLM. Nine times out of ten, it is control planes with no guardrails, retrieval that drifts over time, or retry storms you created yourself.
I have seen teams spend weeks swapping models while a single missing backpressure valve kept their production p95 at 15 seconds.
Where this shows up and why
- RAG systems that worked in a demo, then collapse when the index grows 10x
- Agents that loop on tool calls and eat your monthly budget in a weekend
- Multi-tenant apps where one noisy customer takes everyone down
- Model upgrades that silently degrade quality because eval is weak or missing
Why it happens:
- PoC code becomes production by inertia
- LLM treated as a pure function, but the workload is stateful and spiky
- Over-trusting a vector database to solve retrieval without governance
- No observability that ties cost, latency, and content back to a specific query and version
What most teams misunderstand:
- More context is not better. It is slower, costlier, and often reduces accuracy
- Fine-tuning is not a silver bullet. Data quality and orchestration usually matter more
- “We log prompts” is not observability. You need structured traces with lineage and versioning
Deep dive into the mistakes
1) Stateless thinking in a stateful workflow
Symptoms: long chats with ballooning context, tool-call loops, inconsistent answers between turns.
Why: requests are stateless, but the task is not. Without session policy, token budgets, and memory discipline, you create unbounded context.
Failure mode: p95 latency and cost creep up over time. Accuracy drops as the prompt becomes noise.
2) Retrieval by vibes
Symptoms: top k=5 stuffed into the prompt, irrelevant chunks, month-old indexes, and silent schema drift.
Why: teams skip document normalization, chunk strategy, and metadata governance. They never version embeddings or indexes.
Failure mode: evaluation looks fine on day 1, degradation begins on day 30, support tickets pile up by day 90.
3) No backpressure or circuit breakers
Symptoms: provider 429, cascading retries, timeouts that fan out to every dependency.
Why: direct synchronous calls to LLMs with optimistic concurrency. No queues, no rate limiters, no budget caps.
Failure mode: a single traffic spike or a minor provider incident takes down your entire request path.
4) Over-abstracted orchestration
Symptoms: a pretty graph of nodes that hides real costs and token paths. Hard to debug, impossible to tune.
Why: library-first design. Business logic gets buried under a flow framework with magic retries and hidden prompts.
Failure mode: fragmented prompts, duplicate tool calls, and ghost retries you cannot turn off.
5) Caching the wrong thing
Symptoms: 2 percent cache hit rate and negligible savings.
Why: caching full completions with variable headers or timestamps. No normalization, no tiered caches, no pre-LLM caching.
Failure mode: you pay for the same retrieval, rerank, and system prompt again and again.
6) Premature fine-tuning
Symptoms: expensive training cycles to fix what is really prompt discipline or retrieval quality.
Why: fine-tuning feels like control. The issue was data freshness, schema, or guardrails.
Failure mode: model lock-in, infra sprawl, and no measurable uplift.
7) Weak multi-tenant isolation
Symptoms: one large customer burns through RPS and tokens, everyone else slows down.
Why: global limits only. No per-tenant concurrency, no per-tenant cost caps, no shard-aware caches.
Failure mode: noisy neighbor plus confusing bills.
8) Thin evaluation and no change management
Symptoms: model upgrade ships. Support volume jumps. Nobody knows why.
Why: narrow golden sets, no hallucination checks, no adversarial prompts, no regression gates, no shadowing.
Failure mode: quality whiplash on every dependency change across model, embeddings, or index.
9) Missing observability where it matters
Symptoms: logs everywhere, insight nowhere.
Why: unstructured logs, no request IDs, no token or cost annotation, no linkage from final answer to retrieved docs and versions.
Failure mode: you cannot reproduce or defend a single output.
Practical fixes that hold up
Put the control plane first
- Per-tenant quotas: RPS, concurrent calls, and token budgets
- Cost guards: max tokens per request, per flow, and per session
- Circuit breakers per provider and per model family, with clear fallback chains
- Backpressure: queue spikes, not people. Use priority queues for human-in-the-loop paths
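The breaker-plus-fallback part of this is small enough to sketch. Here is a minimal version, assuming a list of provider callables tried in priority order; the threshold and cooldown numbers are illustrative, not recommendations:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, lets a probe
    request through again after `cooldown` seconds."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # half-open: allow one probe after the cooldown elapses
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

def call_with_fallback(breakers, providers, request):
    """Try providers in order, skipping any whose breaker is open."""
    for name, call in providers:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = call(request)
            breaker.record_success()
            return name, result
        except Exception:
            breaker.record_failure()
    raise RuntimeError("all providers unavailable")
```

The point is that the fallback chain is explicit and yours, not a retry loop hidden inside a client library.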
Treat context as a scarce resource
- Hard token budgets per step. Force summarization or truncation by role: system > instructions > user > retrieved
- Dedup retrieval by document ID and section. Do not paste the same paragraph twice
- Rerank before generate. Rerankers are cheap. Generations are not
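A budget enforcer and a dedup pass are a few lines each. This sketch uses a crude word count as the token counter; swap in your model's tokenizer. Section order encodes the role priority from above:

```python
def budget_context(sections, max_tokens, count=lambda s: len(s.split())):
    """Trim context to a hard token budget. `sections` is a list of
    (role, text) in priority order: system > instructions > user >
    retrieved. The first overflowing section is truncated; everything
    after it is dropped."""
    kept, used = [], 0
    for role, text in sections:
        tokens = count(text)
        if used + tokens <= max_tokens:
            kept.append((role, text))
            used += tokens
        else:
            remaining = max_tokens - used
            if remaining > 0:
                kept.append((role, " ".join(text.split()[:remaining])))
            break
    return kept

def dedup_chunks(chunks):
    """Drop repeated (doc_id, section) chunks before prompting, so
    the same paragraph is never pasted twice."""
    seen, out = set(), []
    for c in chunks:
        key = (c["doc_id"], c["section"])
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out
```

Retrieved content sits last in the priority order on purpose: it is the part you can always re-fetch, summarize, or rerank down.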
Make retrieval boring and reliable
- Normalize and chunk by semantic units, not fixed windows only
- Attach tight metadata: source, section, version, timestamp, permissions
- Version embeddings and indexes. Store embedding model name, dim, and creation time
- Refresh policies: incremental builds daily, full rebuilds weekly or on schema change
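The versioning piece can be as simple as a manifest stored next to each index build. The field names here are illustrative, not a standard schema; the two checks are the ones that matter, refusing mismatched embeddings and flagging stale builds:

```python
from dataclasses import dataclass
import datetime as dt

@dataclass
class IndexManifest:
    """Version record written alongside each index build."""
    index_name: str
    embedding_model: str
    embedding_dim: int
    schema_version: str
    built_at: dt.datetime

    def compatible_with(self, query_model: str, query_dim: int) -> bool:
        """Refuse to search if the query embedding does not match
        the embeddings this index was built with."""
        return (self.embedding_model == query_model
                and self.embedding_dim == query_dim)

    def needs_rebuild(self, now: dt.datetime, max_age_days: int = 7) -> bool:
        """Drive the weekly full-rebuild policy from the manifest."""
        return (now - self.built_at).days >= max_age_days
```

Checking `compatible_with` at query time is what turns a silent embedding-model upgrade into a loud, immediate failure instead of month-two quality drift.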
Cache where it matters
- Tier 1: pre-LLM cache for deterministic steps, like retrieval results and rerank outputs keyed by normalized query and index version
- Tier 2: prompt template plus normalized variables for short deterministic generations
- Normalize inputs: lowercase, strip timestamps, sort keys
- Track hit rate and dollar savings per cache layer
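Most low-hit-rate caches die at the key, so here is a minimal sketch of a normalized key builder. The timestamp regex only covers ISO-style dates and is an assumption; extend it for whatever volatile tokens your queries actually carry:

```python
import hashlib
import json
import re

def cache_key(query, index_version, prompt_version):
    """Build a stable pre-LLM cache key: lowercase, collapse
    whitespace, mask timestamps, and include every version that
    affects the result so upgrades invalidate old entries."""
    q = query.lower().strip()
    q = re.sub(r"\s+", " ", q)
    q = re.sub(r"\d{4}-\d{2}-\d{2}[t ]?[\d:]*", "<ts>", q)
    payload = json.dumps(
        {"q": q, "index": index_version, "prompt": prompt_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because the index and prompt versions are part of the key, a rebuild or prompt change never serves stale results; it simply starts a fresh cache generation.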
Observability that survives audits
- Structured tracing with a single correlation ID per request and per session
- Log token counts, latency, model name, prompt hash, and exact retrieved document IDs with versions
- Capture final answer plus intermediate tool outputs and decisions
- Redact PII at the edge. Encrypt traces at rest. Keep a retention policy
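Concretely, one trace event per request with lineage baked in. The field set below is a suggestion, not a spec; the non-negotiable part is that the answer, the retrieved document versions, and the costs share one correlation ID:

```python
import time

def make_trace(correlation_id, session_id, model, prompt_hash,
               retrieved, answer, tokens_in, tokens_out,
               cost_usd, latency_ms):
    """One structured trace event per request. `retrieved` is a list
    of (doc_id, doc_version) pairs so every answer has lineage back
    to exactly what was in the context."""
    return {
        "correlation_id": correlation_id,
        "session_id": session_id,
        "model": model,
        "prompt_hash": prompt_hash,
        "retrieved_docs": [{"doc_id": d, "version": v} for d, v in retrieved],
        "answer": answer,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        "ts": time.time(),
    }
```

Ship these as JSON lines to whatever store you already have; the schema matters far more than the backend.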
Evaluation with teeth
- Golden sets by scenario, not just random samples. Include long input, adversarial input, known tricky entities
- Mix judges: rule checks, LLM judges, and task-specific metrics
- Shadow deploy and canary any change to model, embedding, index, or prompt
- Require non-regression thresholds on quality and latency to ship
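A release gate with teeth is a comparison, not a dashboard. A minimal sketch, assuming each eval run produces a `quality` score in 0..1 and a `p95_ms` latency; the thresholds are placeholders to tune per product:

```python
def regression_gate(baseline, candidate,
                    max_quality_drop=0.01, max_latency_increase=0.10):
    """Block a release if quality drops more than `max_quality_drop`
    (absolute) or p95 latency rises more than `max_latency_increase`
    (relative) versus the baseline run."""
    quality_drop = baseline["quality"] - candidate["quality"]
    latency_rise = (candidate["p95_ms"] - baseline["p95_ms"]) / baseline["p95_ms"]
    failures = []
    if quality_drop > max_quality_drop:
        failures.append(f"quality dropped {quality_drop:.3f}")
    if latency_rise > max_latency_increase:
        failures.append(f"p95 rose {latency_rise:.0%}")
    return (len(failures) == 0, failures)
```

Wire this into CI so a model, embedding, index, or prompt change cannot merge without a passing gate, the same way a failing unit test blocks a merge.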
Keep orchestration legible
- Prefer explicit code over magic DAGs for core logic
- One prompt per responsibility. Small, named, versioned
- Tool calls are idempotent with timeouts and retries that you control
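"Retries that you control" means a visible policy, not a framework default. A minimal sketch with bounded attempts and jittered exponential backoff; only the exception types you name get retried, everything else fails fast:

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=0.1, retriable=(TimeoutError,)):
    """Retry a tool call with jittered exponential backoff. Only
    exceptions in `retriable` are retried; the last failure is
    re-raised so nothing is swallowed silently."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise
            # full jitter: spreads retry storms instead of syncing them
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Pair this with idempotent tool endpoints: a retried call that is safe to apply twice turns the backoff above from a risk into plain resilience.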
Multi-tenant isolation as a feature
- Per-tenant concurrency and token rate limits
- Separate queues and caches for large tenants
- Cost attribution at the request level and monthly caps with alerts
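Per-tenant token limits fall out of a token bucket kept per tenant. A sketch, assuming you inject the clock (handy for tests); in production you would back this with a shared store rather than in-process state:

```python
import time

class TenantBudget:
    """Token-bucket limiter for one tenant: refills `rate` tokens per
    second up to `capacity`. One instance per tenant keeps a noisy
    neighbor from draining a global pool."""
    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def try_spend(self, cost, now=None):
        """Spend `cost` tokens if the bucket allows it; otherwise
        reject so the caller can queue or shed the request."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A rejected `try_spend` is also your cost-attribution hook: log it with the tenant ID and you get both the throttle and the billing evidence from one mechanism.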
Business impact in real numbers
- Latency: cutting unbounded context and adding rerankers typically drops p95 by 30 to 60 percent
- Cost: pre-LLM caching plus token budgets often reduces spend 25 to 50 percent. In one case, cache and rerank lowered generation spend from 42k to 19k per month
- Reliability: circuit breakers and backpressure take 429 error rates from bursts of 15 percent to under 1 percent during provider incidents
- Quality: retrieval governance and real eval stop the slow drift that drives support cost. Expect 20 to 40 percent reduction in escalations for RAG-heavy products
What to remember
- Put guardrails around tokens, not just requests
- Retrieval is a data system. Version it like one
- Backpressure and circuit breakers are mandatory if you call external models
- Cache earlier in the flow. Normalize everything
- Eval is a release gate, not a dashboard after the fact
- Observability needs lineage: model, prompt, retrieval, and costs tied to the same trace
If this sounds familiar
If you see rising p95, climbing token bills, or quality that drifts month to month, it is probably not a model problem. It is architecture. This is exactly the kind of work I help teams untangle when systems start breaking at scale. Happy to take a look at your traces, budgets, and retrieval pipeline and point out where the leaks are.

