Most AI outages I get called into are not model problems. They are system problems wearing model symptoms. The app is slow, answers change between retries, costs spike on Tuesdays, and no one can say why a prompt that worked yesterday fails today. The model didn’t wake up cranky. The system was never designed like a system.
Where the cracks show up
- Intermittent correctness: same input, different answer, and no trace of what changed. Usually a hidden prompt template change, a different retrieval set, or a fallback model kicking in silently.
- Tail latency blowups: p50 looks fine until your p99 crosses your SLA because one tool call or embedding lookup stalls the entire chain.
- Cost outliers: 10 percent of calls account for 60 percent of spend due to token bloat, runaway retries, or hidden fan-out.
- “Works in staging, melts in prod”: retrieval indexes out of date, embeddings from a different model dimension, or a schema change that dropped 30 percent of your corpus.
- Silent regressions: a new prompt ships, the acceptance test still passes, but slightly more cases miss key facts. You do not notice for two weeks.
I’ve seen a fintech ship a chat assistant that passed a happy-path demo but cratered under live traffic. Root cause: exponential backoff combined with user retries created request amplification into the LLM gateway. The fix was two lines of circuit breaker logic and a cache. It took three days to find because there were no traces across steps.
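The fix in that story can be sketched in a few lines. This is an illustrative circuit breaker, not the client's actual code; the class name, thresholds, and `call_gateway` wrapper are all assumptions.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors instead of amplifying retries."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: let a probe through after the cooldown elapses.
        return time.monotonic() - self.opened_at >= self.reset_timeout_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def call_gateway(breaker, fn):
    """Wrap an LLM gateway call; fail fast when the circuit is open."""
    if not breaker.allow_request():
        raise RuntimeError("circuit open: failing fast")
    try:
        result = fn()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

The point is not the specific thresholds; it is that a fast, explicit failure is cheaper than a retry storm that the gateway absorbs silently.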
Why this happens in real systems
- Treating the LLM like a function call instead of an unreliable, probabilistic service with external dependencies. If you assume determinism, you design zero guardrails.
- No explicit budgets. Latency, tokens, and cost are not treated as first-class constraints, so prompts and retrieval keep growing until something breaks.
- Prompt and schema coupling. A minor change to a tool or output schema silently invalidates prompt assumptions. Without versioning, good luck debugging.
- Retrieval treated as a black box. Ingestion, chunking, and reindexing are ops concerns until they suddenly become correctness concerns.
- Evaluation as a one-off event. Teams run an initial eval, declare victory, and never wire it into CI or production drift detection.
The common misunderstanding is thinking “model quality” is the major lever. In reality, most production wins come from mundane system design: caching, routing, timeouts, telemetry, and contracts.
Technical deep dive: design the system, not just the prompt
Here is a production-grade mental model I use. You can scale this from a single feature to a platform.
1) Request router and policy layer
- Responsibilities: rate limiting, tenant isolation, authn/authz, A/B flags, and model selection policy.
- Trade-offs: simple static routing is easier to reason about but leaves money on the table. Dynamic routing by intent or difficulty reduces cost but adds complexity and needs good confidence thresholds.
- Failure modes: cascading retries across clients and servers when 429s happen. Fix with idempotency keys, per-tenant concurrency limits, and circuit breakers with fast fail.
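A per-tenant concurrency cap of the kind described above can be sketched like this; the class and parameter names are illustrative, not any particular gateway's API.

```python
import threading
from collections import defaultdict

class TenantLimiter:
    """Per-tenant concurrency caps so one noisy tenant cannot starve the rest."""

    def __init__(self, max_concurrent_per_tenant=4):
        self.max_concurrent = max_concurrent_per_tenant
        self.in_flight = defaultdict(int)
        self.lock = threading.Lock()

    def try_acquire(self, tenant_id):
        with self.lock:
            if self.in_flight[tenant_id] >= self.max_concurrent:
                return False  # caller should fast-fail with a 429, not queue
            self.in_flight[tenant_id] += 1
            return True

    def release(self, tenant_id):
        with self.lock:
            self.in_flight[tenant_id] -= 1
```

Returning `False` instead of blocking is the key design choice: queued requests under load are how cascades start.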
2) Prompt assembly with versioning
- Treat prompts as code. Store them with semantic versions, changelogs, and owners. Hash the assembled prompt into the request trace.
- Enforce budgets. Validate token counts before sending, trim gracefully, and fail closed if a hard cap is hit. If you do not cap, your worst case becomes your average case under load.
- Keep schema contracts explicit. If you expect JSON, use constrained decoding or JSON schema tools rather than regex band-aids.
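One way to combine versioning, budget enforcement, and trace hashing in prompt assembly; the template registry, the `rough_token_count` stand-in, and the cap value are illustrative assumptions, not a real tokenizer or prompt store.

```python
import hashlib

# Treat prompts as code: versioned entries; changelogs and owners live elsewhere.
PROMPT_TEMPLATES = {
    ("summarize", "1.2.0"): "Summarize the following for {audience}:\n{body}",
}

MAX_PROMPT_TOKENS = 2000

def rough_token_count(text):
    # Crude stand-in for a real tokenizer; good enough to enforce a hard cap.
    return max(1, len(text) // 4)

def assemble_prompt(name, version, **fields):
    template = PROMPT_TEMPLATES[(name, version)]
    prompt = template.format(**fields)
    if rough_token_count(prompt) > MAX_PROMPT_TOKENS:
        # Fail closed: a rejected request is cheaper than a runaway one.
        raise ValueError(f"prompt over budget: {name}@{version}")
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()[:12]
    trace = {"prompt_name": name, "prompt_version": version,
             "prompt_hash": prompt_hash}
    return prompt, trace
```

Hashing the assembled prompt into the trace is what lets you prove, two weeks later, exactly which prompt a bad answer came from.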
3) Retrieval plane that you can trust
- Ingestion pipeline: structured, versioned, observable. Log document counts, chunk counts, and failures per source. If counts drop, alert a human.
- Embedding registry: record model, dimension, and hyperparameters. Mismatched dimensions or model swaps without reindexing are common silent killers.
- Hybrid search: combine BM25 and vector scores. Add a minimal recall floor so you do not return empty contexts without noticing.
- Health checks: queries with known answers. If top-k does not include the control doc, page someone.
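Hybrid score fusion with a recall floor might look like the sketch below; the min-max normalization and `alpha` blend are one common choice, not the only one, and the scores are assumed to arrive pre-computed from your BM25 and vector backends.

```python
def hybrid_rank(bm25_scores, vector_scores, alpha=0.5, min_results=1):
    """Blend normalized lexical and vector scores; enforce a recall floor."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    b, v = normalize(bm25_scores), normalize(vector_scores)
    fused = {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
             for doc in set(b) | set(v)}
    ranked = sorted(fused, key=fused.get, reverse=True)
    if len(ranked) < min_results:
        # Never silently return an empty context; surface it instead.
        raise RuntimeError("retrieval recall floor violated")
    return ranked
```

The explicit exception is the "minimal recall floor" from the bullet above: an empty context should page someone, not produce a confident hallucination.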
4) Tooling orchestration
- Idempotent tools with request IDs. Non-idempotent effects plus LLM retries equal duplicated side effects and surprise bills.
- Timeouts and compensation. If a tool call exceeds its budget, short-circuit with a fallback response rather than blocking the whole chain.
- Version your function signatures. A schema tweak must not silently break the prompt that expects the old signature.
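A minimal sketch of idempotent, time-budgeted tool execution; the in-process `_completed` dict stands in for a shared idempotency store, and the thread-pool timeout is one of several ways to enforce a budget.

```python
import concurrent.futures

_completed = {}  # request_id -> result; a real system uses a shared store

def run_tool(request_id, fn, timeout_s=2.0, fallback="tool unavailable"):
    """Idempotent, time-budgeted tool execution with a fallback response."""
    if request_id in _completed:
        # A retry of a finished call returns the recorded result,
        # not a duplicated side effect.
        return _completed[request_id]
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        result = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        pool.shutdown(wait=False)
        return fallback  # short-circuit instead of blocking the whole chain
    pool.shutdown(wait=False)
    _completed[request_id] = result
    return result
```

Note the asymmetry: a timed-out call is *not* recorded as completed, so a later retry with the same request ID can still succeed.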
5) Output post-processing and safety
- Constrained outputs where feasible. For classification and extraction, use small models with strict formats. Save the big model for the fuzzy parts.
- Basic safety checks are just input validation for AI. Regex, schema validation, and whitelist filters catch a lot of nonsense before it leaks to users.
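Treating model output as untrusted input can be as simple as this sketch; the required fields are hypothetical, and a production system would likely use a JSON Schema validator rather than hand-rolled checks.

```python
import json

def validate_extraction(raw_output, required_fields=("name", "amount")):
    """Treat model output like untrusted input: parse, check shape, reject early."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # caller decides: retry, escalate, or fall back
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in required_fields):
        return None
    return data
```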
6) Telemetry and observability wired from day one
- Per-step traces with the same correlation ID: retrieval hits, prompt version, model name, tokens in and out, latency, retries, and cache events.
- Store a small sample of inputs, outputs, and retrieved passages for audit and eval. You will need them in postmortems.
- Dashboards you actually look at: p50/p95/p99, cost per request, cache hit rate, retrieval recall on canaries, and quality proxies.
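A minimal tracer that carries one correlation ID across steps, assuming an append-able sink in place of a real log pipeline; the field names follow the list above but are otherwise illustrative.

```python
import json
import time
import uuid

class Tracer:
    """Carries a single correlation_id across every step of one request."""

    def __init__(self, sink, correlation_id=None):
        self.sink = sink  # anything with .append; swap for a real log pipeline
        self.correlation_id = correlation_id or str(uuid.uuid4())

    def step(self, name, **fields):
        record = {"ts": time.time(), "correlation_id": self.correlation_id,
                  "step": name, **fields}
        self.sink.append(json.dumps(record, sort_keys=True))
        return record

# Usage: one tracer per request, one flat, greppable record per step.
sink = []
t = Tracer(sink)
t.step("retrieval", retrieval_doc_ids=["d1", "d7"], latency_ms=42, cache_hit=False)
t.step("llm_call", model_name="model-x", prompt_version="1.2.0",
       tokens_in=1480, tokens_out=512, latency_ms=900, retries=0)
```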
7) Caching where it matters
- Three layers that pay off fast: retrieval result cache, prompt assembly cache, and output cache for deterministic tasks.
- Cache keys must include what affects the output: user locale, prompt version, policy flags. If you forget a dimension, you serve the wrong answer fast.
- Tie cache invalidation to content updates. When a document changes, broadcast a purge event for its derived keys. TTL alone is lazy and expensive.
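A sketch of output caching with dimension-complete keys and purge events tied to source documents; the in-memory dicts stand in for a real cache and a pub/sub purge channel.

```python
import hashlib
from collections import defaultdict

cache = {}
derived_keys = defaultdict(set)  # doc_id -> cache keys derived from that doc

def cache_key(prompt_version, locale, policy_flags, normalized_query):
    """Every dimension that affects the output must be part of the key."""
    parts = [prompt_version, locale, ",".join(sorted(policy_flags)),
             normalized_query]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def put(key, value, source_doc_ids):
    cache[key] = value
    for doc_id in source_doc_ids:
        derived_keys[doc_id].add(key)

def purge_for_doc(doc_id):
    """Content update -> purge every cached answer derived from that document."""
    for key in derived_keys.pop(doc_id, set()):
        cache.pop(key, None)
```

Tracking which documents fed each cached answer is what makes event-driven invalidation possible; with TTL alone you either serve stale answers or pay for re-computation you did not need.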
8) Model strategy and routing
- Use small models for easy cases. Intent classification, simple transformations, and schema mapping are not 70B-parameter problems.
- Escalate by uncertainty, not by hope. If the small model’s confidence is low or guard checks fail, route to the bigger model.
- Shadow and canary by default. Ship to 5 percent of traffic with traces on. Diff responses against a control and watch the cost delta.
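Escalation by uncertainty reduces to a few lines once the models expose a confidence signal; the model callables, the guard, and the 0.8 floor here are illustrative assumptions.

```python
def guard_ok(answer):
    # Placeholder guard: real checks validate schema, length, banned content.
    return bool(answer) and len(answer) < 4000

def route(query, small_model, big_model, confidence_floor=0.8):
    """Try the cheap model first; escalate on low confidence or failed guards."""
    answer, confidence = small_model(query)
    if confidence >= confidence_floor and guard_ok(answer):
        return answer, "small"
    answer, _ = big_model(query)
    return answer, "big"
```

Logging which path each request took ("small" vs "big") gives you the deflection rate for free, which is the number that justifies the routing complexity.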
Failure modes to expect
- Retry storms: client retries plus server retries multiply cost and load. Fix with jittered backoff, concurrency caps, and end-to-end idempotency keys.
- Context explosion: a new product line doubles average retrieved tokens. Your monthly bill doubles overnight. Cap and summarize upstream.
- Quiet drift: new content types land in the index but your chunker was tuned for the old schema. Retrieval quality decays slowly and no one notices.
- Schema drift: a tool starts returning nulls in an optional field that your prompt assumed existed. Your evaluator never tested that path.
- Dead caches: a hot key grows until it no longer fits the cache’s size limits. Every request for it becomes a miss during peak hours.
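The retry-storm fix above leans on jittered backoff; here is a minimal full-jitter sketch, with the base and cap values as assumptions.

```python
import random

def backoff_delay(attempt, base_s=0.5, cap_s=30.0):
    """Full-jitter exponential backoff: spreads retries so clients do not
    synchronize into waves of load against a recovering service."""
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))
```

Full jitter (uniform over the whole window) rather than a small jitter on top of a fixed delay is what actually desynchronizes a fleet of retrying clients.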
Practical fixes you can ship next sprint
- Write a one-page “system contract” per AI feature. Include: target p95 latency, max tokens in and out, max cost per request, expected recall@k, and accepted failure policies.
- Build a golden set and wire it into CI. At least 200 representative queries with expected behaviors. Fail PRs that drop quality or blow past cost and latency budgets.
- Add a canonical trace schema. Every step logs: request_id, user_id hash, prompt_version, model_name, tokens_in, tokens_out, latency_ms, cache_hit, retrieval_doc_ids.
- Put a budget gate on prompt assembly. If tokens exceed N, trim with a clear policy or ask the router to escalate to a bigger context model. Do not hope it fits.
- Stabilize RAG: unit test your chunker, snapshot index stats on each build, and add a reindex job that runs deterministically with backfill metrics.
- Introduce a retrieval recall canary. 20 daily sampled queries with known answers. Alert if recall or MRR drops by a threshold.
- Install a small-model fast path. Use a 3B to 7B model or a specialized classifier for intent and easy answers. Track deflection rate and quality delta.
- Caching policy: normalize inputs, use content-addressable keys, and add a purge hook tied to data updates. Measure cost saved per cache.
- Operational hygiene: timeouts on every call, circuit breakers on model gateways, and per-tenant concurrency caps. Turn off implicit automatic retries in SDKs you do not control.
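The retrieval recall canary from the list above needs only two small functions; recall@k and MRR are computed over known-answer queries, and the alert floors are illustrative.

```python
def canary_metrics(results):
    """results: list of (ranked_doc_ids, expected_doc_id) pairs
    from daily sampled queries with known answers."""
    hits, rr_sum = 0, 0.0
    for ranked, expected in results:
        if expected in ranked:
            hits += 1
            rr_sum += 1.0 / (ranked.index(expected) + 1)
    n = len(results)
    return {"recall_at_k": hits / n, "mrr": rr_sum / n}

def should_page(metrics, recall_floor=0.9, mrr_floor=0.6):
    """Alert when either metric drops below its floor."""
    return metrics["recall_at_k"] < recall_floor or metrics["mrr"] < mrr_floor
```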
Business impact that leaders should care about
Cost: assume 100k requests per day. Average 1.5k input tokens and 500 output tokens. If your provider charges roughly $X per 1k input tokens and $Y per 1k output tokens, your daily cost is about 100k * (1.5 * X + 0.5 * Y). If your average context doubles because retrieval is sloppy, your cost doubles. A 25 percent cache hit rate on stable prompts usually pays for itself in week one. Retry storms can add 30 to 70 percent to spend without improving outcomes.
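That cost arithmetic is worth making executable; the per-1k rates below are purely hypothetical placeholders for the $X and $Y above.

```python
def daily_cost(requests, avg_in_tokens, avg_out_tokens,
               in_rate_per_1k, out_rate_per_1k):
    """Daily spend in dollars, given per-1k-token rates."""
    return requests * (avg_in_tokens / 1000 * in_rate_per_1k
                       + avg_out_tokens / 1000 * out_rate_per_1k)

# Hypothetical rates, purely illustrative: $0.002 / 1k in, $0.006 / 1k out.
base = daily_cost(100_000, 1500, 500, 0.002, 0.006)
bloated = daily_cost(100_000, 3000, 500, 0.002, 0.006)  # context doubled
```

Doubling the average context doubles the input-token term directly; how much of the total bill that moves depends on your input/output rate split, which is exactly why the contract should cap both sides.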
Latency: users feel p95, not p50. One slow tool or a retrieval miss that triggers escalation to a larger model will drag your tail. Streaming partial tokens helps perceived latency but does not fix time-to-first-token if upstream is slow. Put timeouts before the model call, not after.
Scale risk: throughput scales with concurrency, but your real bottleneck is often the vector store or a shared rate limit at the gateway. I have seen teams add more workers while a single index partition was saturated at 70 percent CPU. You need partition-aware routing and backpressure sooner than you think.
Quality: most teams overpay for bigger models when the fix is better retrieval and guardrails. A rigorous RAG pipeline plus a modest model often beats a huge model with naive context stuffing. The curve is not linear; past a point you buy very little quality for a lot more cost and latency.
Vendor risk: if your stack assumes a single provider’s JSON format or tool capability, you cannot fail over when quotas hit. Abstract the call, record provider metadata, and keep a minimal compatible schema so you can switch.
Key takeaways
- Design the request flow like a distributed system, not a fancy autocomplete.
- Put explicit budgets on tokens, latency, and cost. Enforce them in code.
- Version prompts, tools, and retrieval configs. Hash them into traces.
- Build a real retrieval pipeline with health checks and canaries.
- Cache where it counts and invalidate intentionally, not with wishful TTLs.
- Route by difficulty and confidence. Use small models for the easy 60 percent.
- Wire evals and drift checks into CI and production dashboards.
- Add timeouts, circuit breakers, and idempotency across the chain.
If you want help
If this sounds familiar, you are not alone. I help teams put system design around LLM features, get costs predictable, and make behavior stable under load. If you are seeing tail latency spikes, mystery bills, or quality drift, this is exactly the kind of thing I fix when systems start breaking at scale.

