The recurring smell
Most AI outages I get called into are not model problems. They are architecture problems disguised as model issues. Latency spikes, random failures, wrong answers, costs drifting 3x in a quarter. Everyone blames the LLM. Nine times out of ten, it is control planes with no guardrails, retrieval that drifts over time, or retry storms you created yourself.
I have seen teams spend weeks swapping models while a single missing backpressure valve kept their production p95 at 15 seconds.
Where this shows up and why
- RAG systems that worked in a demo, then collapse when the index grows 10x
- Agents that loop tools and eat your monthly budget in a weekend
- Multi-tenant apps where one noisy customer takes everyone down
- Model upgrades that silently degrade quality because eval is weak or missing
Why it happens:
- PoC code becomes production by inertia
- LLM treated as a pure function, but the workload is stateful and spiky
- Over-trusting a vector database to solve retrieval without governance
- No observability that ties cost, latency, and content back to a specific query and version
What most teams misunderstand:
- More context is not better. It is slower, costlier, and often reduces accuracy
- Fine-tuning is not a silver bullet. Data quality and orchestration usually matter more
- “We log prompts” is not observability. You need structured traces with lineage and versioning
Deep dive into the mistakes
1) Stateless thinking in a stateful workflow
Symptoms: long chats with ballooning context, tool-call loops, inconsistent answers between turns.
Why: requests are stateless, but the task is not. Without session policy, token budgets, and memory discipline, you create unbounded context.
Failure mode: p95 latency and cost creep up over time. Accuracy drops as the prompt becomes noise.
2) Retrieval by vibes
Symptoms: top k=5 stuffed into the prompt, irrelevant chunks, month-old indexes, and silent schema drift.
Why: teams skip document normalization, chunk strategy, and metadata governance. They never version embeddings or indexes.
Failure mode: evaluation looks fine on day 1, degradation begins on day 30, support tickets pile up by day 90.
3) No backpressure or circuit breakers
Symptoms: provider 429, cascading retries, timeouts that fan out to every dependency.
Why: direct synchronous calls to LLMs with optimistic concurrency. No queues, no rate limiters, no budget caps.
Failure mode: a single spike or a minor provider incident takes down your entire path.
4) Over-abstracted orchestration
Symptoms: a pretty graph of nodes that hides real costs and token paths. Hard to debug, impossible to tune.
Why: library-first design. Business logic gets buried under a flow framework with magic retries and hidden prompts.
Failure mode: fragmented prompts, duplicate tool calls, and ghost retries you cannot turn off.
5) Caching the wrong thing
Symptoms: 2 percent cache hit rate and negligible savings.
Why: caching full completions with variable headers or timestamps. No normalization, no tiered caches, no pre-LLM caching.
Failure mode: you pay for the same retrieval, rerank, and system prompt again and again.
6) Premature fine-tuning
Symptoms: expensive training cycles to fix what is really prompt discipline or retrieval quality.
Why: fine-tuning feels like control. The issue was data freshness, schema, or guardrails.
Failure mode: model lock-in, infra sprawl, and no measurable uplift.
7) Weak multi-tenant isolation
Symptoms: one large customer burns through RPS and tokens, everyone else slows down.
Why: global limits only. No per-tenant concurrency, no per-tenant cost caps, no shard-aware caches.
Failure mode: noisy neighbor plus confusing bills.
8) Thin evaluation and no change management
Symptoms: model upgrade ships. Support volume jumps. Nobody knows why.
Why: narrow golden sets, no hallucination checks, no adversarial prompts, no regression gates, no shadowing.
Failure mode: quality whiplash on every dependency change across model, embeddings, or index.
9) Missing observability where it matters
Symptoms: logs everywhere, insight nowhere.
Why: unstructured logs, no request IDs, no token or cost annotation, no linkage from final answer to retrieved docs and versions.
Failure mode: you cannot reproduce or defend a single output.
Practical fixes that hold up
Put the control plane first
- Per-tenant quotas: RPS, concurrent calls, and token budgets
- Cost guards: max tokens per request, per flow, and per session
- Circuit breakers per provider and per model family, with clear fallback chains
- Backpressure: queue spikes, not people. Use priority queues for human-in-the-loop paths
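
To make the circuit breaker and fallback chain idea concrete, here is a minimal sketch in Python. It is not tied to any provider SDK: `CircuitBreaker`, `CircuitOpenError`, and `call_with_fallback` are hypothetical names, the thresholds are placeholders, and the class is not thread-safe. It opens after a run of consecutive failures, rejects calls during a cooldown, then lets one probe call through.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a probe call after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("provider circuit is open")
            # Cooldown elapsed: half-open, let this call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe re-opens the circuit and restarts the cooldown.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def call_with_fallback(chain, request):
    """Try each (breaker, provider_fn) pair in order; open breakers fail fast."""
    last_error = None
    for breaker, provider_fn in chain:
        try:
            return breaker.call(provider_fn, request)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```

One breaker per provider or model family keeps an incident on one path from tripping the others. In a real system you would likely count only retryable errors (429s, timeouts) toward the threshold.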
Treat context as a scarce resource
- Hard token budgets per step. Force summarization or truncation by role, in priority order system > instructions > user > retrieved, so retrieved text is cut first and the system prompt last
- Dedup retrieval by document ID and section. Do not paste the same paragraph twice
- Rerank before generate. Rerankers are cheap. Generations are not
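
A rough sketch of budget enforcement with deduplication, assuming retrieved chunks arrive already reranked best-first. `Part`, `PRIORITY`, `fit_to_budget`, and `approx_tokens` are illustrative names, and the word-count tokenizer is a crude stand-in for whatever tokenizer your model uses. This version drops whole parts that do not fit; summarizing them instead is a reasonable variation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Lower number = higher priority = dropped last.
PRIORITY = {"system": 0, "instructions": 1, "user": 2, "retrieved": 3}


@dataclass(frozen=True)
class Part:
    role: str
    text: str
    doc_id: str = ""
    section: str = ""


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(parts: List[Part], budget: int,
                  count: Callable[[str], int] = approx_tokens) -> List[Part]:
    # 1) Drop retrieved chunks that repeat a (doc_id, section) already seen.
    seen = set()
    deduped = []
    for p in parts:
        if p.role == "retrieved":
            key = (p.doc_id, p.section)
            if key in seen:
                continue
            seen.add(key)
        deduped.append(p)

    # 2) Admit parts in priority order until the budget is spent.
    #    sorted() is stable, so chunks keep their rerank order within a role.
    kept = set()
    remaining = budget
    for i in sorted(range(len(deduped)), key=lambda i: PRIORITY[deduped[i].role]):
        cost = count(deduped[i].text)
        if cost <= remaining:
            kept.add(i)
            remaining -= cost

    # 3) Emit the survivors in their original prompt order.
    return [p for i, p in enumerate(deduped) if i in kept]
```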
Make retrieval boring and reliable
- Normalize and chunk by semantic units, not only by fixed windows
- Attach tight metadata: source, section, version, timestamp, permissions
- Version embeddings and indexes. Store embedding model name, dim, and creation time
- Refresh policies: incremental builds daily, full rebuilds weekly or on schema change
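
One way to version an index is to store a small manifest next to it and refuse queries embedded with a different model. The sketch below is an assumption about shape, not a standard: `IndexManifest`, its fields, `new_manifest`, and `assert_compatible` are hypothetical, and the model names in any usage are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IndexManifest:
    embedding_model: str
    embedding_dim: int
    chunker_version: str
    schema_version: str
    created_at: str  # ISO 8601, UTC

    def config_fingerprint(self) -> str:
        """Short hash of everything except build time; changes when the config changes."""
        config = {k: v for k, v in asdict(self).items() if k != "created_at"}
        payload = json.dumps(config, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def new_manifest(embedding_model: str, embedding_dim: int,
                 chunker_version: str, schema_version: str) -> IndexManifest:
    return IndexManifest(
        embedding_model=embedding_model,
        embedding_dim=embedding_dim,
        chunker_version=chunker_version,
        schema_version=schema_version,
        created_at=datetime.now(timezone.utc).isoformat(),
    )


def assert_compatible(manifest: IndexManifest, query_model: str, query_dim: int) -> None:
    """Fail loudly instead of silently searching with mismatched embeddings."""
    if (manifest.embedding_model, manifest.embedding_dim) != (query_model, query_dim):
        raise ValueError(
            f"index built with {manifest.embedding_model}/{manifest.embedding_dim}, "
            f"query embedded with {query_model}/{query_dim}"
        )
```

The fingerprint is also a convenient component for cache keys and trace records, since it changes whenever the embedding model, chunker, or schema does.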
Cache where it matters
- Tier 1: pre-LLM cache for deterministic steps, like retrieval results and rerank outputs keyed by normalized query and index version
- Tier 2: prompt template plus normalized variables for short deterministic generations
- Normalize inputs: lowercase, strip timestamps, sort keys
- Track hit rate and dollar savings per cache layer
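
A minimal sketch of a normalized cache key for the pre-LLM tier, using only the standard library. `normalize_query` and `cache_key` are hypothetical helpers, and the timestamp pattern only covers ISO-style timestamps. Lowercasing is an assumption that suits natural-language queries; it is wrong for case-sensitive inputs such as code or identifiers.

```python
import hashlib
import json
import re

# Matches ISO-style timestamps such as 2024-05-01T10:00:00Z or 2024-05-01 10:00:00+02:00.
_ISO_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)
_WHITESPACE = re.compile(r"\s+")


def normalize_query(text: str) -> str:
    """Strip timestamps, collapse whitespace, and lowercase."""
    text = _ISO_TIMESTAMP.sub("", text)
    text = _WHITESPACE.sub(" ", text).strip()
    return text.lower()


def cache_key(query: str, params: dict, index_version: str) -> str:
    """Stable key: same normalized query + params + index version gives the same hash."""
    payload = json.dumps(
        {"q": normalize_query(query), "params": params, "index": index_version},
        sort_keys=True,          # key order in params must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the index version in the key means a rebuild invalidates stale retrieval results without a manual flush.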
Observability that survives audits
- Structured tracing with a single correlation ID per request and per session
- Log token counts, latency, model name, prompt hash, and exact retrieved document IDs with versions
- Capture final answer plus intermediate tool outputs and decisions
- Redact PII at the edge. Encrypt traces at rest. Keep a retention policy
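
As an illustration, a trace record along these lines could be emitted as one JSON line per model call. The field names are assumptions rather than any tracing standard, and PII redaction and encryption are deliberately left out of the sketch.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, List


def prompt_hash(rendered_prompt: str) -> str:
    """Short, stable hash so identical prompts can be grouped without storing raw text."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()[:16]


@dataclass
class LLMTrace:
    session_id: str
    model: str
    prompt_version: str
    prompt_hash: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    # Exact lineage: which documents, at which versions, fed this answer.
    retrieved: List[Dict[str, str]]
    tool_calls: List[Dict[str, str]] = field(default_factory=list)
    # One correlation ID per request; pass it to every downstream call.
    correlation_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json_line(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Cost can be derived later from the token counts and a price table keyed by model, which avoids rewriting traces when prices change.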
Evaluation with teeth
- Golden sets by scenario, not just random samples. Include long input, adversarial input, known tricky entities
- Mix judges: rule checks, LLM judges, and task-specific metrics
- Shadow deploy and canary any change to model, embedding, index, or prompt
- Require non-regression thresholds on quality and latency to ship
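
A release gate can be as small as a function that compares per-scenario results against a baseline and returns the reasons to block. The sketch below assumes a quality score between 0 and 1 where higher is better; `ScenarioResult`, `regression_failures`, and the default thresholds are illustrative, not recommendations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScenarioResult:
    quality: float          # mean score on this scenario's golden set, higher is better
    p95_latency_ms: float


def regression_failures(
    baseline: Dict[str, ScenarioResult],
    candidate: Dict[str, ScenarioResult],
    max_quality_drop: float = 0.01,
    max_latency_ratio: float = 1.10,
) -> List[str]:
    """Return human-readable reasons to block the release; an empty list means ship."""
    failures = []
    for name, base in sorted(baseline.items()):
        cand = candidate.get(name)
        if cand is None:
            # A scenario that silently disappears is itself a regression.
            failures.append(f"{name}: missing from candidate run")
            continue
        if cand.quality < base.quality - max_quality_drop:
            failures.append(f"{name}: quality {base.quality:.3f} -> {cand.quality:.3f}")
        if cand.p95_latency_ms > base.p95_latency_ms * max_latency_ratio:
            failures.append(
                f"{name}: p95 {base.p95_latency_ms:.0f}ms -> {cand.p95_latency_ms:.0f}ms"
            )
    return failures
```

Wired into CI, a non-empty result fails the build for any change to model, embedding, index, or prompt.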
Keep orchestration legible
- Prefer explicit code over magic DAGs for core logic
- One prompt per responsibility. Small, named, versioned
- Make tool calls idempotent, with timeouts and retries that you control
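
Here is what explicit, owned retry logic might look like. The tool signature (`payload`, `idempotency_key`, `timeout_s`) is an assumption made for this sketch; adapt it to whatever your tools accept. The caller generates the idempotency key once and reuses it on every attempt so the tool can deduplicate side effects.

```python
import random
import time
from typing import Any, Callable, Tuple, Type


def call_tool(
    tool: Callable[..., Any],
    payload: dict,
    idempotency_key: str,
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> Any:
    """Call a tool with a bounded number of retries and exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Same key on every attempt: the tool can recognize a repeat.
            return tool(payload, idempotency_key=idempotency_key, timeout_s=timeout_s)
        except retry_on:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            # Jitter spreads out retries so clients do not stampede together.
            sleep(delay + random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")
```

Only the exception types listed in `retry_on` are retried; anything else propagates on the first attempt, which keeps validation errors from turning into retry storms.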
Multi-tenant isolation as a feature
- Per-tenant concurrency and token rate limits
- Separate queues and caches for large tenants
- Cost attribution at the request level and monthly caps with alerts
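
A per-tenant token rate limit can be sketched as one token bucket per tenant. `TenantTokenLimiter` is a hypothetical in-process class that is not thread-safe. A multi-instance deployment would need shared state (for example in a datastore), which this sketch does not cover.

```python
import time
from typing import Callable, Dict, Tuple


class TenantTokenLimiter:
    """One token bucket per tenant: capacity = tokens per minute, refilled continuously."""

    def __init__(self, tokens_per_minute: Dict[str, int], default_tpm: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._tpm = dict(tokens_per_minute)
        self._default_tpm = default_tpm
        self._clock = clock
        # tenant -> (tokens currently available, time of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def try_consume(self, tenant: str, tokens: int) -> bool:
        """Reserve LLM tokens for a tenant; False means queue or reject the request."""
        capacity = float(self._tpm.get(tenant, self._default_tpm))
        now = self._clock()
        level, last = self._state.get(tenant, (capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        level = min(capacity, level + (now - last) * capacity / 60.0)
        if tokens > level:
            self._state[tenant] = (level, now)
            return False
        self._state[tenant] = (level - tokens, now)
        return True
```

Because each tenant has its own bucket, a large customer exhausting its budget has no effect on what anyone else can consume.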
Business impact in real numbers
- Latency: cutting unbounded context and adding rerankers typically drops p95 by 30 to 60 percent
- Cost: pre-LLM caching plus token budgets often reduces spend 25 to 50 percent. In one case, cache and rerank lowered generation spend from 42k to 19k per month
- Reliability: circuit breakers and backpressure take 429 error rates from bursts of 15 percent to under 1 percent during provider incidents
- Quality: retrieval governance and real eval stop the slow drift that drives support cost. Expect 20 to 40 percent reduction in escalations for RAG-heavy products
What to remember
- Put guardrails around tokens, not just requests
- Retrieval is a data system. Version it like one
- Backpressure and circuit breakers are mandatory if you call external models
- Cache earlier in the flow. Normalize everything
- Eval is a release gate, not a dashboard after the fact
- Observability needs lineage: model, prompt, retrieval, and costs tied to the same trace
If this sounds familiar
If you see rising p95, climbing token bills, or quality that drifts month to month, it is probably not a model problem. It is architecture. This is exactly the kind of work I help teams untangle when systems start breaking at scale. Happy to take a look at your traces, budgets, and retrieval pipeline and point out where the leaks are.

