The outage did not care about your 82% accuracy
Your eval showed 82% accuracy last week. PagerDuty still went off at 2:13 AM because:
- The vector DB had a 99th percentile spike and your RAG pipeline timed out.
- The model started returning invalid JSON after a silent server-side update.
- A tool call loop retried 4 times and exhausted the token budget, so the agent never reached the final step.
Accuracy matters. Users do not experience accuracy. They experience reliability. Does it answer the question, within the SLA, grounded in the right data, every single time? That is a system design problem, not a model benchmark problem.
Where teams get burned
- Customer support assistants that pass offline QA but hallucinate product names during a catalog rollout because the index lagged by 48 minutes.
- Report generators that work on staging but fail in prod when the LLM emits a trailing comma and the JSON parser explodes.
- Agents that ace a demo, then deadlock in a rare tool error path. The replay shows the orchestrator happily retrying the same broken function with the same arguments.
Why it happens:
- Benchmarks optimize for average case. Reliability is about tails. p95 and p99 behavior is where production lives.
- Nondeterminism everywhere. Model sampling, retrieval variance, network jitter, vendor model updates, schema drift.
- Missing contracts. Inputs, outputs, timeouts, and error budgets are hand-waved behind prompt magic.
What teams misunderstand:
- Improving model accuracy rarely fixes a system with poor failure handling. You just get wrong answers faster.
- RAG is not a silver bullet. If retrieval quality and freshness are not monitored, grounding claims are theater.
- LLM-judges do not replace deterministic checks. Use them to triage, not to authorize critical actions.
Technical deep dive: reliability is an architecture property
Design the system around SLOs and error budgets, not just eval scores.
Define capabilities and SLOs
For each user-facing capability, write the contract in measurable terms:
- Task success rate: for example, a grounded answer with at least two citations from the approved corpus, at a 99.2% weekly SLO.
- Latency: p50 under 1.5s, p95 under 4s. Set budgets per step and end-to-end.
- Freshness: indexed content lag under 5 minutes for tier A docs.
- Structured output validity: JSON schema valid 99.9%.
Tie these to error budgets. If you burn 50% of budget by Wednesday, you slow feature rollouts or reduce exploration temperature.
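The budget-burn rule above reduces to simple arithmetic. A minimal sketch, using hypothetical weekly numbers (the 50,000 requests and 220 failures are illustrative, not from any real system):

```python
def budget_burned(slo: float, total: int, failures: int) -> float:
    """Fraction of the error budget consumed in the current window.

    slo: target success rate, e.g. 0.992 for a 99.2% weekly SLO.
    total: requests observed so far this window.
    failures: requests that missed the contract.
    """
    allowed = (1.0 - slo) * total   # failures the budget permits this window
    if allowed == 0:
        return float("inf") if failures else 0.0
    return failures / allowed

# Hypothetical mid-week check: 50,000 requests, 220 failures, 99.2% SLO.
# The budget allows 400 failures, so 55% is already burned.
burn = budget_burned(slo=0.992, total=50_000, failures=220)
assert abs(burn - 0.55) < 1e-6   # past 50% by Wednesday -> slow rollouts
```

Wiring this number into release gates is what makes the SLO a contract rather than a dashboard decoration.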
Orchestrate with a state machine, not spaghetti
- Make steps explicit: retrieve, synthesize, verify, format, finalize.
- Per step timeouts, backoff with jitter, and idempotency keys for tool calls.
- Circuit breakers for external providers. If a model or DB is flaking, short circuit to a fallback or degrade mode.
- Concurrency control. Limit parallel tools to avoid thundering herds on shared backends.
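The orchestration rules above can be sketched as a small explicit pipeline. This is an illustrative skeleton, not a production framework: step names, the `degrade` handler, and the toy `retrieve`/`synthesize` functions are all assumptions for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # takes and returns the flow context
    timeout_s: float              # per-step latency budget
    max_retries: int = 1

@dataclass
class FlowResult:
    status: str                   # "ok" or "degraded"
    context: dict
    trace: list = field(default_factory=list)

def run_flow(steps, context, degrade):
    """Run explicit steps (retrieve, synthesize, verify, format, finalize).
    Each step has its own timeout and retry cap; when a step exhausts its
    retries, the flow short-circuits to the degrade handler instead of
    looping forever."""
    trace = []
    for step in steps:
        for attempt in range(step.max_retries + 1):
            start = time.monotonic()
            try:
                context = step.run(context)
                if time.monotonic() - start > step.timeout_s:
                    raise TimeoutError(f"{step.name} exceeded {step.timeout_s}s")
                trace.append((step.name, attempt, "ok"))
                break
            except Exception as exc:
                trace.append((step.name, attempt, repr(exc)))
        else:
            # retries exhausted: degrade explicitly rather than fail opaquely
            return FlowResult("degraded", degrade(context), trace)
    return FlowResult("ok", context, trace)

# Toy steps standing in for real retrieval and synthesis:
def retrieve(ctx):
    return {**ctx, "chunks": ["doc-1", "doc-2"]}

def synthesize(ctx):
    return {**ctx, "answer": f"grounded in {len(ctx['chunks'])} chunks"}

steps = [Step("retrieve", retrieve, timeout_s=2.0),
         Step("synthesize", synthesize, timeout_s=5.0)]
result = run_flow(steps, {}, degrade=lambda ctx: {**ctx, "answer": "abstain"})
assert result.status == "ok" and "answer" in result.context
```

The point is that every transition, timeout, and fallback is visible in the trace, which is exactly what spaghetti prompt-chaining hides.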
Control nondeterminism
- Use temperature, top_p, and grammar-constrained decoding for structured outputs. Prefer grammar or JSON schema guided decoding over fragile regex fixes.
- Vendor updates introduce drift. Pin model versions where possible. Shadow test new versions with traffic sampling before promotion.
- Cache at the right layers: retrieval results, tool results, and final answers. Use a semantic cache only if you key on prompt and context hashes, so stale entries cannot leak across updates.
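The caching rule above comes down to what goes into the key. A minimal sketch, where the prompt version label and cached answer are hypothetical:

```python
import hashlib
import json

def cache_key(prompt_version: str, prompt: str, context_chunks: list) -> str:
    """Key answers by a hash of the prompt version, prompt text, and the
    exact retrieved context. If any input changes, the key changes, so a
    stale cached answer can never be served against fresh context."""
    payload = json.dumps(
        {"v": prompt_version, "p": prompt, "ctx": context_chunks},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

answer_cache = {}
key = cache_key("prompt-v7", "What is the refund window?", ["Refunds: 30 days."])
answer_cache[key] = "30 days, per the refund policy."

# Same inputs -> same key -> cache hit.
assert cache_key("prompt-v7", "What is the refund window?",
                 ["Refunds: 30 days."]) in answer_cache
# Context changed (index refreshed) -> different key -> miss, forcing recompute.
assert cache_key("prompt-v7", "What is the refund window?",
                 ["Refunds: 14 days."]) not in answer_cache
```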
Retrieval is a system inside your system
Common RAG failure modes I keep seeing:
- Embedding drift when you re-embed queries but not documents. Instant recall degradation.
- Chunking pathologies. Splitting tables and code blocks kills answerability.
- Index staleness and backfill gaps. Your metrics see traffic, not coverage.
- Top-k too low or uncalibrated per query type. One size does not fit all.
Controls that help:
- Dual encoders or query rewriting for known query classes. Do not trust a single default prompt.
- Freshness monitors: canary docs re-queried every minute with alerts if missing.
- A small set of golden queries with expected citations. Run them continuously in prod.
- Automatic top-k tuning based on retrieval confidence scores. If confidence is low, expand k or abstain.
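The expand-or-abstain control above fits in a few lines. A sketch under assumptions: `search(query, k)` is a stand-in returning `(chunks, confidence)`, and the thresholds and doubling schedule are illustrative defaults to tune per query class.

```python
def retrieve_with_expansion(query, search, k0=5, k_max=40, min_confidence=0.6):
    """Expand top-k when retrieval confidence is low; abstain past k_max.

    search(query, k) is assumed to return (chunks, confidence), where
    confidence is your own signal, e.g. top-1 similarity or coverage.
    """
    k = k0
    while k <= k_max:
        chunks, confidence = search(query, k)
        if confidence >= min_confidence:
            return chunks
        k *= 2  # widen the net for hard queries
    return None  # abstain: better than answering ungrounded

# Mock search whose confidence only clears the bar at a wider k:
def mock_search(query, k):
    return [f"chunk-{i}" for i in range(k)], (0.8 if k >= 20 else 0.3)

chunks = retrieve_with_expansion("rare edge-case query", mock_search)
assert len(chunks) == 20   # expanded 5 -> 10 -> 20 before passing
assert retrieve_with_expansion("q", lambda q, k: ([], 0.1)) is None
```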
Tooling and structured I/O contracts
- Make a JSON schema the source of truth for tool arguments and model outputs. Enforce at runtime. Reject early with helpful retries.
- Constrained decoding, or function calling with a strict schema, reduces invalid output by an order of magnitude in most stacks.
- Use a normalizer layer. Strip trailing commas, fix common date formats, clamp numbers to expected ranges. Fail closed if normalization cannot repair.
- In one audit, 17% of failures were just malformed JSON because of a newline in a string field. Fixing decoding removed the top incident class in 24 hours.
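A normalizer of the kind described above is deliberately boring: repair the known defect classes, then fail closed. A sketch handling two common cases (trailing commas and code-fence wrappers); the defect list is illustrative, not exhaustive.

```python
import json
import re

def normalize_json(raw: str):
    """Repair common model-output defects, then parse; fail closed.
    Anything the normalizer cannot repair returns None rather than a guess,
    so the caller can escalate or return a typed error."""
    text = raw.strip()
    # Strip markdown code fences that models sometimes wrap around JSON.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Remove trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # fail closed

assert normalize_json('{"total": 42,}') == {"total": 42}
assert normalize_json('```json\n{"ok": true}\n```') == {"ok": True}
assert normalize_json('{"broken": ') is None
```

Fixing these failures at decode time (constrained decoding) is still preferable; the normalizer is the backstop, not the plan.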
Observability for AI, not just microservices
- Trace every step with a correlation ID across retrieval, model calls, and tools. Include prompt version, input hash, and feature flags.
- Log minimal but sufficient context. Token dumps are a privacy trap. Hash PII, store references, not bodies.
- Build first-class metrics: grounded answer rate, abstain rate, schema validity rate, tool success rate, retrieval coverage, and budget burn down.
- Keep a replay harness. Given an event, you should be able to rerun the flow deterministically with pinned versions and seeds.
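The trace shape above is mostly discipline about fields. A minimal sketch; the step names, prompt version label, and metric fields are hypothetical examples of what a team might record.

```python
import time
import uuid

def trace_event(correlation_id, step, prompt_version, input_hash, **fields):
    """One structured event per step, all tied to one correlation ID so a
    single request can be followed across retrieval, model calls, and tools."""
    return {
        "correlation_id": correlation_id,
        "step": step,
        "prompt_version": prompt_version,
        "input_hash": input_hash,   # a hash, never the raw body (PII trap)
        "ts": time.time(),
        **fields,
    }

cid = str(uuid.uuid4())
events = [
    trace_event(cid, "retrieve", "prompt-v7", "sha256:ab12f0",
                top_k=8, coverage=0.91),
    trace_event(cid, "model_call", "prompt-v7", "sha256:ab12f0",
                schema_valid=True),
    trace_event(cid, "tool:refund_lookup", "prompt-v7", "sha256:ab12f0",
                success=True),
]
# Every event in the request shares the same correlation ID.
assert all(e["correlation_id"] == cid for e in events)
```

With this in place, the first-class metrics listed above (schema validity rate, tool success rate, coverage) are just aggregations over these events.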
Practical solutions that hold up in production
1) Choose a reliability strategy, not just a model
- Two-tier inference: start with a fast model, escalate to a stronger model when confidence is low or when the user is high value. Confidence can come from retrieval coverage, self-checkers, or simple heuristics like missing citations.
- Degrade modes: if the vector DB is unhealthy, fall back to keyword search plus tighter prompts, or return a graceful abstain with next steps. A confidently wrong answer is more expensive than an honest abstain.
- Provider failover: keep at least two vendors integrated. Health check them and route using circuit breakers. Do not flip vendors without shadow tests.
Trade-offs:
- Escalation increases tail latency. Budget it. Users forgive 4s once in a while if answers are correct and consistent.
- Degrade modes reduce conversion if overused. Tune thresholds and watch your error budget.
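The two-tier routing above is a small function once the confidence signal exists. A sketch under assumptions: both models are callables, and the citation-count heuristic standing in for confidence is one of the signals the text mentions, not the only option.

```python
def answer(query, fast_model, strong_model, confidence_of, threshold=0.75):
    """Two-tier inference: try the fast model first; escalate only when the
    confidence signal (retrieval coverage, a self-check, missing citations)
    falls below the threshold."""
    draft = fast_model(query)
    if confidence_of(draft) >= threshold:
        return draft, "fast"
    return strong_model(query), "escalated"

# Toy models and a missing-citations heuristic:
def fast(q):
    return {"answer": "Maybe 30 days?", "citations": []}

def strong(q):
    return {"answer": "30 days.", "citations": ["doc-101", "doc-205"]}

def confidence(draft):
    return 1.0 if len(draft["citations"]) >= 2 else 0.0

result, tier = answer("refund window?", fast, strong, confidence)
assert tier == "escalated" and len(result["citations"]) == 2
```

The threshold is where the tail-latency trade-off lives: raise it and more traffic escalates, lower it and more weak answers ship.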
2) Guard your structured outputs
- Use a grammar or JSON schema with constrained decoding. This alone often moves JSON validity from 97% to 99.9%.
- Add a lightweight validator and normalizer. Retry once with a repair hint. If still invalid, escalate or return a typed error the client can handle.
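The retry-with-repair-hint loop above, sketched. The `model` and `validate` callables are stand-ins; the single-repair cap mirrors the "retry once" rule in the bullet.

```python
import json

def call_with_repair(model, prompt, validate, max_repairs=1):
    """Validate the model's JSON against the contract; on failure, retry
    once with the validator's error as a repair hint. If still invalid,
    return a typed error the client can handle instead of looping."""
    raw = model(prompt)
    error = "no output"
    for attempt in range(max_repairs + 1):
        try:
            obj = json.loads(raw)
            error = validate(obj)        # returns None when valid
            if error is None:
                return {"ok": True, "data": obj}
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        if attempt < max_repairs:
            raw = model(f"{prompt}\n\nYour last output was rejected: {error}\n"
                        "Return only corrected JSON.")
    return {"ok": False, "error": error}

# Fake model that emits broken JSON first, then a corrected payload:
calls = {"n": 0}
def flaky_model(prompt):
    calls["n"] += 1
    return '{"amount": }' if calls["n"] == 1 else '{"amount": 12.5}'

result = call_with_repair(
    flaky_model, "Extract the amount as JSON.",
    validate=lambda o: None if "amount" in o else "missing amount")
assert result == {"ok": True, "data": {"amount": 12.5}}
```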
3) Make retrieval prove its work
- Require citations and verify that they come from the allowed corpus. Reject if missing or off-domain.
- Compute a grounding score. If below threshold, either expand retrieval or abstain.
- Refresh embeddings and indexes in lockstep with a version tag. Traffic routes by version until parity is confirmed.
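The citation gate above can be enforced mechanically. A sketch; the corpus IDs and the strict "every citation in-corpus" policy are illustrative choices, and the score could instead feed the expand-retrieval path.

```python
def grounding_check(citations, approved_corpus_ids, min_citations=2):
    """Reject answers whose citations are missing or off-corpus.
    Returns (grounding_score, verdict), where the score is the fraction
    of citations drawn from the approved corpus."""
    if len(citations) < min_citations:
        return 0.0, "abstain: too few citations"
    in_corpus = [c for c in citations if c in approved_corpus_ids]
    score = len(in_corpus) / len(citations)
    if score < 1.0:
        return score, "reject: off-corpus citation"
    return score, "ok"

approved = {"doc-101", "doc-205", "doc-318"}
assert grounding_check(["doc-101", "doc-205"], approved) == (1.0, "ok")
assert grounding_check(["doc-101"], approved)[1].startswith("abstain")
assert grounding_check(["doc-101", "blog-x"], approved)[1].startswith("reject")
```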
4) Version everything that changes outcomes
- Prompts, tools, retrievers, chunkers, ranking rules, and models. Attach versions to traces and experiments.
- Use feature flags for rollout. Percent based, per tenant if needed. Roll forward or back instantly.
5) Test in the tails
- Chaos tests: kill the vector DB, raise rate limits, inject invalid tool responses. Observe degrade modes and p99s.
- Golden tasks: production queries that matter. Run continuously and alert on failures within minutes, not days.
- Shadow traffic: new model or prompt runs in parallel and is scored without user impact.
6) Build the right retry posture
- Retries hide incidents and burn tokens. Cap them. Use exponential backoff with jitter. Make retries idempotent with request keys.
- Prefer diversify-then-vote over blind retries. If you must retry, vary the prompt or tool choice.
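The retry posture above, sketched with full-jitter backoff and an idempotency key. The `charge` operation and its key are hypothetical; in a real system the server dedupes on the key so a repeated call cannot double-apply the side effect.

```python
import random
import time

def retry_with_backoff(op, idempotency_key, max_attempts=3,
                       base_s=0.5, cap_s=8.0):
    """Capped retries with full-jitter exponential backoff. The same
    idempotency key is passed on every attempt so retries are safe."""
    for attempt in range(max_attempts):
        try:
            return op(idempotency_key)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # cap reached: surface the failure, do not hide it
            # Full jitter: sleep a random duration in [0, min(cap, base * 2^n)].
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))

# Operation that fails twice, then succeeds:
state = {"attempts": 0, "applied": set()}
def charge(key):
    state["attempts"] += 1
    if state["attempts"] < 3:
        raise ConnectionError("transient")
    state["applied"].add(key)
    return "charged"

assert retry_with_backoff(charge, "req-42", base_s=0.001) == "charged"
assert state["attempts"] == 3 and "req-42" in state["applied"]
```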
7) Decide when to be deterministic
- Compliance and billing flows should minimize sampling. Use top_p near zero, or models with deterministic decoding. Consider small, fine-tuned models for narrow tasks where you can guarantee schema adherence.
- Creative tasks can be looser but still bounded by contracts.
Business impact that leadership actually feels
- Support cost: a single hallucinated refund policy can generate a queue of tickets. Reliability cuts these at the source.
- Conversion: consistent answers drive trust. In one B2B workflow, moving grounded answer rate from 96.8% to 99.2% increased task completion by 7% with no model change, only orchestration fixes.
- Cost control: fewer retries, fewer escalations, and better caching reduce token burn by 20 to 40% in most stacks I tune.
- Vendor risk: dual-vendor setups with circuit breakers turn vendor incidents from outages into minor blips. This matters during quarter-end crunches.
- Scaling: if p95 is unstable at 1x traffic, expect a cliff at 3x. Reliability work flattens that curve.
Key takeaways
- Reliability is an end-to-end contract, not a model metric.
- Design around SLOs, error budgets, and degrade modes.
- Constrain and validate structured outputs with real schemas.
- Treat retrieval as a product with its own SLOs and monitors.
- Version everything that can change outcomes and trace it.
- Retries are not a strategy. Escalate smartly or abstain.
- Shadow and canary before you flip anything in prod.
If this is hitting close to home
If your stack passes offline evals but breaks in the tails, you are not alone. I help teams put reliability into the architecture so the fire drills stop and the business metrics move. If you want a second set of eyes on your orchestration, retrieval, or SLO design, this is exactly the kind of work I do when systems start breaking at scale.

