{"id":64,"date":"2025-03-18T10:42:17","date_gmt":"2025-03-18T10:42:17","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/03\/18\/designing-ai-systems-for-reliability-not-just-accuracy\/"},"modified":"2025-03-18T10:42:17","modified_gmt":"2025-03-18T10:42:17","slug":"designing-ai-systems-for-reliability-not-just-accuracy","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/03\/18\/designing-ai-systems-for-reliability-not-just-accuracy\/","title":{"rendered":"Stop chasing model accuracy. Design for reliability."},"content":{"rendered":"<h2>The outage did not care about your 82% accuracy<\/h2>\n<p>Your eval showed 82% accuracy last week. PagerDuty still went off at 2:13 AM because:<\/p>\n<ul>\n<li>The vector DB had a 99th percentile spike and your RAG pipeline timed out.<\/li>\n<li>The model started returning invalid JSON after a silent server-side update.<\/li>\n<li>A tool call loop retried 4 times and exhausted the token budget, so the agent never reached the final step.<\/li>\n<\/ul>\n<p>Accuracy matters. Users do not experience accuracy. They experience reliability. Does it answer the question, within the SLA, grounded in the right data, every single time? That is a system design problem, not a model benchmark problem.<\/p>\n<h2>Where teams get burned<\/h2>\n<ul>\n<li>Customer support assistants that pass offline QA but hallucinate product names during a catalog rollout because the index lagged 48 minutes.<\/li>\n<li>Report generators that work on staging but fail in prod when the LLM emits a trailing comma and the JSON parser explodes.<\/li>\n<li>Agents that ace a demo, then deadlock in a rare tool error path. The replay shows the orchestrator happily retrying the same broken function with the same arguments.<\/li>\n<\/ul>\n<p>Why it happens:<\/p>\n<ul>\n<li>Benchmarks optimize for average case. Reliability is about tails. p95 and p99 behavior is where production lives.<\/li>\n<li>Nondeterminism everywhere. 
Model sampling, retrieval variance, network jitter, vendor model updates, schema drift.<\/li>\n<li>Missing contracts. Inputs, outputs, timeouts, and error budgets are hand-waved behind prompt magic.<\/li>\n<\/ul>\n<p>What teams misunderstand:<\/p>\n<ul>\n<li>Improving model accuracy rarely fixes a system with poor failure handling. You just get wrong answers faster.<\/li>\n<li>RAG is not a silver bullet. If retrieval quality and freshness are not monitored, grounding claims are theater.<\/li>\n<li>LLM judges do not replace deterministic checks. Use them to triage, not to authorize critical actions.<\/li>\n<\/ul>\n<h2>Technical deep dive: reliability is an architecture property<\/h2>\n<p>Design the system around SLOs and error budgets, not just eval scores.<\/p>\n<h3>Define capabilities and SLOs<\/h3>\n<p>For each user-facing capability, write the contract in measurable terms:<\/p>\n<ul>\n<li>Task success rate: for example, a grounded answer with at least two citations from the approved corpus, at a 99.2% weekly SLO.<\/li>\n<li>Latency: p50 1.5s, p95 4s. Set these per step and end-to-end.<\/li>\n<li>Freshness: indexed content lag under 5 minutes for tier A docs.<\/li>\n<li>Structured output validity: JSON schema valid 99.9% of the time.<\/li>\n<\/ul>\n<p>Tie these to error budgets. If you burn 50% of the budget by Wednesday, you slow feature rollouts or reduce exploration temperature.<\/p>\n<h3>Orchestrate with a state machine, not spaghetti<\/h3>\n<ul>\n<li>Make steps explicit: retrieve, synthesize, verify, format, finalize.<\/li>\n<li>Per-step timeouts, backoff with jitter, and idempotency keys for tool calls.<\/li>\n<li>Circuit breakers for external providers. If a model or DB is flaking, short-circuit to a fallback or degrade mode.<\/li>\n<li>Concurrency control. Limit parallel tools to avoid thundering herds on shared backends.<\/li>\n<\/ul>\n<h3>Control nondeterminism<\/h3>\n<ul>\n<li>Use temperature, top_p, and grammar-constrained decoding for structured outputs. 
Prefer grammar- or JSON-schema-guided decoding over fragile regex fixes.<\/li>\n<li>Vendor updates introduce drift. Pin model versions where possible. Shadow-test new versions with traffic sampling before promotion.<\/li>\n<li>Cache at the right layers: retrieval results, tool results, and final answers. Use a semantic cache only if you track prompt and context hashes to avoid stale leakage.<\/li>\n<\/ul>\n<h3>Retrieval is a system inside your system<\/h3>\n<p>Common RAG failure modes I keep seeing:<\/p>\n<ul>\n<li>Embedding drift when you re-embed queries but not documents. Instant recall degradation.<\/li>\n<li>Chunking pathologies. Splitting tables and code blocks kills answerability.<\/li>\n<li>Index staleness and backfill gaps. Your metrics see traffic, not coverage.<\/li>\n<li>Top-k too low or uncalibrated per query type. One size does not fit all.<\/li>\n<\/ul>\n<p>Controls that help:<\/p>\n<ul>\n<li>Dual encoders or query rewriting for known query classes. Do not trust a single default prompt.<\/li>\n<li>Freshness monitors: canary docs re-queried every minute, with alerts if they go missing.<\/li>\n<li>A small set of golden queries with expected citations. Run them continuously in prod.<\/li>\n<li>Automatic top-k tuning based on retrieval confidence scores. If confidence is low, expand k or abstain.<\/li>\n<\/ul>\n<h3>Tooling and structured I\/O contracts<\/h3>\n<ul>\n<li>Make a JSON schema the source of truth for tool arguments and model outputs. Enforce it at runtime. Reject early, with helpful retries.<\/li>\n<li>Constrained decoding or function calling with a strict schema reduces invalid output by an order of magnitude in most stacks.<\/li>\n<li>Use a normalizer layer. Strip trailing commas, fix common date formats, clamp numbers to expected ranges. Fail closed if normalization cannot repair the output.<\/li>\n<li>In one audit, 17% of failures were just malformed JSON caused by a newline in a string field. 
Fixing decoding removed the top incident class in 24 hours.<\/li>\n<\/ul>\n<h3>Observability for AI, not just microservices<\/h3>\n<ul>\n<li>Trace every step with a correlation ID across retrieval, model calls, and tools. Include prompt version, input hash, and feature flags.<\/li>\n<li>Log minimal but sufficient context. Token dumps are a privacy trap. Hash PII and store references, not bodies.<\/li>\n<li>Build first-class metrics: grounded answer rate, abstain rate, schema validity rate, tool success rate, retrieval coverage, and budget burn-down.<\/li>\n<li>Keep a replay harness. Given an event, you should be able to rerun the flow deterministically with pinned versions and seeds.<\/li>\n<\/ul>\n<h2>Practical solutions that hold up in production<\/h2>\n<h3>1) Choose a reliability strategy, not just a model<\/h3>\n<ul>\n<li>Two-tier inference: start with a fast model, escalate to a stronger model when confidence is low or when the user is high value. Confidence can come from retrieval coverage, self-checkers, or simple heuristics like missing citations.<\/li>\n<li>Degrade modes: if the vector DB is unhealthy, fall back to keyword search plus tighter prompts, or return a graceful abstain with next steps. A confidently wrong answer is more expensive than an honest abstain.<\/li>\n<li>Provider failover: keep at least two vendors integrated. Health-check them and route using circuit breakers. Do not flip vendors without shadow tests.<\/li>\n<\/ul>\n<p>Trade-offs:<\/p>\n<ul>\n<li>Escalation increases tail latency. Budget for it. Users forgive 4s once in a while if answers are correct and consistent.<\/li>\n<li>Degrade modes reduce conversion if overused. Tune thresholds and watch your error budget.<\/li>\n<\/ul>\n<h3>2) Guard your structured outputs<\/h3>\n<ul>\n<li>Use a grammar or JSON schema with constrained decoding. This alone often moves JSON validity from 97% to 99.9%.<\/li>\n<li>Add a lightweight validator and normalizer. Retry once with a repair hint. 
If still invalid, escalate or return a typed error the client can handle.<\/li>\n<\/ul>\n<h3>3) Make retrieval prove its work<\/h3>\n<ul>\n<li>Require citations and verify that they come from the allowed corpus. Reject answers with missing or off-domain citations.<\/li>\n<li>Compute a grounding score. If it falls below threshold, either expand retrieval or abstain.<\/li>\n<li>Refresh embeddings and indexes in lockstep with a version tag. Route traffic by version until parity is confirmed.<\/li>\n<\/ul>\n<h3>4) Version everything that changes outcomes<\/h3>\n<ul>\n<li>Prompts, tools, retrievers, chunkers, ranking rules, and models. Attach versions to traces and experiments.<\/li>\n<li>Use feature flags for rollout. Percent-based, per tenant if needed. Roll forward or back instantly.<\/li>\n<\/ul>\n<h3>5) Test in the tails<\/h3>\n<ul>\n<li>Chaos tests: kill the vector DB, trigger rate limits, inject invalid tool responses. Observe degrade modes and p99s.<\/li>\n<li>Golden tasks: production queries that matter. Run them continuously and alert on failures within minutes, not days.<\/li>\n<li>Shadow traffic: a new model or prompt runs in parallel and is scored without user impact.<\/li>\n<\/ul>\n<h3>6) Build the right retry posture<\/h3>\n<ul>\n<li>Retries hide incidents and burn tokens. Cap them. Use exponential backoff with jitter. Make retries idempotent with request keys.<\/li>\n<li>Prefer diversify-then-vote over blind retries. If you must retry, vary the prompt or tool choice.<\/li>\n<\/ul>\n<h3>7) Decide when to be deterministic<\/h3>\n<ul>\n<li>Compliance and billing flows should minimize sampling. Use temperature 0 or top_p near zero, or models with deterministic decoding. Consider small, fine-tuned models for narrow tasks where you can guarantee schema adherence.<\/li>\n<li>Creative tasks can be looser but still bounded by contracts.<\/li>\n<\/ul>\n<h2>Business impact that leadership actually feels<\/h2>\n<ul>\n<li>Support cost: a single hallucinated refund policy can generate a queue of tickets. 
Reliability cuts these at the source.<\/li>\n<li>Conversion: consistent answers drive trust. In one B2B workflow, moving grounded answer rate from 96.8% to 99.2% increased task completion by 7% with no model change, only orchestration fixes.<\/li>\n<li>Cost control: fewer retries, fewer escalations, and better caching reduce token burn by 20 to 40% in most stacks I tune.<\/li>\n<li>Vendor risk: dual-vendor setups with circuit breakers turn vendor incidents from outages into minor blips. This matters during quarter-end crunches.<\/li>\n<li>Scaling: if p95 is unstable at 1x traffic, expect a cliff at 3x. Reliability work flattens that curve.<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Reliability is an end-to-end contract, not a model metric.<\/li>\n<li>Design around SLOs, error budgets, and degrade modes.<\/li>\n<li>Constrain and validate structured outputs with real schemas.<\/li>\n<li>Treat retrieval as a product with its own SLOs and monitors.<\/li>\n<li>Version everything that can change outcomes and trace it.<\/li>\n<li>Retries are not a strategy. Escalate smartly or abstain.<\/li>\n<li>Shadow and canary before you flip anything in prod.<\/li>\n<\/ul>\n<h2>If this is hitting close to home<\/h2>\n<p>If your stack passes offline evals but breaks in the tails, you are not alone. I help teams put reliability into the architecture so the fire drills stop and the business metrics move. If you want a second set of eyes on your orchestration, retrieval, or SLO design, this is exactly the kind of work I do when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The outage did not care about your 82% accuracy Your eval showed 82% accuracy last week. 
PagerDuty still went off at 2:13 AM because: The vector DB had a 99th&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3],"tags":[26,22,15],"class_list":["post-64","post","type-post","status-publish","format-standard","hentry","category-ai-architecture","tag-ai-eval","tag-ai-observability","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/64","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=64"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/64\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=64"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=64"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=64"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}