{"id":29,"date":"2025-06-15T10:23:41","date_gmt":"2025-06-15T10:23:41","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/common-mistakes-in-ai-architecture-design\/"},"modified":"2026-04-09T23:27:59","modified_gmt":"2026-04-09T23:27:59","slug":"common-mistakes-in-ai-architecture-design","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/06\/15\/common-mistakes-in-ai-architecture-design\/","title":{"rendered":"Common mistakes in AI architecture design that cost you uptime, accuracy, and money"},"content":{"rendered":"<h2>The recurring smell<\/h2>\n<p>Most AI outages I get called into are not model problems. They are architecture problems disguised as model issues. Latency spikes, random failures, wrong answers, costs drifting 3x in a quarter. Everyone blames the LLM. Nine times out of ten, it is control planes with no guardrails, retrieval that drifts over time, or retry storms you created yourself.<\/p>\n<p>I have seen teams spend weeks swapping models while a single missing backpressure valve kept their production p95 at 15 seconds.<\/p>\n<h2>Where this shows up and why<\/h2>\n<ul>\n<li>RAG systems that worked in a demo, then collapse when the index grows 10x<\/li>\n<li>Agents that loop tools and eat your monthly budget in a weekend<\/li>\n<li>Multi-tenant apps where one noisy customer takes everyone down<\/li>\n<li>Model upgrades that silently degrade quality because eval is weak or missing<\/li>\n<\/ul>\n<p>Why it happens:<\/p>\n<ul>\n<li>PoC code becomes production by inertia<\/li>\n<li>LLM treated as a pure function, but the workload is stateful and spiky<\/li>\n<li>Over-trusting a vector database to solve retrieval without governance<\/li>\n<li>No observability that ties cost, latency, and content back to a specific query and version<\/li>\n<\/ul>\n<p>What most teams misunderstand:<\/p>\n<ul>\n<li>More context is not better. It is slower, costlier, and often reduces accuracy<\/li>\n<li>Fine-tuning is not a silver bullet. 
Data quality and orchestration usually matter more<\/li>\n<li>\u201cWe log prompts\u201d is not observability. You need structured traces with lineage and versioning<\/li>\n<\/ul>\n<h2>Deep dive into the mistakes<\/h2>\n<h3>1) Stateless thinking in a stateful workflow<\/h3>\n<p>Symptoms: long chats with ballooning context, tool-call loops, inconsistent answers between turns.<\/p>\n<p>Why: requests are stateless, but the task is not. Without session policy, token budgets, and memory discipline, you create unbounded context.<\/p>\n<p>Failure mode: p95 latency and cost creep up over time. Accuracy drops as the prompt becomes noise.<\/p>\n<h3>2) Retrieval by vibes<\/h3>\n<p>Symptoms: top k=5 stuffed into the prompt, irrelevant chunks, month-old indexes, and silent schema drift.<\/p>\n<p>Why: teams skip document normalization, chunk strategy, and metadata governance. They never version embeddings or indexes.<\/p>\n<p>Failure mode: evaluation looks fine on day 1, degradation begins on day 30, support tickets pile up by day 90.<\/p>\n<h3>3) No backpressure or circuit breakers<\/h3>\n<p>Symptoms: provider 429s, cascading retries, timeouts that fan out to every dependency.<\/p>\n<p>Why: direct synchronous calls to LLMs with optimistic concurrency. No queues, no rate limiters, no budget caps.<\/p>\n<p>Failure mode: a single spike or a minor provider incident takes down your entire path.<\/p>\n<h3>4) Over-abstracted orchestration<\/h3>\n<p>Symptoms: a pretty graph of nodes that hides real costs and token paths. Hard to debug, impossible to tune.<\/p>\n<p>Why: library-first design. Business logic gets buried under a flow framework with magic retries and hidden prompts.<\/p>\n<p>Failure mode: fragmented prompts, duplicate tool calls, and ghost retries you cannot turn off.<\/p>\n<h3>5) Caching the wrong thing<\/h3>\n<p>Symptoms: 2 percent cache hit rate and negligible savings.<\/p>\n<p>Why: caching full completions with variable headers or timestamps. 
No normalization, no tiered caches, no pre-LLM caching.<\/p>\n<p>Failure mode: you pay for the same retrieval, rerank, and system prompt again and again.<\/p>\n<h3>6) Premature fine-tuning<\/h3>\n<p>Symptoms: expensive training cycles to fix what is really prompt discipline or retrieval quality.<\/p>\n<p>Why: fine-tuning feels like control. The issue was data freshness, schema, or guardrails.<\/p>\n<p>Failure mode: model lock-in, infra sprawl, and no measurable uplift.<\/p>\n<h3>7) Weak multi-tenant isolation<\/h3>\n<p>Symptoms: one large customer burns through RPS and tokens, everyone else slows down.<\/p>\n<p>Why: global limits only. No per-tenant concurrency, no per-tenant cost caps, no shard-aware caches.<\/p>\n<p>Failure mode: noisy neighbor plus confusing bills.<\/p>\n<h3>8) Thin evaluation and no change management<\/h3>\n<p>Symptoms: model upgrade ships. Support volume jumps. Nobody knows why.<\/p>\n<p>Why: narrow golden sets, no hallucination checks, no adversarial prompts, no regression gates, no shadowing.<\/p>\n<p>Failure mode: quality whiplash on every dependency change across model, embeddings, or index.<\/p>\n<h3>9) Missing observability where it matters<\/h3>\n<p>Symptoms: logs everywhere, insight nowhere.<\/p>\n<p>Why: unstructured logs, no request IDs, no token or cost annotation, no linkage from final answer to retrieved docs and versions.<\/p>\n<p>Failure mode: you cannot reproduce or defend a single output.<\/p>\n<h2>Practical fixes that hold up<\/h2>\n<h3>Put the control plane first<\/h3>\n<ul>\n<li>Per-tenant quotas: RPS, concurrent calls, and token budgets<\/li>\n<li>Cost guards: max tokens per request, per flow, and per session<\/li>\n<li>Circuit breakers per provider and per model family, with clear fallback chains<\/li>\n<li>Backpressure: queue spikes, not people. Use priority queues for human-in-the-loop paths<\/li>\n<\/ul>\n<h3>Treat context as a scarce resource<\/h3>\n<ul>\n<li>Hard token budgets per step. 
Force summarization or truncation by role: system > instructions > user > retrieved<\/li>\n<li>Dedup retrieval by document ID and section. Do not paste the same paragraph twice<\/li>\n<li>Rerank before generate. Rerankers are cheap. Generations are not<\/li>\n<\/ul>\n<h3>Make retrieval boring and reliable<\/h3>\n<ul>\n<li>Normalize and chunk by semantic units, not fixed windows only<\/li>\n<li>Attach tight metadata: source, section, version, timestamp, permissions<\/li>\n<li>Version embeddings and indexes. Store embedding model name, dim, and creation time<\/li>\n<li>Refresh policies: incremental builds daily, full rebuilds weekly or on schema change<\/li>\n<\/ul>\n<h3>Cache where it matters<\/h3>\n<ul>\n<li>Tier 1: pre-LLM cache for deterministic steps, like retrieval results and rerank outputs keyed by normalized query and index version<\/li>\n<li>Tier 2: prompt template plus normalized variables for short deterministic generations<\/li>\n<li>Normalize inputs: lowercase, strip timestamps, sort keys<\/li>\n<li>Track hit rate and dollar savings per cache layer<\/li>\n<\/ul>\n<h3>Observability that survives audits<\/h3>\n<ul>\n<li>Structured tracing with a single correlation ID per request and per session<\/li>\n<li>Log token counts, latency, model name, prompt hash, and exact retrieved document IDs with versions<\/li>\n<li>Capture final answer plus intermediate tool outputs and decisions<\/li>\n<li>Redact PII at the edge. Encrypt traces at rest. Keep a retention policy<\/li>\n<\/ul>\n<h3>Evaluation with teeth<\/h3>\n<ul>\n<li>Golden sets by scenario, not just random samples. 
Include long input, adversarial input, known tricky entities<\/li>\n<li>Mix judges: rule checks, LLM judges, and task-specific metrics<\/li>\n<li>Shadow deploy and canary any change to model, embedding, index, or prompt<\/li>\n<li>Require non-regression thresholds on quality and latency to ship<\/li>\n<\/ul>\n<h3>Keep orchestration legible<\/h3>\n<ul>\n<li>Prefer explicit code over magic DAGs for core logic<\/li>\n<li>One prompt per responsibility. Small, named, versioned<\/li>\n<li>Make tool calls idempotent, with timeouts and retries that you control<\/li>\n<\/ul>\n<h3>Multi-tenant isolation as a feature<\/h3>\n<ul>\n<li>Per-tenant concurrency and token rate limits<\/li>\n<li>Separate queues and caches for large tenants<\/li>\n<li>Cost attribution at the request level and monthly caps with alerts<\/li>\n<\/ul>\n<h2>Business impact in real numbers<\/h2>\n<ul>\n<li>Latency: cutting unbounded context and adding rerankers typically drops p95 by 30 to 60 percent<\/li>\n<li>Cost: pre-LLM caching plus token budgets often reduces spend 25 to 50 percent. In one case, cache and rerank lowered generation spend from 42k to 19k per month<\/li>\n<li>Reliability: circuit breakers and backpressure take 429 error rates from bursts of 15 percent to under 1 percent during provider incidents<\/li>\n<li>Quality: retrieval governance and real eval stop the slow drift that drives support cost. Expect 20 to 40 percent reduction in escalations for RAG-heavy products<\/li>\n<\/ul>\n<h2>What to remember<\/h2>\n<ul>\n<li>Put guardrails around tokens, not just requests<\/li>\n<li>Retrieval is a data system. Version it like one<\/li>\n<li>Backpressure and circuit breakers are mandatory if you call external models<\/li>\n<li>Cache earlier in the flow. 
Normalize everything<\/li>\n<li>Eval is a release gate, not a dashboard after the fact<\/li>\n<li>Observability needs lineage: model, prompt, retrieval, and costs tied to the same trace<\/li>\n<\/ul>\n<h2>If this sounds familiar<\/h2>\n<p>If you see rising p95, climbing token bills, or quality that drifts month to month, it is probably not a model problem. It is architecture. This is exactly the kind of work I help teams untangle when systems start breaking at scale. Happy to take a look at your traces, budgets, and retrieval pipeline and point out where the leaks are.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The recurring smell Most AI outages I get called into are not model problems. They are architecture problems disguised as model issues. Latency spikes, random failures, wrong answers, costs drifting&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[12],"tags":[17,22,15],"class_list":["post-29","post","type-post","status-publish","format-standard","hentry","category-ai-failures","tag-ai-cost","tag-ai-observability","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/29","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=29"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/29\/revisions"}],"predecessor-version":[{"id":90,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/
v2\/posts\/29\/revisions\/90"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=29"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=29"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=29"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}