Common mistakes in AI architecture design that cost you uptime, accuracy, and money

The recurring smell

Most AI outages I get called into are not model problems. They are architecture problems disguised as model issues. Latency spikes, random failures, wrong answers, costs drifting 3x in a quarter. Everyone blames the LLM. Nine times out of ten, it is control planes with no guardrails, retrieval that drifts over time, or retry storms you created yourself.

I have seen teams spend weeks swapping models while a single missing backpressure valve kept their production p95 at 15 seconds.

Where this shows up and why

  • RAG systems that worked in a demo, then collapse when the index grows 10x
  • Agents that loop tools and eat your monthly budget in a weekend
  • Multi-tenant apps where one noisy customer takes everyone down
  • Model upgrades that silently degrade quality because eval is weak or missing

Why it happens:

  • PoC code becomes production by inertia
  • LLM treated as a pure function, but the workload is stateful and spiky
  • Over-trusting a vector database to solve retrieval without governance
  • No observability that ties cost, latency, and content back to a specific query and version

What most teams misunderstand:

  • More context is not better. It is slower, costlier, and often reduces accuracy
  • Fine-tuning is not a silver bullet. Data quality and orchestration usually matter more
  • “We log prompts” is not observability. You need structured traces with lineage and versioning

Deep dive into the mistakes

1) Stateless thinking in a stateful workflow

Symptoms: long chats with ballooning context, tool-call loops, inconsistent answers between turns.

Why: requests are stateless, but the task is not. Without session policy, token budgets, and memory discipline, you create unbounded context.

Failure mode: p95 latency and cost creep up over time. Accuracy drops as the prompt becomes noise.
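A minimal sketch of the missing session policy: cap context at a hard token budget and keep only the most recent turns that fit. The whitespace token count is a deliberate simplification; a real system would use the model's tokenizer.

```python
# Hypothetical sketch: enforce a hard token budget per session so chat
# context cannot grow without bound. Whitespace splitting is a crude
# stand-in for a real tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def trim_history(messages: list, budget_tokens: int) -> list:
    """Keep the system message, then the most recent turns that fit."""
    system, turns = messages[0], messages[1:]
    kept, used = [], count_tokens(system["content"])
    # Walk turns newest-first so recency wins when the budget is tight.
    for msg in reversed(turns):
        cost = count_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```

The point is not the trimming heuristic itself but that the budget is explicit and enforced on every turn, instead of letting context grow until latency and cost force the issue.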

2) Retrieval by vibes

Symptoms: top k=5 stuffed into the prompt, irrelevant chunks, month-old indexes, and silent schema drift.

Why: teams skip document normalization, chunk strategy, and metadata governance. They never version embeddings or indexes.

Failure mode: evaluation looks fine on day 1, degradation begins on day 30, support tickets pile up by day 90.

3) No backpressure or circuit breakers

Symptoms: provider 429, cascading retries, timeouts that fan out to every dependency.

Why: direct synchronous calls to LLMs with optimistic concurrency. No queues, no rate limiters, no budget caps.

Failure mode: a single spike or a minor provider incident takes down your entire path.
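The missing piece is usually small. A circuit breaker can be sketched in a few lines, with all thresholds illustrative rather than recommended values:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: opens after N consecutive
    failures, allows a probe again after a cooldown. Thresholds here
    are placeholders, not recommendations."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a probe request through.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

In production you would want one breaker per provider and per model family, as described below, plus a fallback chain for when a breaker is open.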

4) Over-abstracted orchestration

Symptoms: a pretty graph of nodes that hides real costs and token paths. Hard to debug, impossible to tune.

Why: library-first design. Business logic gets buried under a flow framework with magic retries and hidden prompts.

Failure mode: fragmented prompts, duplicate tool calls, and ghost retries you cannot turn off.

5) Caching the wrong thing

Symptoms: 2 percent cache hit rate and negligible savings.

Why: caching full completions with variable headers or timestamps. No normalization, no tiered caches, no pre-LLM caching.

Failure mode: you pay for the same retrieval, rerank, and system prompt again and again.

6) Premature fine-tuning

Symptoms: expensive training cycles to fix what is really prompt discipline or retrieval quality.

Why: fine-tuning feels like control. The issue was data freshness, schema, or guardrails.

Failure mode: model lock-in, infra sprawl, and no measurable uplift.

7) Weak multi-tenant isolation

Symptoms: one large customer burns through RPS and tokens, everyone else slows down.

Why: global limits only. No per-tenant concurrency, no per-tenant cost caps, no shard-aware caches.

Failure mode: noisy neighbor plus confusing bills.

8) Thin evaluation and no change management

Symptoms: model upgrade ships. Support volume jumps. Nobody knows why.

Why: narrow golden sets, no hallucination checks, no adversarial prompts, no regression gates, no shadowing.

Failure mode: quality whiplash on every dependency change across model, embeddings, or index.

9) Missing observability where it matters

Symptoms: logs everywhere, insight nowhere.

Why: unstructured logs, no request IDs, no token or cost annotation, no linkage from final answer to retrieved docs and versions.

Failure mode: you cannot reproduce or defend a single output.

Practical fixes that hold up

Put the control plane first

  • Per-tenant quotas: RPS, concurrent calls, and token budgets
  • Cost guards: max tokens per request, per flow, and per session
  • Circuit breakers per provider and per model family, with clear fallback chains
  • Backpressure: queue spikes, not people. Use priority queues for human-in-the-loop paths
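The cost-guard bullet can be made concrete with a small sketch. Everything here is hypothetical, including the limits and the session-keyed accounting:

```python
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    """Illustrative cost guard: hard caps per request and per session.
    Limits are placeholders; a real guard would also persist state
    and expire sessions."""

    def __init__(self, max_per_request: int, max_per_session: int):
        self.max_per_request = max_per_request
        self.max_per_session = max_per_session
        self.spent = {}  # session_id -> tokens consumed so far

    def charge(self, session_id: str, tokens: int) -> None:
        if tokens > self.max_per_request:
            raise BudgetExceeded(f"request wants {tokens} tokens")
        total = self.spent.get(session_id, 0) + tokens
        if total > self.max_per_session:
            raise BudgetExceeded(f"session {session_id} over budget")
        self.spent[session_id] = total
```

The key property: a runaway session fails loudly at a known ceiling instead of silently inflating the bill.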

Treat context as a scarce resource

  • Hard token budgets per step. Force summarization or truncation by role: system > instructions > user > retrieved
  • Dedup retrieval by document ID and section. Do not paste the same paragraph twice
  • Rerank before generate. Rerankers are cheap. Generations are not
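The dedup step is trivial to implement and routinely missing. A sketch, assuming chunks arrive sorted by score descending with `doc_id` and `section` fields (names are illustrative):

```python
def dedup_chunks(chunks: list) -> list:
    """Drop duplicate retrieved chunks by (doc_id, section), keeping
    the first (highest-scoring) copy. Assumes chunks are already
    sorted by score descending."""
    seen, out = set(), []
    for c in chunks:
        key = (c["doc_id"], c["section"])
        if key in seen:
            continue
        seen.add(key)
        out.append(c)
    return out
```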

Make retrieval boring and reliable

  • Normalize and chunk by semantic units, not fixed windows only
  • Attach tight metadata: source, section, version, timestamp, permissions
  • Version embeddings and indexes. Store embedding model name, dim, and creation time
  • Refresh policies: incremental builds daily, full rebuilds weekly or on schema change
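Versioning an index is mostly a matter of writing down a manifest at build time and refusing to serve queries without one. A sketch of the shape, with field values invented for illustration:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class IndexManifest:
    """Sketch of the metadata to pin to every index build. Field
    names are illustrative, not a standard schema."""
    embedding_model: str
    embedding_dim: int
    chunking: str
    index_version: str
    built_at: str

manifest = IndexManifest(
    embedding_model="example-embed-v2",  # hypothetical model name
    embedding_dim=1024,
    chunking="semantic-sections",
    index_version="2024-06-01.1",
    built_at=datetime.now(timezone.utc).isoformat(),
)
```

With this in place, every trace and every cache key can reference `index_version`, which is what makes day-30 drift diagnosable.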

Cache where it matters

  • Tier 1: pre-LLM cache for deterministic steps, like retrieval results and rerank outputs keyed by normalized query and index version
  • Tier 2: prompt template plus normalized variables for short deterministic generations
  • Normalize inputs: lowercase, strip timestamps, sort keys
  • Track hit rate and dollar savings per cache layer
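Most of the hit-rate gap comes from key construction. A sketch of a normalized cache key, with the timestamp-stripping pattern as a simplified example of the kind of volatile token you need to remove:

```python
import hashlib
import json
import re

def cache_key(query: str, index_version: str, params: dict) -> str:
    """Normalize then hash: lowercase, collapse whitespace, replace
    ISO-like timestamps with a placeholder, sort parameter keys.
    The timestamp regex is a simplified illustration."""
    q = re.sub(r"\s+", " ", query.strip().lower())
    q = re.sub(r"\d{4}-\d{2}-\d{2}[t ]?[\d:.]*z?", "<ts>", q)
    payload = json.dumps(
        {"q": q, "idx": index_version, "p": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Note that the index version is part of the key, so a rebuild invalidates the cache automatically instead of serving stale retrieval results.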

Observability that survives audits

  • Structured tracing with a single correlation ID per request and per session
  • Log token counts, latency, model name, prompt hash, and exact retrieved document IDs with versions
  • Capture final answer plus intermediate tool outputs and decisions
  • Redact PII at the edge. Encrypt traces at rest. Keep a retention policy
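Structured tracing does not need a heavy framework to start. A sketch of one trace event per LLM call, with field names invented for illustration rather than taken from any standard schema:

```python
import time
import uuid

def trace_record(session_id: str, model: str, prompt_hash: str,
                 retrieved: list, usage: dict, latency_ms: float) -> dict:
    """One structured trace event per LLM call. `retrieved` is a list
    of {"doc_id": ..., "version": ...} so the answer has lineage back
    to exact document versions. Field names are illustrative."""
    return {
        "trace_id": str(uuid.uuid4()),
        "session_id": session_id,
        "ts": time.time(),
        "model": model,
        "prompt_hash": prompt_hash,
        "retrieved_docs": retrieved,
        "tokens_in": usage["in"],
        "tokens_out": usage["out"],
        "latency_ms": latency_ms,
    }
```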

Evaluation with teeth

  • Golden sets by scenario, not just random samples. Include long input, adversarial input, known tricky entities
  • Mix judges: rule checks, LLM judges, and task-specific metrics
  • Shadow deploy and canary any change to model, embedding, index, or prompt
  • Require non-regression thresholds on quality and latency to ship
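A regression gate is just a comparison with teeth. A minimal sketch, with the thresholds as placeholders you would tune per product:

```python
def regression_gate(baseline: dict, candidate: dict,
                    max_quality_drop: float = 0.01,
                    max_latency_increase: float = 0.10) -> bool:
    """Ship only if quality does not drop more than max_quality_drop
    (absolute) and p95 latency does not grow more than 10 percent.
    Thresholds are illustrative."""
    quality_ok = candidate["quality"] >= baseline["quality"] - max_quality_drop
    latency_ok = candidate["p95_ms"] <= baseline["p95_ms"] * (1 + max_latency_increase)
    return quality_ok and latency_ok
```

Wire this into CI so a model, embedding, index, or prompt change cannot merge without passing it; a dashboard you look at after shipping is not a gate.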

Keep orchestration legible

  • Prefer explicit code over magic DAGs for core logic
  • One prompt per responsibility. Small, named, versioned
  • Tool calls are idempotent with timeouts and retries that you control
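"Retries that you control" can mean something as plain as this sketch: an explicit backoff loop that only retries transient errors, instead of a framework's hidden policy. The retried exception types and counts are illustrative.

```python
import time

def call_tool(fn, *, attempts: int = 3, base_delay: float = 0.5):
    """Explicit, visible retry policy with exponential backoff.
    Only transient errors are retried; fn is assumed idempotent.
    Attempt count and delays are placeholders."""
    for i in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))
```

Because the policy lives in your code, you can log each attempt, cap total time, and turn retries off per tool, none of which is possible with ghost retries buried in a flow framework.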

Multi-tenant isolation as a feature

  • Per-tenant concurrency and token rate limits
  • Separate queues and caches for large tenants
  • Cost attribution at the request level and monthly caps with alerts
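Per-tenant concurrency is one semaphore per tenant, nothing more exotic. A sketch with illustrative limits, using asyncio since most LLM gateways are async:

```python
import asyncio

class TenantLimiter:
    """Per-tenant concurrency caps so one tenant cannot consume the
    whole pool. Default and override limits are placeholders."""

    def __init__(self, default_limit: int = 4, overrides: dict = None):
        self.default = default_limit
        self.overrides = overrides or {}
        self._sems = {}  # tenant -> asyncio.Semaphore

    def semaphore(self, tenant: str) -> asyncio.Semaphore:
        if tenant not in self._sems:
            limit = self.overrides.get(tenant, self.default)
            self._sems[tenant] = asyncio.Semaphore(limit)
        return self._sems[tenant]
```

Call sites wrap each model call in `async with limiter.semaphore(tenant):` so a large customer queues against their own cap while everyone else's latency stays flat.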

Business impact in real numbers

  • Latency: cutting unbounded context and adding rerankers typically drops p95 by 30 to 60 percent
  • Cost: pre-LLM caching plus token budgets often reduces spend by 25 to 50 percent. In one case, cache and rerank lowered generation spend from 42k to 19k per month
  • Reliability: circuit breakers and backpressure take 429 error rates from bursts of 15 percent to under 1 percent during provider incidents
  • Quality: retrieval governance and real eval stop the slow drift that drives support cost. Expect 20 to 40 percent reduction in escalations for RAG-heavy products

What to remember

  • Put guardrails around tokens, not just requests
  • Retrieval is a data system. Version it like one
  • Backpressure and circuit breakers are mandatory if you call external models
  • Cache earlier in the flow. Normalize everything
  • Eval is a release gate, not a dashboard after the fact
  • Observability needs lineage: model, prompt, retrieval, and costs tied to the same trace

If this sounds familiar

If you see rising p95, climbing token bills, or quality that drifts month to month, it is probably not a model problem. It is architecture. This is exactly the kind of work I help teams untangle when systems start breaking at scale. Happy to take a look at your traces, budgets, and retrieval pipeline and point out where the leaks are.