The recurring smell
Most AI outages I get called into are not model problems. They are architecture problems disguised as model issues. Latency spikes, random failures, wrong answers, costs drifting 3x in a quarter. Everyone blames the LLM. Nine times out of ten, it is control planes with no guardrails, retrieval that drifts over time, or retry storms you created yourself.
I have seen teams spend weeks swapping models while a single missing backpressure valve kept their production p95 at 15 seconds.
Where this shows up and why
- RAG systems that worked in a demo, then collapse when the index grows 10x
- Agents that loop on tool calls and eat your monthly budget in a weekend
- Multi-tenant apps where one noisy customer takes everyone down
- Model upgrades that silently degrade quality because eval is weak or missing
Why it happens:
- PoC code becomes production by inertia
- LLM treated as a pure function, but the workload is stateful and spiky
- Over-trusting a vector database to solve retrieval without governance
- No observability that ties cost, latency, and content back to a specific query and version
What most teams misunderstand:
- More context is not better. It is slower, costlier, and often reduces accuracy
- Fine-tuning is not a silver bullet. Data quality and orchestration usually matter more
- “We log prompts” is not observability. You need structured traces with lineage and versioning
Deep dive into the mistakes
1) Stateless thinking in a stateful workflow
Symptoms: long chats with ballooning context, tool-call loops, inconsistent answers between turns.
Why: requests are stateless, but the task is not. Without session policy, token budgets, and memory discipline, you create unbounded context.
Failure mode: p95 latency and cost creep up over time. Accuracy drops as the prompt becomes noise.
2) Retrieval by vibes
Symptoms: top k=5 stuffed into the prompt, irrelevant chunks, month-old indexes, and silent schema drift.
Why: teams skip document normalization, chunk strategy, and metadata governance. They never version embeddings or indexes.
Failure mode: evaluation looks fine on day 1, degradation begins on day 30, support tickets pile up by day 90.
3) No backpressure or circuit breakers
Symptoms: provider 429, cascading retries, timeouts that fan out to every dependency.
Why: direct synchronous calls to LLMs with optimistic concurrency. No queues, no rate limiters, no budget caps.
Failure mode: a single traffic spike or a minor provider incident takes down your entire request path.
4) Over-abstracted orchestration
Symptoms: a pretty graph of nodes that hides real costs and token paths. Hard to debug, impossible to tune.
Why: library-first design. Business logic gets buried under a flow framework with magic retries and hidden prompts.
Failure mode: fragmented prompts, duplicate tool calls, and ghost retries you cannot turn off.
5) Caching the wrong thing
Symptoms: 2 percent cache hit rate and negligible savings.
Why: caching full completions with variable headers or timestamps. No normalization, no tiered caches, no pre-LLM caching.
Failure mode: you pay for the same retrieval, rerank, and system prompt again and again.
6) Premature fine-tuning
Symptoms: expensive training cycles to fix what is really prompt discipline or retrieval quality.
Why: fine-tuning feels like control. The issue was data freshness, schema, or guardrails.
Failure mode: model lock-in, infra sprawl, and no measurable uplift.
7) Weak multi-tenant isolation
Symptoms: one large customer burns through RPS and tokens, everyone else slows down.
Why: global limits only. No per-tenant concurrency, no per-tenant cost caps, no shard-aware caches.
Failure mode: noisy neighbor plus confusing bills.
8) Thin evaluation and no change management
Symptoms: model upgrade ships. Support volume jumps. Nobody knows why.
Why: narrow golden sets, no hallucination checks, no adversarial prompts, no regression gates, no shadowing.
Failure mode: quality whiplash on every dependency change across model, embeddings, or index.
9) Missing observability where it matters
Symptoms: logs everywhere, insight nowhere.
Why: unstructured logs, no request IDs, no token or cost annotation, no linkage from final answer to retrieved docs and versions.
Failure mode: you cannot reproduce or defend a single output.
Practical fixes that hold up
Put the control plane first
- Per-tenant quotas: RPS, concurrent calls, and token budgets
- Cost guards: max tokens per request, per flow, and per session
- Circuit breakers per provider and per model family, with clear fallback chains
- Backpressure: queue spikes, not people. Use priority queues for human-in-the-loop paths
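The breaker-plus-fallback part of this is small enough to sketch. Here is a minimal version, assuming a list of provider callables tried in priority order; the threshold and cooldown numbers are illustrative, not recommendations:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, lets a probe
    request through again after `cooldown` seconds."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # half-open: allow one probe after the cooldown elapses
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

def call_with_fallback(breakers, providers, request):
    """Try providers in order, skipping any whose breaker is open."""
    for name, call in providers:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = call(request)
            breaker.record_success()
            return name, result
        except Exception:
            breaker.record_failure()
    raise RuntimeError("all providers unavailable")
```

The point is that the fallback chain is explicit and yours, not a retry loop hidden inside a client library.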
Treat context as a scarce resource
- Hard token budgets per step. Force summarization or truncation by role: system > instructions > user > retrieved
- Dedup retrieval by document ID and section. Do not paste the same paragraph twice
- Rerank before generate. Rerankers are cheap. Generations are not
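A budget enforcer and a dedup pass are a few lines each. This sketch uses a crude word count as the token counter; swap in your model's tokenizer. Section order encodes the role priority from above:

```python
def budget_context(sections, max_tokens, count=lambda s: len(s.split())):
    """Trim context to a hard token budget. `sections` is a list of
    (role, text) in priority order: system > instructions > user >
    retrieved. The first overflowing section is truncated; everything
    after it is dropped."""
    kept, used = [], 0
    for role, text in sections:
        tokens = count(text)
        if used + tokens <= max_tokens:
            kept.append((role, text))
            used += tokens
        else:
            remaining = max_tokens - used
            if remaining > 0:
                kept.append((role, " ".join(text.split()[:remaining])))
            break
    return kept

def dedup_chunks(chunks):
    """Drop repeated (doc_id, section) chunks before prompting, so
    the same paragraph is never pasted twice."""
    seen, out = set(), []
    for c in chunks:
        key = (c["doc_id"], c["section"])
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out
```

Retrieved content sits last in the priority order on purpose: it is the part you can always re-fetch, summarize, or rerank down.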
Make retrieval boring and reliable
- Normalize and chunk by semantic units, not fixed windows only
- Attach tight metadata: source, section, version, timestamp, permissions
- Version embeddings and indexes. Store embedding model name, dim, and creation time
- Refresh policies: incremental builds daily, full rebuilds weekly or on schema change
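The versioning piece can be as simple as a manifest stored next to each index build. The field names here are illustrative, not a standard schema; the two checks are the ones that matter, refusing mismatched embeddings and flagging stale builds:

```python
from dataclasses import dataclass
import datetime as dt

@dataclass
class IndexManifest:
    """Version record written alongside each index build."""
    index_name: str
    embedding_model: str
    embedding_dim: int
    schema_version: str
    built_at: dt.datetime

    def compatible_with(self, query_model: str, query_dim: int) -> bool:
        """Refuse to search if the query embedding does not match
        the embeddings this index was built with."""
        return (self.embedding_model == query_model
                and self.embedding_dim == query_dim)

    def needs_rebuild(self, now: dt.datetime, max_age_days: int = 7) -> bool:
        """Drive the weekly full-rebuild policy from the manifest."""
        return (now - self.built_at).days >= max_age_days
```

Checking `compatible_with` at query time is what turns a silent embedding-model upgrade into a loud, immediate failure instead of month-two quality drift.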
Cache where it matters
- Tier 1: pre-LLM cache for deterministic steps, like retrieval results and rerank outputs keyed by normalized query and index version
- Tier 2: prompt template plus normalized variables for short deterministic generations
- Normalize inputs: lowercase, strip timestamps, sort keys
- Track hit rate and dollar savings per cache layer
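Most low-hit-rate caches die at the key, so here is a minimal sketch of a normalized key builder. The timestamp regex only covers ISO-style dates and is an assumption; extend it for whatever volatile tokens your queries actually carry:

```python
import hashlib
import json
import re

def cache_key(query, index_version, prompt_version):
    """Build a stable pre-LLM cache key: lowercase, collapse
    whitespace, mask timestamps, and include every version that
    affects the result so upgrades invalidate old entries."""
    q = query.lower().strip()
    q = re.sub(r"\s+", " ", q)
    q = re.sub(r"\d{4}-\d{2}-\d{2}[t ]?[\d:]*", "<ts>", q)
    payload = json.dumps(
        {"q": q, "index": index_version, "prompt": prompt_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because the index and prompt versions are part of the key, a rebuild or prompt change never serves stale results; it simply starts a fresh cache generation.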
Observability that survives audits
- Structured tracing with a single correlation ID per request and per session
- Log token counts, latency, model name, prompt hash, and exact retrieved document IDs with versions
- Capture final answer plus intermediate tool outputs and decisions
- Redact PII at the edge. Encrypt traces at rest. Keep a retention policy
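Concretely, one trace event per request with lineage baked in. The field set below is a suggestion, not a spec; the non-negotiable part is that the answer, the retrieved document versions, and the costs share one correlation ID:

```python
import time

def make_trace(correlation_id, session_id, model, prompt_hash,
               retrieved, answer, tokens_in, tokens_out,
               cost_usd, latency_ms):
    """One structured trace event per request. `retrieved` is a list
    of (doc_id, doc_version) pairs so every answer has lineage back
    to exactly what was in the context."""
    return {
        "correlation_id": correlation_id,
        "session_id": session_id,
        "model": model,
        "prompt_hash": prompt_hash,
        "retrieved_docs": [{"doc_id": d, "version": v} for d, v in retrieved],
        "answer": answer,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        "ts": time.time(),
    }
```

Ship these as JSON lines to whatever store you already have; the schema matters far more than the backend.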
Evaluation with teeth
- Golden sets by scenario, not just random samples. Include long input, adversarial input, known tricky entities
- Mix judges: rule checks, LLM judges, and task-specific metrics
- Shadow deploy and canary any change to model, embedding, index, or prompt
- Require non-regression thresholds on quality and latency to ship
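A release gate with teeth is a comparison, not a dashboard. A minimal sketch, assuming each eval run produces a `quality` score in 0..1 and a `p95_ms` latency; the thresholds are placeholders to tune per product:

```python
def regression_gate(baseline, candidate,
                    max_quality_drop=0.01, max_latency_increase=0.10):
    """Block a release if quality drops more than `max_quality_drop`
    (absolute) or p95 latency rises more than `max_latency_increase`
    (relative) versus the baseline run."""
    quality_drop = baseline["quality"] - candidate["quality"]
    latency_rise = (candidate["p95_ms"] - baseline["p95_ms"]) / baseline["p95_ms"]
    failures = []
    if quality_drop > max_quality_drop:
        failures.append(f"quality dropped {quality_drop:.3f}")
    if latency_rise > max_latency_increase:
        failures.append(f"p95 rose {latency_rise:.0%}")
    return (len(failures) == 0, failures)
```

Wire this into CI so a model, embedding, index, or prompt change cannot merge without a passing gate, the same way a failing unit test blocks a merge.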
Keep orchestration legible
- Prefer explicit code over magic DAGs for core logic
- One prompt per responsibility. Small, named, versioned
- Tool calls are idempotent with timeouts and retries that you control
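"Retries that you control" means a visible policy, not a framework default. A minimal sketch with bounded attempts and jittered exponential backoff; only the exception types you name get retried, everything else fails fast:

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=0.1, retriable=(TimeoutError,)):
    """Retry a tool call with jittered exponential backoff. Only
    exceptions in `retriable` are retried; the last failure is
    re-raised so nothing is swallowed silently."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise
            # full jitter: spreads retry storms instead of syncing them
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Pair this with idempotent tool endpoints: a retried call that is safe to apply twice turns the backoff above from a risk into plain resilience.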
Multi-tenant isolation as a feature
- Per-tenant concurrency and token rate limits
- Separate queues and caches for large tenants
- Cost attribution at the request level and monthly caps with alerts
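Per-tenant token limits fall out of a token bucket kept per tenant. A sketch, assuming you inject the clock (handy for tests); in production you would back this with a shared store rather than in-process state:

```python
import time

class TenantBudget:
    """Token-bucket limiter for one tenant: refills `rate` tokens per
    second up to `capacity`. One instance per tenant keeps a noisy
    neighbor from draining a global pool."""
    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def try_spend(self, cost, now=None):
        """Spend `cost` tokens if the bucket allows it; otherwise
        reject so the caller can queue or shed the request."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A rejected `try_spend` is also your cost-attribution hook: log it with the tenant ID and you get both the throttle and the billing evidence from one mechanism.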
Business impact in real numbers
- Latency: cutting unbounded context and adding rerankers typically drops p95 by 30 to 60 percent
- Cost: pre-LLM caching plus token budgets often reduces spend 25 to 50 percent. In one case, cache and rerank lowered generation spend from 42k to 19k per month
- Reliability: circuit breakers and backpressure take 429 error rates from bursts of 15 percent to under 1 percent during provider incidents
- Quality: retrieval governance and real eval stop the slow drift that drives support cost. Expect 20 to 40 percent reduction in escalations for RAG-heavy products
What to remember
- Put guardrails around tokens, not just requests
- Retrieval is a data system. Version it like one
- Backpressure and circuit breakers are mandatory if you call external models
- Cache earlier in the flow. Normalize everything
- Eval is a release gate, not a dashboard after the fact
- Observability needs lineage: model, prompt, retrieval, and costs tied to the same trace
If this sounds familiar
If you see rising p95, climbing token bills, or quality that drifts month to month, it is probably not a model problem. It is architecture. This is exactly the kind of work I help teams untangle when systems start breaking at scale. Happy to take a look at your traces, budgets, and retrieval pipeline and point out where the leaks are.

