Why your AI architecture looks right on paper but fails in production

The whiteboard looks perfect. The pager does not.

You can diagram a clean RAG pipeline in five minutes. Vector DB, LLM, a couple of services, job queue, done. It demoed fine at 400 ms. Then you ship, traffic spikes a little, and suddenly you are staring at 9 second P99s, vendors throttling you, and customer tickets about inconsistent answers. The diagram was not wrong. It was incomplete.

I see the same failure pattern across teams. The plan looks right, the system runs, and then real-world variance eats it alive.

Where this breaks and why

  • Distribution shift: Your offline eval used tidy data. Production sends misspellings, PDFs, half-empty records, and angry users who paste screenshots.
  • Latency amplification: Each harmless 150 ms hop stacks. Add retries and you build a latency multiplier. Tail latencies become the norm.
  • Hidden coupling: Retrieval parameters, embeddings, prompt templates, and tool contracts drift silently. You change one, three others degrade.
  • Vendor reality: Rate limits, intermittent timeouts, model version changes. Your code assumes a stable API, but the provider is a moving target.
  • Missing guardrails: No SLOs per step, no circuit breakers, no backpressure. One slow dependency drags the whole system.
  • Observability gap: You can see logs, not the truth. You do not trace from user intent to retrieval to generation with costs and outputs linked.

Most teams think they have “a model problem.” It is almost always a system problem.

Technical deep dive: what fails in production

Retrieval that looks good offline and misses in the wild

  • Embedding mismatch: You trained or validated on the wrong embedding model, then upgraded embeddings in prod. Cosine distances changed, recall tanked. No backfill strategy, so half your index is apples and half is oranges.
  • Query formulation: Users ask for “blue widget refund … policy July 2023.” Your retriever does a naive ANN query. It needs query rewriting, filters, and maybe a BM25 hybrid. Otherwise you retrieve irrelevant snippets and the LLM hallucinates to bridge gaps.
  • Chunking and context tax: Aggressive chunking bloats context. Top 8 chunks yield too much noise. P95 token count doubles and output quality drops.
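The context tax above can be bounded by selecting chunks under an explicit token budget instead of a fixed top-k. A minimal sketch, assuming a best-first ranked list and a crude whitespace token estimate (swap in a real tokenizer in practice):

```python
def select_chunks(ranked_chunks, max_context_tokens=2000):
    """Greedily take the highest-ranked chunks until the token budget is spent.

    ranked_chunks: list of (text, score) pairs, sorted best-first.
    Uses a crude whitespace token estimate; use a real tokenizer in production.
    """
    selected, used = [], 0
    for text, _score in ranked_chunks:
        cost = len(text.split())  # rough token estimate
        if used + cost > max_context_tokens:
            continue  # skip chunks that would blow the budget
        selected.append(text)
        used += cost
    return selected
```

With a budget instead of a fixed k, a run of short, relevant chunks gets through while one bloated chunk cannot double your P95 token count.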

Prompt and tool contracts that degrade silently

  • Prompt drift: Product teams tweak system prompts. No versioning, no diff, no linkage to performance. You see a slow quality decay that feels random.
  • JSON output without schema enforcement: The LLM returns almost valid JSON. Your parser is strict. Retry storms start. Or worse, you silently drop fields.
  • Tooling contracts: Function calling schemas change. Downstream tools expect snake_case. The model returns camelCase. You spend a week debugging what looks like “LLM unreliability” but is a schema mismatch.
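A cheap defense against the casing mismatch above is to normalize keys in the model's payload before validating it. A sketch; the helper names are illustrative:

```python
import re

def to_snake_case(key: str) -> str:
    """Convert camelCase or PascalCase keys to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()

def normalize_keys(payload):
    """Recursively snake_case every dict key in a model's tool payload."""
    if isinstance(payload, dict):
        return {to_snake_case(k): normalize_keys(v) for k, v in payload.items()}
    if isinstance(payload, list):
        return [normalize_keys(v) for v in payload]
    return payload
```

This does not replace contract tests, but it turns a week of "LLM unreliability" debugging into a one-line normalization step.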

Autoscaling and rate limits that create feedback loops

  • Cold starts: You scale the app layer fast, but not the retriever or the feature store. The queue backs up. Users retry. You hit provider rate limits. Retries pile on. Now everything is slow.
  • Global retry knobs: One exponential backoff setting shared everywhere. You double traffic during incidents and burn your vendor quota.
  • Lack of isolation: Background jobs and interactive traffic share the same pool and limits. Batch jobs steal your P95 budget.

Tail latency and variability

  • No per-stage budgets: The RAG pipeline has five hops and three models. Nobody owns a latency budget per hop. You target 2.5 seconds end to end and end up with 5.
  • Over-serialization: Synchronous calls that could be parallel. Most pipelines can prefetch embeddings and tools while streaming the first tokens.

Evaluation and offline metrics that lie

  • Golden sets too small or too clean: You have 150 examples that the team knows by heart. They are now unit tests, not evaluation.
  • Wrong metrics: You track only answer correctness. You ignore grounding coverage, refusal rates, or action success. You cherry-pick demos.

Practical fixes that actually work

Put SLOs on the system, not just the model

  • Define budgets per stage: retrieval 250 ms P95, reasoning model 1.2 s P95, re-rank 150 ms P95. Enforce with timeouts and fallbacks.
  • Track useful outcomes: cost per successful task, grounded answer rate, tool success rate, and containment rate. Quality without cost is a demo. Cost without quality is churn.
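The per-stage budgets above are easy to enforce mechanically: wrap each stage in a timeout and return a defined fallback when it blows its budget. A sketch; the stage names, budgets, and fallback policy are illustrative:

```python
import asyncio

STAGE_BUDGETS = {"retrieval": 0.25, "rerank": 0.15}  # seconds, per-stage targets

async def with_budget(stage: str, coro, fallback):
    """Run a stage under its latency budget; on timeout, return the fallback."""
    try:
        return await asyncio.wait_for(coro, timeout=STAGE_BUDGETS[stage])
    except asyncio.TimeoutError:
        return fallback

async def slow_rerank(docs):
    await asyncio.sleep(1.0)  # simulates a re-ranker blowing its budget
    return list(reversed(docs))

async def handle(docs):
    # If re-rank misses its 150 ms budget, keep the original retrieval order.
    return await with_budget("rerank", slow_rerank(docs), fallback=docs)
```

The important part is that the fallback is a product decision made ahead of time, not an exception handler improvised during an incident.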

Make retrieval boring and predictable

  • Version embeddings intentionally: keep model_id per vector. Backfill new embeddings in a side index. Shadow-traffic evaluate, then flip routing per tenant or task.
  • Use hybrid retrieval: ANN plus keyword/BM25 for long-tail queries. Add a simple reranker with a strict latency cap.
  • Keep chunks tight: 200 to 500 tokens per chunk is fine for most domains. Use a max-context budget, not only top-k.
  • Log retrieval truth: store query, filters, top-k ids, distances, and final chosen context with a trace id. Makes debugging fast.
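One simple way to combine the ANN and BM25 result lists is reciprocal rank fusion, which merges rankings without having to calibrate the two score scales against each other. A sketch; k=60 is the constant commonly used in the RRF literature:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc ids; each doc scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists float to the top; long-tail keyword matches the ANN index misses still make it into the candidate set.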

Treat prompts and tools like code

  • Prompt versioning: put templates in git with ids. Every request logs prompt_id, parameters, and model version. Roll back like you would code.
  • Enforce output schemas: use JSON Schema guided generation or function calling. Add tolerant parsers and repair steps. Record parse_error_rate by model and prompt version.
  • Contract tests for tools: before deploy, run contract tests that call the model with fixed seeds and validate the tool payloads.
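Even without constrained decoding, the schema-enforcement point can be approximated with a tolerant parser plus cheap repair steps, recording every failure for the parse_error_rate metric. A stdlib-only sketch; the two repairs shown are common failure modes, not an exhaustive list:

```python
import json
import re

def parse_model_json(raw: str, required_keys: set[str]):
    """Tolerantly parse almost-valid model JSON; return (payload, error)."""
    text = raw.strip()
    # Repair 1: strip markdown code fences the model sometimes wraps around JSON.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Repair 2: drop trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"parse_error: {exc}"
    missing = required_keys - payload.keys()
    if missing:
        return None, f"schema_error: missing {sorted(missing)}"
    return payload, None
```

Returning the error instead of raising keeps the retry decision with the caller, which is where the retry budget lives.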

Build for vendor volatility

  • Multi-endpoint routing: same model across 2 to 3 regions or vendors. Health check and cost-aware routing. Per-endpoint circuit breakers.
  • Degrade gracefully: if re-rank times out, skip it. If retrieve is slow, reduce top-k. If the main model is down, route to a cheaper fallback with tighter instructions.
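Per-endpoint circuit breakers need very little machinery. A minimal sketch, assuming illustrative thresholds and leaving the routing layer to call allow() before each request:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; probe after `cooldown` s."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True
        # Open: only allow a probe request once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

One instance per endpoint is enough for the routing layer to stop hammering a degraded vendor and shift traffic to the healthy one.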

Control retries and backpressure

  • Idempotency keys across the stack: dedupe at the queue and at your API gateway. Retries should not multiply work.
  • Separate queues and pools: interactive traffic on a higher priority queue and isolated workers. Batch on a different pool and limit.
  • Token-bucket rate limiting per tenant and per tool. Keep a fast, in-memory limiter at the edge.
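The token-bucket limiter above is a few lines of state per tenant. A sketch of the in-memory edge version:

```python
import time

class TokenBucket:
    """Allow `rate` requests/s with bursts up to `capacity`, e.g. per tenant."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Keep one bucket per (tenant, tool) pair at the edge, and the batch pool gets its own buckets so it can never spend the interactive P95 budget.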

Latency and cost hygiene

  • Stream early: start streaming output while fetching secondary facts or tools. Most users prefer 1.5 s to first token and 3.5 s total over a silent 2.8 s.
  • Cache aggressively where it matters: a prompt+retrieval cache with a short TTL can cut cost 20 to 40 percent on repeated intents. Layer it: in-memory for milliseconds, Redis for seconds, a vector cache for minutes.
  • Model multiplexing: route short queries to small models, long reasoning to bigger ones. Use a lightweight classifier or rule to pick. Cap max tokens.
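The prompt-plus-retrieval cache can start as an in-process dict with a short TTL; Redis drops in behind the same interface for the seconds layer. A sketch, where the key scheme is an assumption: hash whatever uniquely identifies prompt, parameters, and retrieved context:

```python
import hashlib
import time

class TTLCache:
    """Tiny in-memory cache; entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self.store = {}  # key -> (expires_at, value)

    def key(self, prompt: str, context: str) -> str:
        # Hash of the rendered prompt and retrieved context identifies an intent.
        return hashlib.sha256(f"{prompt}\x00{context}".encode()).hexdigest()

    def get(self, key: str):
        entry = self.store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        self.store.pop(key, None)  # expired or missing
        return None

    def put(self, key: str, value):
        self.store[key] = (time.monotonic() + self.ttl, value)
```

Log the hit rate per layer from day one; it is the denominator in the blended cost numbers later in this post.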

Observability that actually helps you fix things

  • Trace at the request level: one trace id from API to retrieval to model to tools. Include latencies, token counts, provider endpoint, cache hits, and cost per stage.
  • Prompt fingerprinting: hash of the final rendered prompt. When quality dips, you can correlate to the exact template change.
  • Retrieval quality dashboards: recall@k on known-answer sets, grounding coverage per domain, and a drift chart when embeddings or chunking change.
  • Post-deploy canaries: 1 to 5 percent of traffic on new prompt or model versions with automated alarms on P95, cost per success, and refusal rate.
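Prompt fingerprinting is one hash call; the value comes from attaching it to the per-stage trace record alongside latency, tokens, and cost. A sketch of such a record, with illustrative field names:

```python
import hashlib
import json
import time

def fingerprint(rendered_prompt: str) -> str:
    """Stable short hash of the final rendered prompt, for correlating quality dips."""
    return hashlib.sha256(rendered_prompt.encode()).hexdigest()[:16]

def stage_record(trace_id, stage, started, tokens_in, tokens_out, cost_usd, **extra):
    """One structured log line per pipeline stage, joined by trace_id."""
    return json.dumps({
        "trace_id": trace_id,
        "stage": stage,
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
        **extra,  # e.g. prompt_fp, model_version, cache_hit, endpoint
    })
```

When quality dips, you group by prompt_fp and the exact template change that caused it falls out of the dashboard.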

Evaluation that resists optimism

  • Build a messy eval set: adversarial queries, long-tail, screenshots turned to OCR text, typos. Curate per domain.
  • Online checks: automated judges are fine for trend detection, not truth. Sample and review weekly. Tie feedback to traces.
  • Track containment: for assistants, measure tasks completed without human escalation. That metric pays the bills.

Business impact with real numbers

  • Unit economics: If your average request uses 2.5k input tokens and 800 output tokens on a mid-tier model at 3 dollars per 1M tokens in and 15 dollars per 1M out, you are at roughly 0.0195 dollars per request. Add retrieval and re-rank at 0.002 dollars. Now add 20 percent retry overhead and a 30 percent cache hit rate. Your real blended cost swings between roughly 0.015 and 0.026 dollars per request. That variance decides gross margin at scale.
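That arithmetic is worth encoding so finance and engineering argue about inputs, not spreadsheets. A sketch using the same illustrative prices; the retry and cache adjustments are simplifying assumptions (retries re-run the model, cache hits skip it entirely):

```python
def cost_per_request(tokens_in, tokens_out, price_in_per_m, price_out_per_m,
                     overhead_usd=0.0, retry_overhead=0.0, cache_hit_rate=0.0):
    """Blended cost per request: model tokens plus fixed overhead, adjusted
    for retries (which multiply model work) and cache hits (which skip it)."""
    model = tokens_in * price_in_per_m / 1e6 + tokens_out * price_out_per_m / 1e6
    model *= (1 + retry_overhead)   # retries re-run the model call
    model *= (1 - cache_hit_rate)   # cache hits skip it entirely
    return model + overhead_usd

# 2.5k in / 800 out at $3 and $15 per 1M tokens = $0.0195 before overhead.
base = cost_per_request(2500, 800, 3, 15)
```

Plugging in the retry and cache-hit knobs reproduces the blended range above, and makes it obvious which knob moves margin the most.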
  • Latency and conversion: Moving P99 from 7.8 s to 3.2 s typically lifts completion and NPS. In one case we cut P99 to 2.1 s by capping context, parallelizing tools, and adding a warm pool. That alone paid for the rework in two weeks.
  • Vendor risk: A single-region outage can wipe a day’s revenue. Routing across two providers with circuit breakers added 0.002 dollars per request in baseline cost and saved a full incident last quarter.

Key takeaways

  • Your architecture is not wrong. It is missing budgets, guardrails, and observability.
  • Retrieval quality is a first-class SLO. Treat embeddings and chunking as versions with migration plans.
  • Prompts and tool schemas need the same discipline as APIs. Version them and test contracts.
  • Control retries and queueing or you will amplify every minor incident.
  • Track cost per successful task, not just per request. That is the number the CFO cares about.
  • Proactive tracing and canaries beat postmortems.

If you want a second set of eyes

If this felt uncomfortably familiar, good. These are fixable problems. I help teams put budgets, routing, and observability in place so the nice diagram turns into a reliable system. If you are hitting tail latency, erratic quality, or unpredictable costs, this is exactly the kind of thing I fix when systems buckle at scale.