Scaling GenAI from PoC to Production: What Breaks and How to Fix It

The uncomfortable gap between a great demo and a stable product

The PoC nails a few curated prompts. The team celebrates. Two weeks later the first production users show up and everything slows, costs spike, and the model gives confident nonsense the moment someone asks outside the happy path. I see this pattern in almost every GenAI rollout. The PoC was a toy. Production is an ecosystem.

Where it goes wrong and why

Here are the recurring failure points I keep finding:

  • Latency whiplash: p50 looks fine in the lab, p95 in prod is 5 to 8 seconds because retrieval, tool calls, and guardrails all stack. Most teams only measured the model call.
  • Unbounded context: prompt builders keep appending system messages, safety preambles, and 20 chunks of context. Token counts explode, cost doubles, latency follows.
  • Retrieval that “passes a demo” but fails at recall: poor chunking, missing metadata filters, and no re-ranking. Users ask slightly different questions and get off-target answers.
  • No versioning discipline: prompts, tools, and routing logic change without traceability. Rollbacks are guesswork.
  • Provider roulette: a single LLM vendor outage or rate limit stalls your product because there is no routing, retry, or fallback plan.
  • Missing observability: logs have free text and screenshots instead of structured spans with token counts, citations, and model choices. You cannot debug or run proper A/Bs.

Why this happens:

  • PoCs are optimized for a demo script, not a latency budget or SLO.
  • Teams underestimate orchestration costs. The model call is the tip of the iceberg.
  • Lack of golden test sets. Without a stable yardstick, you chase vibes.
  • Product pressure. People ship before the evaluation and release processes exist.

What most teams misunderstand:

  • RAG is not a single component. Indexing strategy, chunking, metadata hygiene, and re-ranking matter more than the vector DB brand.
  • “A stronger model will fix it” is only sometimes true. Routing, caching, and retrieval quality usually move the needle more and cost less.
  • Tool calling is a reliability problem, not just a capability. Idempotency and timeouts beat clever prompt tricks.

Deep dive: the production shape of a GenAI system

Here is the baseline architecture I deploy for real products, not prototypes:

  • Edge: API gateway with per-tenant auth, quotas, and feature flags. Request-level IDs.
  • Orchestrator: prompt construction, retrieval, tool calling, model routing, tracing. Use a workflow engine or a thin orchestrator library, not ad hoc glue.
  • Retrieval: hybrid search (BM25 + embeddings), structured metadata filters, optional re-ranker model. Separate read and write indexes.
  • Models: at least two providers, plus a small fast model for triage. Consistent interface and strict timeouts.
  • Caches: exact cache for deterministic prompts, semantic cache for near-duplicates, and a document-level cache for snippets.
  • Safety and compliance: PII redaction pre-index and pre-prompt, content policies both pre and post generation.
  • Observability: OpenTelemetry spans, structured events with prompt hash, model name, tokens in/out, retrieval stats, safety flags, and p95s per step.
  • Data loop: human feedback store, golden set generator, offline eval runner, online A/B harness.
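To make the observability piece concrete, here is a minimal sketch of the kind of structured event I emit per pipeline step. It hashes the prompt instead of logging raw text, and the field names (`step`, `prompt_hash`, `retrieval_hits`, and so on) are illustrative, not a standard schema; in practice these become attributes on OpenTelemetry spans.

```python
import hashlib
import json
import time

def make_step_event(step, prompt, model, tokens_in, tokens_out, extra=None):
    """Build one structured trace event for a pipeline step.

    Stores a prompt hash rather than raw text, so logs stay
    searchable and diffable without leaking prompt contents.
    """
    event = {
        "step": step,
        "ts_ms": int(time.time() * 1000),
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }
    if extra:
        event.update(extra)  # e.g. retrieval stats, safety flags
    return event

event = make_step_event(
    "generate", "You are a helpful assistant...", "small-fast-model",
    tokens_in=1200, tokens_out=180,
    extra={"retrieval_hits": 8, "safety_flagged": False},
)
print(json.dumps(event))
```

Because every step emits the same shape, you can aggregate p95s per step and join events by prompt hash across releases.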

Trade-offs you should decide explicitly:

  • Hosted vs self-hosted models: hosted for speed of iteration and elasticity, self-hosted when you have stable traffic and strict data controls. Break-even usually appears north of 50 to 100 tokens per second sustained with predictable load.
  • Retrieval vs fine-tuning: RAG first for changing corpora. Fine-tune when task style is stable and you need latency reduction or deterministic formatting. Often both.
  • Sync vs async: interactive flows should stream partials and cap p95 below 3 seconds. Everything else should be a job with callbacks or websockets.
  • Indexing strategy: small granular chunks help recall but hurt precision and cost. I usually start with 400 to 800 tokens per chunk, overlap 60 to 100 tokens, then add a cross-encoder re-ranker.
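The chunking numbers above translate into a few lines of code. This is a sketch of overlap-based chunking only, using whitespace-split words as a stand-in for real tokenizer tokens; production chunking should also respect structure (headings, tables, lists) as noted above.

```python
def chunk_tokens(tokens, size=600, overlap=80):
    """Split a token list into overlapping chunks.

    Defaults sit inside the 400-800 token chunk size and 60-100
    token overlap ranges discussed above.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += size - overlap  # step forward, keeping the overlap
    return chunks

words = [f"t{i}" for i in range(1500)]  # stand-in for tokenizer output
chunks = chunk_tokens(words, size=600, overlap=80)
```

A 1500-token document yields three chunks here, with each chunk's first 80 tokens repeating the previous chunk's tail so sentences spanning a boundary stay retrievable.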

Common failure modes:

  • Cascading timeouts: provider throttle causes slow retries which hold connections and trip your own timeouts.
  • Context poisoning: wrong or duplicated snippets get injected due to poor de-duplication. Model sounds confident but cites the same paragraph three times.
  • Token blowups: internal templates quietly add thousands of tokens. Multiply by N retrieved docs and you pay for it.
  • Tool call deadlocks: external APIs do not respond, the LLM keeps trying alternative plans, and you loop without a hard cap.

Practical fixes that consistently work

Set real budgets, then design to them

  • Latency budget: allocate time per step. Example for an interactive Q&A flow with a 3.0 s p95 budget:
    • Retrieval 400 ms
    • Re-rank 300 ms
    • LLM generation 1.8 s
    • Safety and formatting 300 ms
    • Overhead 200 ms
  • Cost budget: set max tokens in and out per request. Enforce server-side, not just by convention.
  • Availability SLO: 99.9 percent, measured quarterly, with error budgets. This forces fallbacks and graceful degradation.

Make routing a first-class feature

  • Use a small classifier to tag question type and difficulty. If retrieval confidence is high and format is simple, route to a smaller cheaper model. If low confidence or complex tool use, escalate.
  • Add retry tiers by provider with strict deadlines. Example: 800 ms to primary, 400 ms to secondary, then degrade to a fast summary.
  • Expect 20 to 50 percent cost reduction once routing plus caching are stable.
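The routing and retry-tier logic above fits in a few functions. This is a sketch with placeholder thresholds, model names, and a fake client (`flaky`); the real triage classifier and provider clients are yours to supply.

```python
def route(question_type, retrieval_confidence):
    """Pick a model tier from triage signals.

    The 0.75 threshold and model names are placeholders; tune them
    against your own golden set.
    """
    if question_type == "simple" and retrieval_confidence >= 0.75:
        return "small-cheap-model"
    return "strong-model"

def call_with_fallback(providers, call):
    """Try providers in order with strict per-tier deadlines.

    `providers` is a list of (name, deadline_ms); `call` stands in
    for a real client and may raise TimeoutError.
    """
    for name, deadline_ms in providers:
        try:
            return call(name, deadline_ms)
        except TimeoutError:
            continue  # deadline blown, fall through to the next tier
    return "degraded: fast extractive summary"  # last-resort fallback

tiers = [("primary", 800), ("secondary", 400)]

def flaky(name, deadline_ms):
    if name == "primary":
        raise TimeoutError  # simulate a primary-provider stall
    return f"answer from {name}"

result = call_with_fallback(tiers, flaky)
```

The key property is that the fallback path is always defined: a provider outage degrades the answer instead of stalling the product.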

Get retrieval right before tuning anything else

  • Chunk by structure, not only token count. Respect headings, tables, and lists.
  • Hybrid search by default. Pure embeddings miss exact term queries and numbers.
  • Re-rank top 50 hits with a cross-encoder. The lift on groundedness is usually obvious in logs.
  • Add metadata: source, timestamp, access rights. Use filters to keep the context small and relevant.
  • Enforce dedupe on near-identical chunks so you do not pay to stuff repeats.
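One common way to implement the hybrid-search bullet is reciprocal rank fusion, which merges the BM25 and embedding result lists without having to reconcile their score scales. A minimal sketch with made-up document IDs; k=60 is the usual RRF default.

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked doc-id lists with reciprocal rank fusion.

    Each doc scores 1 / (k + rank + 1) per list it appears in, so
    documents ranked well by both retrievers float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]        # exact-term matches
embedding_hits = ["d1", "d9", "d3"]   # semantic matches
fused = rrf_fuse([bm25_hits, embedding_hits])
```

Here `d1` wins because both retrievers rank it highly; feed the fused top 50 into the cross-encoder re-ranker and dedupe before anything reaches the prompt.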

Control tokens like you control memory in a hot path

  • Server-side truncation of context with a sliding window and relevance decay.
  • Stop sequences for formats to avoid rambling. Keep output length predictable.
  • Summarize long tool outputs before handing them back to the model.
  • Pre-render stable system prompts once and store a hash. Do not rebuild string blobs on every call.
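Server-side truncation is mostly a packing problem: fill a fixed token budget with the best-scoring snippets and drop the rest. A sketch, assuming snippets arrive as (relevance score, token count, text) tuples that are already deduplicated.

```python
def pack_context(snippets, max_tokens):
    """Greedily pack snippets into a token budget, best-scored first.

    Whatever does not fit is dropped, which enforces the cap on the
    server rather than by convention in prompt-building code.
    """
    packed, used = [], 0
    for score, n_tokens, text in sorted(snippets, reverse=True):
        if used + n_tokens <= max_tokens:
            packed.append(text)
            used += n_tokens
    return packed, used

snippets = [(0.9, 300, "A"), (0.7, 500, "B"), (0.6, 250, "C")]
packed, used = pack_context(snippets, max_tokens=800)
```

With an 800-token budget, snippet C is dropped even though it is relevant; that is the trade you make to keep cost and latency bounded.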

Caching that actually helps

  • Exact cache for idempotent tasks like classification or extraction with fixed prompts.
  • Semantic cache keyed on a normalized query and a content hash. Evict on index updates.
  • Snippet cache for heavy re-rankers and slow sources. Invalidate by document version.
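The semantic-cache bullet hides one subtle detail: the key must bind the query to the content version it was answered against, or index updates serve stale answers. A sketch of the keying only; a real semantic cache also matches near-duplicate queries via embeddings, which this exact-normalized tier does not cover.

```python
import hashlib

def cache_key(query, index_version):
    """Key a cache entry on a normalized query plus the index version.

    Bumping index_version on re-index invalidates every stale entry
    without scanning the cache.
    """
    normalized = " ".join(query.lower().split())  # case and whitespace
    raw = f"{normalized}|{index_version}"
    return hashlib.sha256(raw.encode()).hexdigest()

k1 = cache_key("  What is   our refund policy? ", "idx-v12")
k2 = cache_key("what is our refund policy?", "idx-v12")
k3 = cache_key("what is our refund policy?", "idx-v13")
```

`k1` and `k2` collide by design (same question, same corpus version), while `k3` misses because the index moved.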

Reliability patterns

  • Timeouts and circuit breakers per provider and per tool. Backoff with jitter and a hard cap on attempts.
  • Idempotency keys on tool calls so retries do not duplicate side effects.
  • Fallback graphs: when retrieval is empty, answer with a clarifying question or a fast search rather than hallucinating.
  • Kill switch for new prompt versions using feature flags. Rollbacks in seconds, not hours.
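Two of these patterns are small enough to sketch inline: backoff with full jitter under a hard attempt cap, and a stable idempotency key for tool-call retries. The base, cap, and attempt numbers are placeholders to tune per provider.

```python
import random

def backoff_delays(base_ms=100, cap_ms=2000, max_attempts=4, rng=random.random):
    """Exponential backoff with full jitter and a hard attempt cap.

    Returns the sleep before each retry; after max_attempts the
    caller gives up instead of holding connections open.
    """
    delays = []
    for attempt in range(max_attempts - 1):  # no sleep after the last try
        ceiling = min(cap_ms, base_ms * (2 ** attempt))
        delays.append(rng() * ceiling)  # full jitter: uniform in [0, ceiling)
    return delays

def idempotency_key(request_id, tool, args_hash):
    """Stable key so a retried tool call cannot duplicate side effects.

    The downstream service should treat a repeated key as a no-op
    and return the original result.
    """
    return f"{request_id}:{tool}:{args_hash}"

delays = backoff_delays(rng=lambda: 0.5)  # fixed rng, just for illustration
key = idempotency_key("req-42", "create_ticket", "ab12cd")
```

With the fixed rng the delays come out to 50, 100, and 200 ms; in production the jitter is what keeps throttled clients from retrying in lockstep.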

Observability and evaluation that pay for themselves

  • Trace every step with spans: retrieval candidates, chosen snippets, token counts, routing decision, model latency.
  • Store prompt and response hashes with version IDs. Keep raw text behind an access gate for privacy.
  • Offline golden sets: 200 to 1000 prompts per use case, curated and versioned. Score for correctness, citation accuracy, format.
  • Online A/B: ship new prompt or router logic to 5 to 10 percent traffic. Watch p95 latency, containment rate, user edits, and support tickets. Kill fast if regressions pop.
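A golden-set runner does not need to be elaborate to be useful. This sketch scores only substring-level correctness against invented cases (`must_contain` is my placeholder schema); a real runner would also score citation accuracy and output format, as listed above.

```python
def run_golden_set(cases, answer_fn):
    """Score a candidate pipeline against a versioned golden set.

    Each case is {"prompt": ..., "must_contain": [...]}; answer_fn
    is the system under test.
    """
    passed, failures = 0, []
    for case in cases:
        answer = answer_fn(case["prompt"])
        if all(s.lower() in answer.lower() for s in case["must_contain"]):
            passed += 1
        else:
            failures.append(case["prompt"])  # surface what to triage
    return {"pass_rate": passed / len(cases), "failures": failures}

golden = [
    {"prompt": "refund window?", "must_contain": ["30 days"]},
    {"prompt": "support email?", "must_contain": ["support@"]},
]
report = run_golden_set(golden, lambda p: "Refunds are accepted within 30 days.")
```

Run this on every prompt or router change before traffic sees it; a pass-rate drop on a stable, versioned set is a far better regression signal than vibes.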

Security and compliance early, not later

  • PII redaction before indexing and again before prompting. Encrypt embeddings at rest.
  • Tenant isolation at the index and cache layers. No shared keys that let a misconfig bleed data across customers.
  • Moderation in and out for public-facing flows. Keep logs safe and scrubbed.
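For orientation, here is the shape of pre-index and pre-prompt redaction. The two regexes are illustration only; production redaction needs a vetted PII detection library, not a pattern list you maintain by hand.

```python
import re

# Minimal patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace PII matches with typed placeholders.

    Run this before indexing and again before prompting, so raw
    identifiers never reach the vector store, the model, or logs.
    """
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

clean = redact("Reach jane.doe@example.com, SSN 123-45-6789.")
```

Typed placeholders like `[EMAIL]` keep the redacted text usable for retrieval and generation, since the model still knows what kind of value was there.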

Release process that beats fire drills

  • Shadow mode first, then canary, then gradual rollout. Tie every release to versioned prompts and routing rules.
  • Backward-compatible tool schemas for at least one release wave. Add fields with defaults, do not rename blindly.

Business impact in hard numbers

Here is what I typically see after these changes:

  • Cost: 30 to 60 percent reduction from routing, caching, and token caps. Another 10 to 20 percent with retrieval tightening and summarization.
  • Latency: p95 down by 35 to 50 percent when you enforce step budgets and add streaming. Perceived speed often improves more.
  • Reliability: user-visible error rate cut by 50 to 80 percent once you add circuit breakers and fallback graphs.
  • Headcount: less flailing. A small platform team can support multiple product teams because releases and evals are standardized.

This is not theory. These are repeatable gains when you move from a prompt playground to a system with budgets and guardrails.

Key takeaways

  • Treat LLM calls as one node in a graph, not the product. Orchestration, retrieval, and safety drive most of the risk.
  • Set hard budgets for latency and tokens. Design backward from them.
  • Build model routing and provider failover on day one of production work.
  • Fix retrieval quality before tuning models. Re-ranking and metadata filters pay off fast.
  • Version prompts, tools, and routing. Ship behind flags. Keep a kill switch.
  • Observe everything with structured spans. If you cannot see it, you cannot improve it.

If you are hitting these walls

If your PoC stalls at real traffic, or your costs are outpacing adoption, I help teams put these systems in place and stop firefighting. Happy to look at your traces, routing logic, or retrieval setup and give a blunt assessment. This is exactly the kind of thing I fix when systems start breaking at scale.