{"id":32,"date":"2025-08-11T09:37:24","date_gmt":"2025-08-11T09:37:24","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/08\/11\/scaling-genai-from-poc-to-production\/"},"modified":"2026-04-10T19:29:24","modified_gmt":"2026-04-10T19:29:24","slug":"scaling-genai-from-poc-to-production","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/08\/11\/scaling-genai-from-poc-to-production\/","title":{"rendered":"Scaling GenAI from PoC to Production: What Breaks and How to Fix It"},"content":{"rendered":"<h2>The uncomfortable gap between a great demo and a stable product<\/h2>\n<p>The PoC nails a few curated prompts. The team celebrates. Two weeks later the <a href=\"https:\/\/angirash.in\/blog\/2025\/09\/09\/why-most-enterprise-ai-pilots-fail\/\">first production users show up<\/a> and everything slows, costs spike, and the model gives confident nonsense the moment someone asks outside the happy path. I see this pattern in almost every GenAI rollout. <a href=\"https:\/\/angirash.in\/blog\/2025\/07\/22\/ai-demo-trap-business-value\/\">The PoC was a toy. Production is an ecosystem<\/a>.<\/p>\n<h2>Where it goes wrong and why<\/h2>\n<p>Here are the recurring failure points I keep finding:<\/p>\n<ul>\n<li>Latency whiplash: p50 looks fine in the lab, p95 in prod is 5 to 8 seconds because <a href=\"https:\/\/angirash.in\/blog\/2025\/04\/21\/llm-latency-in-production-what-actually-works\/\">retrieval, tool calls, and guardrails all stack<\/a>. Most teams only measured the model call.<\/li>\n<li>Unbounded context: prompt builders keep appending system messages, safety preambles, and 20 chunks of context. Token counts explode, cost doubles, latency follows.<\/li>\n<li>Retrieval that \u201cpasses a demo\u201d but fails at recall: <a href=\"https:\/\/angirash.in\/blog\/2025\/02\/24\/chunking-that-actually-improves-retrieval\/\">poor chunking, missing metadata filters, and no re<\/a>-ranking. Users ask slightly different questions and get off-target answers.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/03\/18\/versioning-in-llm-systems-what-actually-matters\/\">No versioning discipline<\/a>: prompts, tools, and routing logic change without traceability. Rollbacks are guesswork.<\/li>\n<li>Provider roulette: a single LLM vendor outage or rate limit stalls your product because there is no routing, retry, or fallback plan.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/11\/12\/ai-observability-stop-guessing-start-instrumenting\/\">Missing observability: logs have free text and screenshots<\/a> instead of structured spans with token counts, citations, and model choices. You cannot debug or run proper A\/Bs.<\/li>\n<\/ul>\n<p>Why this happens:<\/p>\n<ul>\n<li>PoCs are optimized for a demo script, not a latency budget or SLO.<\/li>\n<li>Teams underestimate orchestration costs. The model call is the tip of the iceberg.<\/li>\n<li>Lack of golden test sets. Without a stable yardstick, you chase vibes.<\/li>\n<li>Product pressure. People ship before the evaluation and release process exist.<\/li>\n<\/ul>\n<p>What most teams misunderstand:<\/p>\n<ul>\n<li>RAG is not a single component. Indexing strategy, chunking, metadata hygiene, and re-ranking matter more than the vector DB brand.<\/li>\n<li>\u201cA stronger model will fix it\u201d is only sometimes true. Routing, caching, and retrieval quality usually move the needle more and cost less.<\/li>\n<li>Tool calling is a reliability problem, not just a capability. Idempotency and timeouts beat clever prompt tricks.<\/li>\n<\/ul>\n<h2>Deep dive: the production shape of a GenAI system<\/h2>\n<p>Here is the baseline architecture I deploy for real products, not prototypes:<\/p>\n<ul>\n<li>Edge: API gateway with per-tenant auth, quotas, and feature flags. Request-level IDs.<\/li>\n<li>Orchestrator: prompt construction, retrieval, tool calling, model routing, tracing. Use a workflow engine or a thin orchestrator library, not ad hoc glue.<\/li>\n<li>Retrieval: hybrid search (BM25 + embeddings), structured metadata filters, optional re-ranker model. Separate read and write indexes.<\/li>\n<li>Models: at least two providers, plus a small fast model for triage. Consistent interface and strict timeouts.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/07\/14\/caching-strategies-for-llm-systems-that-actually-work\/\">Caches: exact cache for deterministic prompts, semantic cache for near-duplicates<\/a>, and a document-level cache for snippets.<\/li>\n<li>Safety and compliance: PII redaction pre-index and pre-prompt, content policies both pre and post generation.<\/li>\n<li>Observability: OpenTelemetry spans, structured events with prompt hash, model name, tokens in\/out, retrieval stats, safety flags, and p95s per step.<\/li>\n<li>Data loop: human feedback store, golden set generator, offline eval runner, online A\/B harness.<\/li>\n<\/ul>\n<p>Trade-offs you should decide explicitly:<\/p>\n<ul>\n<li>Hosted vs self-hosted models: hosted for speed of iteration and elasticity, self-hosted when you have stable traffic and strict data controls. Break-even usually appears north of 50 to 100 tokens per second sustained with predictable load.<\/li>\n<li>Retrieval vs fine-tuning: RAG first for changing corpora. Fine-tune when task style is stable and you need latency reduction or deterministic formatting. Often both.<\/li>\n<li>Sync vs async: interactive flows should stream partials and cap p95 below 3 seconds. Everything else should be a job with callbacks or websockets.<\/li>\n<li>Indexing strategy: small granular chunks help recall but hurt precision and cost. I usually start with 400 to 800 tokens per chunk, overlap 60 to 100 tokens, then add a cross-encoder re-ranker.<\/li>\n<\/ul>\n<p>Common failure modes:<\/p>\n<ul>\n<li>Cascading timeouts: provider throttle causes slow retries which hold connections and trip your own timeouts.<\/li>\n<li>Context poisoning: wrong or duplicated snippets get injected due to poor de-duplication. Model sounds confident but cites the same paragraph three times.<\/li>\n<li>Token blowups: internal templates quietly add thousands of tokens. Multiply by N retrieved docs and you pay for it.<\/li>\n<li>Tool call deadlocks: external APIs do not respond, the LLM keeps trying alternative plans, and you loop without a hard cap.<\/li>\n<\/ul>\n<h2>Practical fixes that consistently work<\/h2>\n<h3>Set real budgets, then design to them<\/h3>\n<ul>\n<li>Latency budget: allocate time per step. Example for interactive Q&amp;A p95 3.0s budget:\n<ul>\n<li>Retrieval 400 ms<\/li>\n<li>Re-rank 300 ms<\/li>\n<li>LLM generation 1.8 s<\/li>\n<li>Safety and formatting 300 ms<\/li>\n<li>Overhead 200 ms<\/li>\n<\/ul>\n<\/li>\n<li>Cost budget: set max tokens in and out per request. Enforce server-side, not just by convention.<\/li>\n<li>Availability SLO: 99.9 percent by quarter with error budgets. This forces fallbacks and graceful degradation.<\/li>\n<\/ul>\n<h3>Make routing a first-class feature<\/h3>\n<ul>\n<li>Use a small classifier to tag question type and difficulty. If retrieval confidence is high and format is simple, route to a smaller cheaper model. If low confidence or complex tool use, escalate.<\/li>\n<li>Add retry tiers by provider with strict deadlines. Example: 800 ms to primary, 400 ms to secondary, then degrade to a fast summary.<\/li>\n<li>Expect 20 to 50 percent cost reduction once routing plus caching are stable.<\/li>\n<\/ul>\n<h3>Get retrieval right before tuning anything else<\/h3>\n<ul>\n<li>Chunk by structure, not only token count. Respect headings, tables, and lists.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/10\/02\/hybrid-search-vs-vector-search-production\/\">Hybrid search by default<\/a>. Pure embeddings miss exact term queries and numbers.<\/li>\n<li>Re-rank top 50 hits with a cross-encoder. The lift on groundedness is usually obvious in logs.<\/li>\n<li>Add metadata: source, timestamp, access rights. Use filters to keep the context small and relevant.<\/li>\n<li>Enforce dedupe on near-identical chunks so you do not pay to stuff repeats.<\/li>\n<\/ul>\n<h3>Control tokens like you control memory in a hot path<\/h3>\n<ul>\n<li>Server-side truncation of context with a sliding window and relevance decay.<\/li>\n<li>Stop sequences for formats to avoid rambling. Keep output length predictable.<\/li>\n<li>Summarize long tool outputs before handing them back to the model.<\/li>\n<li>Pre-render stable system prompts once and store a hash. Do not rebuild string blobs on every call.<\/li>\n<\/ul>\n<h3>Caching that actually helps<\/h3>\n<ul>\n<li>Exact cache for idempotent tasks like classification or extraction with fixed prompts.<\/li>\n<li>Semantic cache keyed on a normalized query and a content hash. Evict on index updates.<\/li>\n<li>Snippet cache for heavy re-rankers and slow sources. Invalidate by document version.<\/li>\n<\/ul>\n<h3>Reliability patterns<\/h3>\n<ul>\n<li>Timeouts and circuit breakers per provider and per tool. Backoff with jitter and a hard cap on attempts.<\/li>\n<li>Idempotency keys on tool calls so retries do not duplicate side effects.<\/li>\n<li>Fallback graphs: when retrieval is empty, answer with a clarifying question or a fast search rather than hallucinating.<\/li>\n<li>Kill switch for new prompt versions using feature flags. Rollbacks in seconds, not hours.<\/li>\n<\/ul>\n<h3>Observability and evaluation that pay for themselves<\/h3>\n<ul>\n<li>Trace every step with spans: retrieval candidates, chosen snippets, token counts, routing decision, model latency.<\/li>\n<li>Store prompt and response hashes with version IDs. Keep raw text behind an access gate for privacy.<\/li>\n<li>Offline golden sets: 200 to 1000 prompts per use case, curated and versioned. Score for correctness, citation accuracy, format.<\/li>\n<li>Online A\/B: ship new prompt or router logic to 5 to 10 percent traffic. Watch p95 latency, containment rate, user edits, and support tickets. Kill fast if regressions pop.<\/li>\n<\/ul>\n<h3>Security and compliance early, not later<\/h3>\n<ul>\n<li>PII redaction before indexing and again before prompting. Encrypt embeddings at rest.<\/li>\n<li>Tenant isolation at the index and cache layers. No shared keys that let a misconfig bleed data across customers.<\/li>\n<li>Moderation in and out for public-facing flows. Keep logs safe and scrubbed.<\/li>\n<\/ul>\n<h3>Release process that beats fire drills<\/h3>\n<ul>\n<li>Shadow mode first, then canary, then gradual rollout. Tie every release to versioned prompts and routing rules.<\/li>\n<li>Backward-compatible tool schemas for at least one release wave. Add fields with defaults, do not rename blindly.<\/li>\n<\/ul>\n<h2>Business impact in hard numbers<\/h2>\n<p>Here is what I typically see after these changes:<\/p>\n<ul>\n<li>Cost: 30 to 60 percent reduction from routing, caching, and token caps. Another 10 to 20 percent with retrieval tightening and summarization.<\/li>\n<li>Latency: p95 down by 35 to 50 percent when you enforce step budgets and add streaming. Perceived speed often improves more.<\/li>\n<li>Reliability: user-visible error rate cut by 50 to 80 percent once you add circuit breakers and fallback graphs.<\/li>\n<li>Headcount: less flailing. A small platform team can support multiple product teams because releases and evals are standardized.<\/li>\n<\/ul>\n<p>This is not theory. These are repeatable gains when you move from a prompt playground to a system with budgets and guardrails.<\/p>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Treat LLM calls as one node in a graph, not the product. Orchestration, retrieval, and safety drive most of the risk.<\/li>\n<li>Set hard budgets for latency and tokens. Design backward from them.<\/li>\n<li>Build model routing and provider failover on day one of production work.<\/li>\n<li>Fix retrieval quality before tuning models. Re-ranking and metadata filters pay off fast.<\/li>\n<li>Version prompts, tools, and routing. Ship behind flags. Keep a kill switch.<\/li>\n<li>Observe everything with structured spans. If you cannot see it, you cannot improve it.<\/li>\n<\/ul>\n<h2>If you are hitting these walls<\/h2>\n<p>If your PoC stalls at real traffic, or your costs are outpacing adoption, I help teams put these systems in place and stop firefighting. Happy to look at your traces, routing logic, or retrieval setup and give a blunt assessment. This is exactly the kind of thing I fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The uncomfortable gap between a great demo and a stable product The PoC nails a few curated prompts. The team celebrates. Two weeks later the first production users show up&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[17,16,21],"class_list":["post-32","post","type-post","status-publish","format-standard","hentry","category-genai-production","tag-ai-cost","tag-ai-scalability","tag-llmops"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/32","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=32"}],"version-history":[{"count":2,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/32\/revisions"}],"predecessor-version":[{"id":198,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/32\/revisions\/198"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=32"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=32"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=32"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}