{"id":37,"date":"2025-03-22T14:36:11","date_gmt":"2025-03-22T14:36:11","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/22\/mlops-for-llms-what-actually-matters\/"},"modified":"2026-04-10T19:29:12","modified_gmt":"2026-04-10T19:29:12","slug":"mlops-for-llms-what-actually-matters","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/03\/22\/mlops-for-llms-what-actually-matters\/","title":{"rendered":"MLOps for LLMs: What Actually Matters in Production"},"content":{"rendered":"<h2>The ugly part of LLMs: the system works until it silently doesn&#8217;t<\/h2>\n<p>If your first LLM feature went live and then support tickets tripled, latency wandered, and your cloud bill became a surprise every morning, you are not alone. <a href=\"https:\/\/angirash.in\/blog\/2025\/08\/14\/why-ai-teams-struggle-without-system-design-mindset\/\">The gap is not the model. The gap is everything around it<\/a>. Most teams still run LLMs like slightly fancier classifiers. In production, the change surface is bigger than classic ML: prompts, tools, retrieval, model vendors, policies, and traffic patterns all drift on their own clocks.<\/p>\n<p>This post is the short list of what actually matters to keep LLM systems reliable, fast, and affordable.<\/p>\n<h2>Where teams get burned and why<\/h2>\n<ul>\n<li>Where it shows up\n<ul>\n<li>RAG chat and document QA: quality degrades with index staleness, hallucinations rise under load, cost spikes on long contexts<\/li>\n<li>Agents and tool use: schemas drift, retries loop, incidents are hard to reproduce<\/li>\n<li>Code assistants: structured output breaks on minor prompt edits, safety filters cause timeouts<\/li>\n<\/ul>\n<\/li>\n<li>Why it happens in real systems\n<ul>\n<li>Non-determinism plus vendor drift. Your provider updates a base model on Tuesday, and your prompts behave differently on Wednesday<\/li>\n<li>Retrieval is a second model you operate. 
Most orgs treat it like a cache, not a living system<\/li>\n<li>Tooling adds combinatorial failure modes: partial tool failures, mismatched schemas, rate limits<\/li>\n<li>You are shipping prompts as product logic without versioning or tests<\/li>\n<\/ul>\n<\/li>\n<li>What most teams misunderstand\n<ul>\n<li>Chasing the newest model is not a strategy. Retrieval quality and prompt discipline usually dominate model choice for enterprise tasks<\/li>\n<li>Token cost does not scale linearly with value. 30 to 50 percent of spend is often wasted on context that does not affect the answer<\/li>\n<li>LLM evaluation is not a single metric. You need a layered eval harness and a <a href=\"https:\/\/angirash.in\/blog\/2025\/08\/14\/build-feedback-loops-into-ai-systems\/\">production feedback loop<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Technical deep dive: the real control points<\/h2>\n<h3>The request path that actually matters<\/h3>\n<p>Think of a typical production path:<br \/>\n1) Router and policy gates<br \/>\n2) Retrieval or context construction<br \/>\n3) Planner or tool decision<br \/>\n4) Tool execution and synthesis<br \/>\n5) Output shaping and safety<\/p>\n<p>For each step, define a contract you can trace, evaluate, and roll back.<\/p>\n<ul>\n<li>Router\n<ul>\n<li>Inputs: tenant, task type, budget, risk level<\/li>\n<li>Decision: cheap vs strong model, or deterministic fallback<\/li>\n<li>Failure modes: misrouting high-difficulty tasks to cheap models, budget overruns<\/li>\n<\/ul>\n<\/li>\n<li>Retrieval\n<ul>\n<li>Inputs: queries, filters, index snapshot<\/li>\n<li>Decision: top-k, chunking, reranking, query rewriting<\/li>\n<li>Failure modes: stale index, low recall, over-stuffed contexts, citation mismatch<\/li>\n<\/ul>\n<\/li>\n<li>Tools\n<ul>\n<li>Inputs: function schemas, permissions<\/li>\n<li>Decision: which tool, how many calls, timeout budget<\/li>\n<li>Failure modes: schema drift, infinite tool loops, partial failures, sandbox 
escapes<\/li>\n<\/ul>\n<\/li>\n<li>Synthesis\n<ul>\n<li>Inputs: structured prompts, system instructions<\/li>\n<li>Decision: temperature, max tokens, stop sequences<\/li>\n<li>Failure modes: JSON malformation, verbosity explosions, instruction bleed<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/03\/18\/designing-ai-systems-for-reliability-not-just-accuracy\/\">Safety and output shaping<\/a>\n<ul>\n<li>Inputs: policies, PII and toxicity checks<\/li>\n<li>Decision: block, redact, or re-ask<\/li>\n<li>Failure modes: high false positive rate, latency spikes under load<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Trade-offs you actually face<\/h3>\n<ul>\n<li>Model class vs retrieval quality\n<ul>\n<li>Strong model with weak retrieval often underperforms a smaller model with high recall and clean snippets<\/li>\n<\/ul>\n<\/li>\n<li>Context length vs cost and latency\n<ul>\n<li>Long context is not free. Past a point, more text adds noise and slows down sampling<\/li>\n<\/ul>\n<\/li>\n<li>Agents vs direct RAG\n<ul>\n<li>Agents give flexibility but multiply failure modes. If the task is answerable via single-hop RAG, do not introduce tools<\/li>\n<\/ul>\n<\/li>\n<li>JSON mode vs free text\n<ul>\n<li>Structured output simplifies downstream systems but can elevate failure rate if prompts are not tightly constrained<\/li>\n<\/ul>\n<\/li>\n<li>Centralized vs per-tenant indexes\n<ul>\n<li>Centralized is cheaper and easier to maintain. 
Per-tenant improves privacy and relevance but increases ops overhead<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Failure modes I keep seeing<\/h3>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/06\/15\/common-mistakes-in-ai-architecture-design\/\">Prompt drift: one helpful copy edit tanks output structure<\/a><\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/07\/14\/debugging-ai-systems-harder-than-software\/\">Vendor drift: provider silently updates models<\/a> and your regression tests light up<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/02\/24\/chunking-that-actually-improves-retrieval\/\">Retriever rot: index is stale, synonyms missing, chunking wrong<\/a>, duplicate docs everywhere<\/li>\n<li>Tool schema drift: a field rename sends your agent into a retry loop<\/li>\n<li>Cost creep: longer contexts, nested retries, uncapped max tokens, forgotten streaming<\/li>\n<li>Evaluator brittleness: LLM-as-judge disagrees with users or flips decisions across versions<\/li>\n<\/ul>\n<h2>Practical solutions that hold up in production<\/h2>\n<h3>1) Treat the chain as code. Version everything<\/h3>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/03\/18\/versioning-in-llm-systems-what-actually-matters\/\">Version IDs for: prompts, system instructions, tools and their schemas<\/a>, retriever configuration (top-k, chunk size, reranker), index snapshot hash, model and parameters, safety policies<\/li>\n<li>Attach a run-recipe to every request in logs: all of the above plus request_id and tenant<\/li>\n<li>Store prompts and tool schemas in Git with code review. Any change ships behind a flag<\/li>\n<\/ul>\n<h3>2) Build a layered evaluation harness<\/h3>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/03\/14\/why-your-ai-evaluation-metrics-are-misleading\/\">Offline goldens at three tiers<\/a>\n<ul>\n<li>Unit: structure-only checks. 
JSON validity, required fields, citation presence<\/li>\n<li>Task: accuracy on representative questions with canonical references<\/li>\n<li>Scenario: multi-turn sessions, adversarial queries, rate limits<\/li>\n<\/ul>\n<\/li>\n<li>Use LLM graders but calibrate them\n<ul>\n<li>Sample 10 to 20 percent of evals for human adjudication and compute inter-rater reliability<\/li>\n<li>Freeze the judge model for stability. If you upgrade the judge, re-baseline<\/li>\n<\/ul>\n<\/li>\n<li>Canary and shadow in prod\n<ul>\n<li>1 to 5 percent traffic to the new recipe. Compare win rates, cost, and latency against control per tenant<\/li>\n<li>Shadow route complex tasks only if you can bound cost and protect PII<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>3) Observability that is actually useful<\/h3>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/11\/12\/ai-observability-stop-guessing-start-instrumenting\/\">Span-level tracing for each step<\/a> with consistent IDs. Include token counts, p50\/p95 latency, retry counts, error taxonomies<\/li>\n<li>Content-aware metrics without leaking PII\n<ul>\n<li>Redact at source. Hash or label entities before storage<\/li>\n<li>Log features, not raw text. For example: context length, citation count, tool types used<\/li>\n<\/ul>\n<\/li>\n<li>Drift dashboards\n<ul>\n<li>Win rate vs control, structure error rate, hallucination proxy rate, recall@k for retrieval, safety block rate<\/li>\n<\/ul>\n<\/li>\n<li>Incident triage playbook\n<ul>\n<li>Repro with run-recipe, snapshot index at time of failure, freeze vendor model version if possible<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>4) Release and rollback like you mean it<\/h3>\n<ul>\n<li>Feature flags by tenant, product surface, and task type<\/li>\n<li>Safe defaults and fallback paths\n<ul>\n<li>If JSON validation fails, re-ask once with structure-only prompt or fall back to deterministic template<\/li>\n<\/ul>\n<\/li>\n<li>Budget guards\n<ul>\n<li>Token and time budgets per request. 
Hard stops on max tokens and chain depth<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>5) Cost control that does not wreck quality<\/h3>\n<ul>\n<li>Route by difficulty\n<ul>\n<li>Start with a smaller model with strict budgets. Escalate only on uncertainty or failed structure checks<\/li>\n<\/ul>\n<\/li>\n<li>Prune context\n<ul>\n<li>Deduplicate chunks, cap per-source tokens, prefer rerankers over stuffing<\/li>\n<\/ul>\n<\/li>\n<li>Cache wisely\n<ul>\n<li>Embedding and retrieval caches with TTLs are generally safe. Response caching works only for deterministic, non-personalized tasks<\/li>\n<\/ul>\n<\/li>\n<li>Prompt compaction\n<ul>\n<li>Replace verbose policy text with short, tested rubrics. Move repetitive instructions into system prompts or tool descriptions<\/li>\n<\/ul>\n<\/li>\n<li>Stop early\n<ul>\n<li>Tight stop sequences, low temperature, reasonable max tokens. Stream responses to improve perceived latency<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>6) RAG discipline beats shiny models<\/h3>\n<ul>\n<li>Evaluate retrieval separately\n<ul>\n<li>Track recall@k against labeled queries. If recall@k is below 0.8, <a href=\"https:\/\/angirash.in\/blog\/2025\/05\/08\/when-not-to-use-rag\/\">fix RAG before touching the generator<\/a><\/li>\n<\/ul>\n<\/li>\n<li>Index hygiene\n<ul>\n<li>Chunk size matched to query granularity, aggressive dedupe, metadata filters that reflect real user constraints<\/li>\n<\/ul>\n<\/li>\n<li>Freshness and rebuilds\n<ul>\n<li>Incremental indexing with delta feeds. Alert on ingestion failures. Rebuild cadence tied to content volatility<\/li>\n<\/ul>\n<\/li>\n<li>Query rewriting vs over-stuffing\n<ul>\n<li>Lightweight rewrites often beat 100k-token contexts<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>7) Reliability and safety as engineering, not vibes<\/h3>\n<ul>\n<li>Structured output guarantees\n<ul>\n<li>JSON schema or function calling with strict validators. 
Automatic repair with a budget of a single re-ask<\/li>\n<\/ul>\n<\/li>\n<li>Timeouts, retries, and idempotency for tools\n<ul>\n<li>Exponential backoff with caps. Circuit breakers for flaky integrations<\/li>\n<\/ul>\n<\/li>\n<li>Guardrails with measured impact\n<ul>\n<li>Content filters that run in parallel, not in series, to keep latency stable. Track false positives to avoid user pain<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>8) Team process that survives growth<\/h3>\n<ul>\n<li>Single owner for the LLM recipe. Weekly drift review across prompts, tools, and retrieval<\/li>\n<li>Vendor update watch\n<ul>\n<li>Pin versions where possible. If not, catch deltas in nightly batch evals<\/li>\n<\/ul>\n<\/li>\n<li>Keep product and ops aligned\n<ul>\n<li>Share win rates, cost per task, and top failure themes with PMs. Kill nice-to-haves that burn tokens without moving metrics<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Business impact you can expect if you do this well<\/h2>\n<ul>\n<li>Cost\n<ul>\n<li>20 to 40 percent reduction from routing, pruning, and budgets, without noticeable quality loss<\/li>\n<li>Another 10 to 20 percent from retrieval caching, embedding caching, and dedupe<\/li>\n<\/ul>\n<\/li>\n<li>Performance\n<ul>\n<li>p95 latency stabilizes once you cap chain depth, parallelize safety checks, and remove over-long contexts<\/li>\n<\/ul>\n<\/li>\n<li>Quality and risk\n<ul>\n<li>Most hallucinations in enterprise RAG trace back to retrieval recall and snippet cleanliness. Fixing those reduces escalations materially<\/li>\n<li>Canaries and rollbacks prevent week-long outages from vendor drift and prompt accidents<\/li>\n<\/ul>\n<\/li>\n<li>Scaling risk\n<ul>\n<li>As tenant count grows, versioning and flags prevent cross-tenant regressions. You will not be firefighting silent breakages at midnight<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Version prompts, tools, models, and index snapshots like code. 
Attach run-recipe metadata to every call<\/li>\n<li>Build a layered eval harness with offline goldens, LLM graders you calibrate, and canaries in prod<\/li>\n<li>Observe at the span level. Track token counts, latency, and structure failures, not just success flags<\/li>\n<li>Control cost via routing by difficulty, context pruning, and strict budgets. Caching helps, but cache responses only for deterministic tasks<\/li>\n<li>Fix retrieval before changing models. Retrieval quality dominates outcomes in most enterprise cases<\/li>\n<li>Ship behind flags, roll out by tenant, and keep a deterministic fallback<\/li>\n<\/ul>\n<h2>If you need a hand<\/h2>\n<p>If parts of this feel familiar, that is normal. The fixes are repeatable but require discipline across product, data, and infra. If you&#8217;re running into drift, random outages, or a bill that keeps climbing, this is exactly the kind of thing I help teams fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The ugly part of LLMs: the system works until it silently doesn&#8217;t If your first LLM feature went live and then support tickets tripled, latency wandered, and your cloud bill&#8230; 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[17,22,21],"class_list":["post-37","post","type-post","status-publish","format-standard","hentry","category-mlops-llmops","tag-ai-cost","tag-ai-observability","tag-llmops"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/37","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=37"}],"version-history":[{"count":4,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/37\/revisions"}],"predecessor-version":[{"id":192,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/37\/revisions\/192"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=37"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=37"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=37"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}