{"id":37,"date":"2025-03-22T14:36:11","date_gmt":"2025-03-22T14:36:11","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/22\/mlops-for-llms-what-actually-matters\/"},"modified":"2026-04-09T23:26:06","modified_gmt":"2026-04-09T23:26:06","slug":"mlops-for-llms-what-actually-matters","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/03\/22\/mlops-for-llms-what-actually-matters\/","title":{"rendered":"MLOps for LLMs: What Actually Matters in Production"},"content":{"rendered":"<h2>The ugly part of LLMs: the system works until it silently doesn&#8217;t<\/h2>\n<p>If your first LLM feature went live and then support tickets tripled, latency wandered, and your cloud bill became a surprise every morning, you are not alone. The gap is not the model. The gap is everything around it. Most teams still run LLMs like slightly fancier classifiers. In production, the change surface is bigger than classic ML: prompts, tools, retrieval, model vendors, policies, and traffic patterns all drift on their own clocks.<\/p>\n<p>This post is the short list of what actually matters to keep LLM systems reliable, fast, and affordable.<\/p>\n<h2>Where teams get burned and why<\/h2>\n<ul>\n<li>Where it shows up\n<ul>\n<li>RAG chat and document QA: quality degrades with index staleness, hallucinations rise under load, cost spikes on long contexts<\/li>\n<li>Agents and tool use: schema drift, retries loop, hard to reproduce incidents<\/li>\n<li>Code assistants: structured output breaks on minor prompt edits, safety filters cause timeouts<\/li>\n<\/ul>\n<\/li>\n<li>Why it happens in real systems\n<ul>\n<li>Non-determinism plus vendor drift. Your provider updates a base model on Tuesday, your prompts behave differently on Wednesday<\/li>\n<li>Retrieval is a second model you operate. 
Most orgs treat it like a cache, not a living system<\/li>\n<li>Tooling adds combinatorial failure modes: partial tool failures, mismatched schemas, rate limits<\/li>\n<li>You are shipping prompts as product logic without versioning or tests<\/li>\n<\/ul>\n<\/li>\n<li>What most teams misunderstand\n<ul>\n<li>Chasing the newest model is not a strategy. Retrieval quality and prompt discipline usually dominate model choice for enterprise tasks<\/li>\n<li>Token cost is not linear with value. 30 to 50 percent of spend is often wasted on context that does not affect the answer<\/li>\n<li>LLM evaluation is not a single metric. You need a layered eval harness and production feedback loop<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Technical deep dive: the real control points<\/h2>\n<h3>The request path that actually matters<\/h3>\n<p>Think of a typical production path:<br \/>\n1) Router and policy gates<br \/>\n2) Retrieval or context construction<br \/>\n3) Planner or tool decision<br \/>\n4) Tool execution and synthesis<br \/>\n5) Output shaping and safety<\/p>\n<p>For each step, define a contract you can trace, evaluate, and roll back.<\/p>\n<ul>\n<li>Router\n<ul>\n<li>Inputs: tenant, task type, budget, risk level<\/li>\n<li>Decision: cheap vs strong model, or deterministic fallback<\/li>\n<li>Failure modes: misrouting high difficulty tasks to cheap models, budget overruns<\/li>\n<\/ul>\n<\/li>\n<li>Retrieval\n<ul>\n<li>Inputs: queries, filters, index snapshot<\/li>\n<li>Decision: top-k, chunking, reranking, query rewriting<\/li>\n<li>Failure modes: stale index, low recall, over-stuffed contexts, citation mismatch<\/li>\n<\/ul>\n<\/li>\n<li>Tools\n<ul>\n<li>Inputs: function schemas, permissions<\/li>\n<li>Decision: which tool, how many calls, timeout budget<\/li>\n<li>Failure modes: schema drift, infinite tool loops, partial failures, sandbox escapes<\/li>\n<\/ul>\n<\/li>\n<li>Synthesis\n<ul>\n<li>Inputs: structured prompts, system instructions<\/li>\n<li>Decision: temperature, 
max tokens, stop sequences<\/li>\n<li>Failure modes: JSON malformation, verbosity explosions, instruction bleed<\/li>\n<\/ul>\n<\/li>\n<li>Safety and output shaping\n<ul>\n<li>Inputs: policies, PII and toxicity checks<\/li>\n<li>Decision: block, redact, or re-ask<\/li>\n<li>Failure modes: high false positive rate, latency spikes under load<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Trade-offs you actually face<\/h3>\n<ul>\n<li>Model class vs retrieval quality\n<ul>\n<li>Strong model with weak retrieval often underperforms a smaller model with high recall and clean snippets<\/li>\n<\/ul>\n<\/li>\n<li>Context length vs cost and latency\n<ul>\n<li>Long context is not free. Past a point, more text adds noise and slows down sampling<\/li>\n<\/ul>\n<\/li>\n<li>Agents vs direct RAG\n<ul>\n<li>Agents give flexibility but multiply failure modes. If the task is answerable via single-hop RAG, do not introduce tools<\/li>\n<\/ul>\n<\/li>\n<li>JSON mode vs free text\n<ul>\n<li>Structured output simplifies downstream systems but can elevate failure rate if prompts are not tightly constrained<\/li>\n<\/ul>\n<\/li>\n<li>Centralized vs per-tenant indexes\n<ul>\n<li>Centralized is cheaper and easier to maintain. Per-tenant improves privacy and relevance but increases ops overhead<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Failure modes I keep seeing<\/h3>\n<ul>\n<li>Prompt drift: one helpful copy edit tanks output structure<\/li>\n<li>Vendor drift: provider silently updates models and your regression tests light up<\/li>\n<li>Retriever rot: index is stale, synonyms missing, chunking wrong, duplicate docs everywhere<\/li>\n<li>Tool schema drift: a field rename boots your agent into a retry loop<\/li>\n<li>Cost creep: longer contexts, nested retries, uncapped max tokens, forgotten streaming<\/li>\n<li>Evaluator brittleness: LLM-as-judge disagrees with users or flips decisions across versions<\/li>\n<\/ul>\n<h2>Practical solutions that hold up in production<\/h2>\n<h3>1) Treat the chain as code. 
Version everything<\/h3>\n<ul>\n<li>Version IDs for: prompts, system instructions, tools and their schemas, retriever configuration (k, chunk, rerank), index snapshot hash, model and parameters, safety policies<\/li>\n<li>Attach a run-recipe to every request in logs: all the above plus request_id and tenant<\/li>\n<li>Store prompts and tool schemas in git with code review. Any change ships behind a flag<\/li>\n<\/ul>\n<h3>2) Build a layered evaluation harness<\/h3>\n<ul>\n<li>Offline goldens at three tiers\n<ul>\n<li>Unit: structure-only checks. JSON validity, required fields, citation presence<\/li>\n<li>Task: accuracy on representative questions with canonical references<\/li>\n<li>Scenario: multi-turn sessions, adversarial queries, rate limits<\/li>\n<\/ul>\n<\/li>\n<li>Use LLM graders but calibrate them\n<ul>\n<li>Sample 10 to 20 percent of evals for human adjudication and compute inter-rater reliability<\/li>\n<li>Freeze the judge model for stability. If you upgrade the judge, re-baseline<\/li>\n<\/ul>\n<\/li>\n<li>Canary and shadow in prod\n<ul>\n<li>1 to 5 percent traffic to the new recipe. Compare win rates, cost, and latency against control per tenant<\/li>\n<li>Shadow route complex tasks only if you can bound cost and protect PII<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>3) Observability that is actually useful<\/h3>\n<ul>\n<li>Span-level tracing for each step with consistent IDs. Include token counts, p50\/p95 latency, retry counts, error taxonomies<\/li>\n<li>Content-aware metrics without leaking PII\n<ul>\n<li>Redact at source. Hash or label entities before storage<\/li>\n<li>Log features, not raw text. 
For example: context length, citation count, tool types used<\/li>\n<\/ul>\n<\/li>\n<li>Drift dashboards\n<ul>\n<li>Win rate vs control, structure error rate, hallucination proxy rate, recall@k for retrieval, safety block rate<\/li>\n<\/ul>\n<\/li>\n<li>Incident triage playbook\n<ul>\n<li>Repro with run-recipe, snapshot index at time of failure, freeze vendor model version if possible<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>4) Release and rollback like you mean it<\/h3>\n<ul>\n<li>Feature flags by tenant, product surface, and task type<\/li>\n<li>Safe defaults and fallback paths\n<ul>\n<li>If JSON validation fails, re-ask once with structure-only prompt or fall back to deterministic template<\/li>\n<\/ul>\n<\/li>\n<li>Budget guards\n<ul>\n<li>Token and time budgets per request. Hard stops on max tokens and chain depth<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>5) Cost control that does not wreck quality<\/h3>\n<ul>\n<li>Route by difficulty\n<ul>\n<li>Start with a smaller model with strict budgets. Escalate only on uncertainty or failed structure checks<\/li>\n<\/ul>\n<\/li>\n<li>Prune context\n<ul>\n<li>Deduplicate chunks, cap per-source tokens, prefer rerankers over stuffing<\/li>\n<\/ul>\n<\/li>\n<li>Cache wisely\n<ul>\n<li>Embed and retrieval caches with TTLs are safe. Response caching works only for deterministic, non-personalized tasks<\/li>\n<\/ul>\n<\/li>\n<li>Prompt compaction\n<ul>\n<li>Replace verbose policy text with short, tested rubrics. Move repetitive instructions into system prompts or tool descriptions<\/li>\n<\/ul>\n<\/li>\n<li>Stop early\n<ul>\n<li>Tight stop sequences, low temperature, reasonable max tokens. Stream responses to improve perceived latency<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>6) RAG discipline beats shiny models<\/h3>\n<ul>\n<li>Evaluate retrieval separately\n<ul>\n<li>Track recall@k against labeled queries. 
If recall is below 0.8, fix RAG before touching the generator<\/li>\n<\/ul>\n<\/li>\n<li>Index hygiene\n<ul>\n<li>Chunk size matched to query granularity, aggressive dedupe, metadata filters that reflect real user constraints<\/li>\n<\/ul>\n<\/li>\n<li>Freshness and rebuilds\n<ul>\n<li>Incremental indexing with delta feeds. Alert on ingestion failures. Rebuild cadence tied to content volatility<\/li>\n<\/ul>\n<\/li>\n<li>Query rewriting vs over-stuffing\n<ul>\n<li>Lightweight rewrites often beat 100k token contexts<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>7) Reliability and safety as engineering, not vibes<\/h3>\n<ul>\n<li>Structured output guarantees\n<ul>\n<li>JSON schema or function calling with strict validators. Automatic repair with a single re-ask budget<\/li>\n<\/ul>\n<\/li>\n<li>Timeouts, retries, and idempotency for tools\n<ul>\n<li>Exponential backoff with caps. Circuit breakers for flaky integrations<\/li>\n<\/ul>\n<\/li>\n<li>Guardrails with measured impact\n<ul>\n<li>Content filters that run in parallel, not serial, to keep latency stable. Track false positives to avoid user pain<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>8) Team process that survives growth<\/h3>\n<ul>\n<li>Single owner for the LLM recipe. Weekly drift review across prompts, tools, and retrieval<\/li>\n<li>Vendor update watch\n<ul>\n<li>Pin versions where possible. If not, catch deltas in nightly batch evals<\/li>\n<\/ul>\n<\/li>\n<li>Keep product and ops aligned\n<ul>\n<li>Share win rates, cost per task, and top failure themes with PMs. 
Kill nice-to-haves that burn tokens without moving metrics<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Business impact you can expect if you do this well<\/h2>\n<ul>\n<li>Cost\n<ul>\n<li>20 to 40 percent reduction from routing, pruning, and budgets, without noticeable quality loss<\/li>\n<li>Another 10 to 20 percent from retrieval and embedding caching and dedupe<\/li>\n<\/ul>\n<\/li>\n<li>Performance\n<ul>\n<li>p95 latency stabilizes once you cap chain depth, parallelize safety checks, and remove over-long contexts<\/li>\n<\/ul>\n<\/li>\n<li>Quality and risk\n<ul>\n<li>Most hallucinations in enterprise RAG track to retrieval recall and snippet cleanliness. Fixing those drops escalations materially<\/li>\n<li>Canary and rollback stop week-long outages from vendor drifts and prompt accidents<\/li>\n<\/ul>\n<\/li>\n<li>Scaling risk\n<ul>\n<li>As tenant count grows, versioning and flags prevent cross-tenant regressions. You will not be firefighting silent breakages at midnight<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Version prompts, tools, models, and index snapshots like code. Attach run-recipe metadata to every call<\/li>\n<li>Build a layered eval harness with offline goldens, LLM graders you calibrate, and canary in prod<\/li>\n<li>Observe at the span level. Track token, latency, and structure failures, not just success flags<\/li>\n<li>Control cost via routing by difficulty, context pruning, and strict budgets. Caching helps, but only where deterministic<\/li>\n<li>Fix retrieval before changing models. RAG quality dominates outcomes in most enterprise cases<\/li>\n<li>Ship behind flags, roll out by tenant, and keep a deterministic fallback<\/li>\n<\/ul>\n<h2>If you need a hand<\/h2>\n<p>If parts of this feel familiar, that is normal. The fixes are repeatable but require discipline across product, data, and infra. 
If you&#8217;re running into drift, random outages, or a bill that keeps climbing, this is exactly the kind of thing I help teams fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The ugly part of LLMs: the system works until it silently doesn&#8217;t If your first LLM feature went live and then support tickets tripled, latency wandered, and your cloud bill&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[17,22,21],"class_list":["post-37","post","type-post","status-publish","format-standard","hentry","category-mlops-llmops","tag-ai-cost","tag-ai-observability","tag-llmops"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/37","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=37"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/37\/revisions"}],"predecessor-version":[{"id":81,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/37\/revisions\/81"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=37"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=37"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=37"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}