{"id":106,"date":"2025-07-14T10:23:45","date_gmt":"2025-07-14T10:23:45","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/what-nobody-tells-you-about-monitoring-llm-systems\/"},"modified":"2025-07-14T10:23:45","modified_gmt":"2025-07-14T10:23:45","slug":"what-nobody-tells-you-about-monitoring-llm-systems","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/07\/14\/what-nobody-tells-you-about-monitoring-llm-systems\/","title":{"rendered":"What nobody tells you about monitoring LLM systems"},"content":{"rendered":"<h2>The quiet failure mode in LLM products<\/h2>\n<p>Most LLM systems do not fail loudly. They drift. Cost creeps, answers get a bit worse, latency tails fatten, and nobody notices until a customer escalates or a monthly bill lands. The charts you already have do not catch it. CPU is fine. Error rates look normal. Your latency P50 is green. Meanwhile P99 users are stuck; your RAG is stuffing 70 percent irrelevant context; JSON is malformed 8 percent of the time so your orchestrator silently retries and doubles latency; and the model provider shipped a new tokenizer that broke your prompt budget.<\/p>\n<p>If any of that stung, good. Monitoring LLM systems is not observability with prompt logs sprinkled on top. It is product telemetry, information retrieval metrics, and model evaluation wired together and kept honest.<\/p>\n<h2>Where this breaks and why teams miss it<\/h2>\n<ul>\n<li>RAG pipelines: Retrieval recall drifts as your corpus changes. Nobody tracks it. Top-k looks the same, quality doesn\u2019t.<\/li>\n<li>Prompt templates: Versioning is ad hoc. A copy update in Marketing inflates token counts and flips output style. Latency jumps 40 percent overnight.<\/li>\n<li>Vendor changes: Model upgrades roll out behind the same API name. Refusal rate goes up. You treat 200 OK as success and miss it.<\/li>\n<li>Agents and tools: Tool-calling loops or partial tool failures hide inside 200 OK. 
The agent \u201chandled\u201d it by apologizing.<\/li>\n<li>Schema outputs: JSON mode is not a contract. Parse error rate is your real reliability metric and it is often double digits at tail.<\/li>\n<li>Cost: Context is cheap in dev and very expensive in prod. Without relevance accounting you spend most of your budget on noise.<\/li>\n<\/ul>\n<p>Teams miss this because they monitor the box, not the task. Infra metrics and generic logs do not tell you if the answer is grounded, if the agent looped, or if context was actually useful.<\/p>\n<h2>The technical core you actually need to monitor<\/h2>\n<p>Think in traces, not requests. A single user request usually fans out: query parsing, retrieval, reranking, prompt assembly, generation, structured parsing, post-processing, tool calls, safety filters. Each needs its own metrics and artifacts.<\/p>\n<h3>Trace model that works in production<\/h3>\n<p>For every request, capture a span graph with these minimum fields:<\/p>\n<ul>\n<li>Request: request_id, user_id, org_id, session_id, feature_flag, locale<\/li>\n<li>Prompting: prompt_template_id, prompt_hash, system_instructions_version, stop_sequences, temperature, max_tokens<\/li>\n<li>Models: gen_model_name+version, embedding_model_name+version, reranker_model+version<\/li>\n<li>Retrieval: index_id, index_build_id, k, filters, retrieved_doc_ids, reranked_scores<\/li>\n<li>Outputs: raw_text, tokens_in, tokens_out, safety_refusal, json_parse_ok, schema_version, tool_calls[], retries_count, cache_hit<\/li>\n<li>Grounding: source_doc_ids used in final answer, grounding_score (if computed)<\/li>\n<li>Outcome: user_clicks, task_success_flag, human_rating (if available), llm_judge_score<\/li>\n<\/ul>\n<p>If you cannot attach index_build_id and prompt_template_id, you cannot audit regressions. You will guess.<\/p>\n<h3>Metrics that matter (and the traps)<\/h3>\n<ul>\n<li>Retrieval quality: recall@k on a moving target. Maintain a rolling canary set per content segment. 
Do not rely on cosine thresholds alone. Track distribution drift of embedding vectors across index builds.<\/li>\n<li>Context efficiency: relevant_tokens \/ context_tokens. In most RAG systems I audit, this is under 40 percent. Set a target and backpressure the retriever if you keep sending fluff.<\/li>\n<li>Groundedness: hallucination proxies via LLM-as-judge are useful only if calibrated. Keep a small human-labeled set to anchor the judge and track its disagreement rate.<\/li>\n<li>Schema conformance: strict JSON parse success rate, not just presence of braces. Track by model version and prompt template. Anything below 98 percent will hurt you under load.<\/li>\n<li>Latency budgets: break down P50\/P95\/P99 across stages. Streaming hides generation latency but not user-perceived time-to-first-token. Watch queue time and provider retry backoff.<\/li>\n<li>Agent health: tool call success rate, loop detection (repeated tool+prompt pairs), depth limit hits, and unresolved tool errors. Treat \u201capologies\u201d as soft failures.<\/li>\n<li>Refusal rate: measure by intent category. A model upgrade that doubles refusals on code generation is a real incident even if HTTP is 200.<\/li>\n<\/ul>\n<h3>Failure modes you will meet<\/h3>\n<ul>\n<li>Tokenizer drift: provider updates shrink your max context in tokens at the same byte size. Prompts get truncated, outputs degrade.<\/li>\n<li>Unicode and markdown edges: code blocks and tables break streaming parsers. Your front end freezes because chunk boundaries landed mid-UTF-8.<\/li>\n<li>Silent RAG decay: new content types enter the index, chunker splits poorly, reranker overfits to headings, recall sinks.<\/li>\n<li>Retry storms: 3 percent schema failures at P99 trigger auto-repair prompts, which hit rate limits, which balloons latency and cost.<\/li>\n<li>Cache poison: naive caching stores outputs that include time-sensitive or user-specific data. 
Later hits replay stale or private content.<\/li>\n<\/ul>\n<h2>Practical ways to fix this without building a SIEM clone<\/h2>\n<h3>1) Build a thin trace schema and stick to it<\/h3>\n<p>You need structured, queryable traces with linked artifacts. Ship them to whatever you already use for logs or a lightweight vector-capable store. Do not store full PII or full documents unless redacted. Mask user input by default. Attach hashes and IDs for lookup.<\/p>\n<h3>2) Create an evaluation supply chain<\/h3>\n<ul>\n<li>Offline golden sets: 200 to 1,000 examples per use case, versioned. Include edge cases and anti-prompts.<\/li>\n<li>Synthetic queries: generate from your corpus and validate a small sample with humans. Good for catching recall decay.<\/li>\n<li>Judges: use LLM-as-judge for scale but calibrate weekly with human disagreements. Track judge drift.<\/li>\n<li>Pairwise testing: when changing prompts or models, run interleaved A\/B with pairwise preference judgments. It surfaces subtle regressions faster than absolute scoring.<\/li>\n<\/ul>\n<h3>3) Monitor RAG like an IR system<\/h3>\n<ul>\n<li>Index health: recall@k against anchors, average reranker margin, doc churn rate, embedding distribution shift (cosine mean\/variance deltas), and orphan chunk rate.<\/li>\n<li>Context density alerts: fire if relevant_tokens\/context_tokens drops by X percent over 500 requests in any segment.<\/li>\n<li>Re-embedding policy: pin embedding model versions. Rebuild on content type shifts, not calendar time. Canary the new build.<\/li>\n<\/ul>\n<h3>4) Treat structured output as a contract<\/h3>\n<ul>\n<li>Validate with a real schema, not a regex. Measure strict conformance, repair attempts, and success after repair.<\/li>\n<li>Use constrained decoding or JSON mode where it works, but still verify. Different providers break differently.<\/li>\n<li>Cap retries and surface partial success to the user when appropriate. 
Hiding retries creates SLO illusions.<\/li>\n<\/ul>\n<h3>5) Budget and guardrails at runtime<\/h3>\n<ul>\n<li>Token budgets per feature and per request. Refuse to append more context if density falls below threshold.<\/li>\n<li>Agent loop breaker: detect repeated tool+instruction cycles and stop with a helpful fallback.<\/li>\n<li>Backpressure: when provider rate limits hit, degrade features predictably instead of cascading retries.<\/li>\n<\/ul>\n<h3>6) Cost telemetry that maps to outcomes<\/h3>\n<ul>\n<li>Cost per successful task, not per request. Segment by prompt template, model, and customer tier.<\/li>\n<li>Waste ledger: tokens on irrelevant context, retries, refusals, and long-form fluff. These buckets usually hide 20 to 40 percent of spend.<\/li>\n<li>Cache ROI: track hit rate and savings net of staleness incidents. If you cannot quantify it, you probably should not cache that path.<\/li>\n<\/ul>\n<h3>7) Vendor hygiene<\/h3>\n<ul>\n<li>Pin model versions. If a provider does not support it, build your own shadowing harness and detect behavioral shifts with small traffic slices.<\/li>\n<li>Classify responses with a refusal detector. Treat policy refusals as typed outcomes, not success.<\/li>\n<li>Keep a compatibility test for new tokenizers and JSON modes. Run before rollout.<\/li>\n<\/ul>\n<h2>Trade-offs you will actually argue about<\/h2>\n<ul>\n<li>More logging vs privacy: log hashes and IDs by default, store full text only for a small, sampled, redacted set. Accept lower forensic power for compliance sanity.<\/li>\n<li>LLM-judge reliance: it is biased but cheap. Calibrate and accept some noise, or pay for slower human review. I usually mix 90 percent judge, 10 percent human.<\/li>\n<li>Reranking depth: better quality, higher latency and cost. Rerank fewer candidates on mobile traffic or low value tiers.<\/li>\n<li>A\/B rigor vs speed: interleaving and pairwise testing are great, but they slow releases. 
For small teams, run short high-signal tests and ship.<\/li>\n<\/ul>\n<h2>Business impact you can forecast<\/h2>\n<ul>\n<li>Cost: trimming irrelevant context by 30 percent typically cuts total spend by 15 to 25 percent, with zero user-visible regression.<\/li>\n<li>Reliability: improving strict JSON conformance from 92 to 98 percent removes most silent retries. Expect 20 to 40 percent reduction in tail latency incidents.<\/li>\n<li>Quality: holding recall@k steady through content growth prevents the slow churn of user trust. If your support search is RAG-backed, this shows up as fewer escalations, not higher CTR.<\/li>\n<li>Vendor risk: model behavior drift without version pinning is an incident waiting to happen. The day it hits, you lose a week.<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Monitor the task, not the box. Groundedness, recall, schema conformance, and agent loops beat CPU and generic error rates.<\/li>\n<li>Version everything that affects behavior: prompts, models, indexes, chunkers, rerankers.<\/li>\n<li>Build a trace with enough structure to recreate a decision. Hashes and IDs, not blobs of text.<\/li>\n<li>Calibrate your LLM judges with a standing human set. Otherwise your metrics will lie.<\/li>\n<li>Cost control comes from relevance and retries, not discount codes from providers.<\/li>\n<li>Expect vendor drift. Pin versions or shadow and detect.<\/li>\n<\/ul>\n<h2>If you need a steady hand<\/h2>\n<p>If this feels familiar and you want to stop guessing, this is the kind of work I do. I help teams put real observability and evaluation around LLM systems, cut wasted spend, and keep quality flat while you scale. Reach out if your graphs are green and your users still complain.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The quiet failure mode in LLM products Most LLM systems do not fail loudly. They drift. 
Cost creeps, answers get a bit worse, latency tails fatten, and nobody notices until&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[17,26,22],"class_list":["post-106","post","type-post","status-publish","format-standard","hentry","category-mlops-llmops","tag-ai-cost","tag-ai-eval","tag-ai-observability"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/106","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=106"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/106\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=106"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=106"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=106"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}