{"id":33,"date":"2025-11-12T14:37:22","date_gmt":"2025-11-12T14:37:22","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/08\/12\/ai-observability-stop-guessing-start-instrumenting\/"},"modified":"2026-04-09T23:25:36","modified_gmt":"2026-04-09T23:25:36","slug":"ai-observability-stop-guessing-start-instrumenting","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/11\/12\/ai-observability-stop-guessing-start-instrumenting\/","title":{"rendered":"AI Observability: Stop Guessing, Start Instrumenting"},"content":{"rendered":"<h2>The uncomfortable truth: you are flying blind<\/h2>\n<p>Most AI incidents are not outages. They are quiet quality regressions, silent cost blowups, and vendor drift that no one notices for weeks. Dashboards are green while your LLM starts citing stale PDFs, a prompt tweak doubles token burn, or a model update changes how tools are called. By the time support tickets or NPS catch it, the damage is done.<\/p>\n<p>I rarely see AI observability done right. Teams log request and response JSON and call it a day. That is not observability. That is a transcript. You need pipeline traces, semantic diffs, evaluation hooks, and version lineage tied to unit economics. Otherwise you are guessing.<\/p>\n<h2>Where this breaks in real systems<\/h2>\n<h3>Symptoms in production<\/h3>\n<ul>\n<li>Quality dips show up as vague complaints: hallucinations, wrong links, repetitive answers, tool loops<\/li>\n<li>Costs creep up 20 to 50 percent month over month with no clear reason<\/li>\n<li>Latency tail gets worse even though infra looks fine<\/li>\n<li>Releases are slow because no one can prove that new prompts or models are safe<\/li>\n<li>Debugging takes days because you cannot reproduce the exact run context<\/li>\n<\/ul>\n<h3>Why it happens<\/h3>\n<ul>\n<li>Traditional APM stops at the API boundary. AI pipelines are multi-hop: retrieval, reranking, function calls, synthesis, moderation<\/li>\n<li>Key signals are semantic. 
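<\/li>\n<\/ul>\n<p>To make the semantic point concrete, here is a toy version of the groundedness check discussed below: a bare token-overlap score. The function name and examples are illustrative, not a standard API:<\/p>

```python
def overlap_groundedness(answer: str, retrieved_docs: list[str]) -> float:
    """Fraction of answer tokens that appear in any retrieved doc.

    Toy check: real systems normalize, stem, and weight spans, but even
    this catches fluent answers with no support in the retrieved context.
    """
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    doc_tokens = set()
    for doc in retrieved_docs:
        doc_tokens.update(doc.lower().split())
    return len(answer_tokens & doc_tokens) / len(answer_tokens)

# A grounded answer scores high; an unsupported one scores low.
docs = ["the invoice is due on march 3 per contract 17b"]
print(overlap_groundedness("invoice due march 3", docs))  # 1.0
print(overlap_groundedness("refund was approved", docs))  # 0.0
```

\n<ul>\n<li>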
A 200 OK with a fluent answer may still be wrong or ungrounded<\/li>\n<li>Model providers change behavior, safety filters, and tokenization. Silent drift is real<\/li>\n<li>Cardinality explodes. User, session, doc ids, tool names, model versions, prompt hashes. If you do not design for it, your telemetry bill will punch you in the face<\/li>\n<\/ul>\n<h3>What most teams misunderstand<\/h3>\n<ul>\n<li>You cannot evaluate quality only offline. You need lightweight online checks with smart sampling and strict PII handling<\/li>\n<li>Cost is not dollars per 1k tokens. It is cost per unit of value. Per ticket resolved, per SQL generated, per doc summarized accurately<\/li>\n<li>\u201cStore the whole payload\u201d is not a strategy. You need pointers, retention policies, and redaction. Keep the ability to reproduce without leaking data<\/li>\n<\/ul>\n<h2>Technical deep dive: what to instrument and how<\/h2>\n<h3>Trace the pipeline, not just the API call<\/h3>\n<p>Use distributed tracing for every hop in the LLM chain. If you already run OpenTelemetry, extend it. 
Each request gets a correlation id across:<\/p>\n<ul>\n<li>Input normalization span: user, channel, feature flags, A\/B bucket<\/li>\n<li>Retrieval span: vector store query, filters, top-k, latency, result ids, recency, index version<\/li>\n<li>Rerank span: model id, scores, chosen docs<\/li>\n<li>Tool spans: tool name, arguments, retries, status, latency<\/li>\n<li>Generation span: model id, temperature, system prompt hash, user prompt hash, output tokens, provider request id<\/li>\n<li>Safety span: categories hit, scores, action taken<\/li>\n<li>Post-processing span: structured extraction, schema validation results<\/li>\n<\/ul>\n<p>For each span, capture:<\/p>\n<ul>\n<li>Minimal structured fields for joins: <code>trace_id<\/code>, <code>span_id<\/code>, <code>user_id<\/code> or hashed surrogate key, <code>session_id<\/code>, model version, prompt hash, embedding version, index version, tool version<\/li>\n<li>Numeric metrics: latency, input and output tokens, cost estimate, cache hit, retries<\/li>\n<li>Pointers to payloads: <code>s3_uri<\/code> or <code>blob_id<\/code> with strict access control. Do not stuff full texts in your tracing backend<\/li>\n<\/ul>\n<h3>Log what matters, without burning money<\/h3>\n<ul>\n<li>Sample raw payloads by policy. Example: 1 percent of happy-path, 10 percent of tool errors, 100 percent of safety blocks, 100 percent of high-value customers<\/li>\n<li>Apply PII scrubbing before storage. Redact obvious fields, tokenize emails and phone numbers, mask secrets<\/li>\n<li>Keep raw text out of timeseries stores. Put it in cheap object storage with a WORM policy and envelope encryption. Store only pointers in traces<\/li>\n<li>Control cardinality. Normalize labels like <code>model=gpt-4o<\/code>, <code>tool=sql_agent<\/code>, <code>index=docs_v23_2025-07-01<\/code>. 
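<\/li>\n<\/ul>\n<p>One way to enforce that is a tiny normalizer in front of your metrics client. A sketch; the allowlist contents, label names, and the <code>other<\/code> fallback are assumptions for illustration, not a standard:<\/p>

```python
# Keep metric label values inside a fixed, known set so cardinality
# stays bounded. Unknown values collapse into a single "other" bucket.
ALLOWED_LABELS = {
    "model": {"gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"},
    "tool": {"sql_agent", "web_search", "doc_lookup"},
}

def normalize_labels(raw: dict[str, str]) -> dict[str, str]:
    """Drop labels we never allow; clamp values to the allowlist."""
    out = {}
    for key, value in raw.items():
        if key not in ALLOWED_LABELS:
            continue  # e.g. doc titles, user ids: never metric labels
        out[key] = value if value in ALLOWED_LABELS[key] else "other"
    return out

print(normalize_labels({"model": "gpt-4o", "doc_title": "Q3 report"}))
# {'model': 'gpt-4o'}
```

\n<ul>\n<li>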
Avoid dynamic labels like doc titles<\/li>\n<\/ul>\n<h3>Evaluation built into observability<\/h3>\n<p>Add lightweight online checks:<\/p>\n<ul>\n<li>Groundedness: for RAG, check that top-k retrieved docs contain answer spans. Simple overlap score works surprisingly well<\/li>\n<li>Source coverage: percent of generations with at least one source reference or citation id<\/li>\n<li>Constraint adherence: JSON schema validation pass rate, function argument coercion rate<\/li>\n<li>Outcome signals: user accept rate, support deflection, query success from downstream metrics<\/li>\n<\/ul>\n<p>Then run async, heavier evals on sampled traffic daily:<\/p>\n<ul>\n<li>LLM-as-judge with stable reference prompts. Track drift in judge outputs too<\/li>\n<li>Golden set replay: stable prompts and expected answers through the current stack. Fail the release if quality drops beyond budget<\/li>\n<li>Vendor drift check: weekly cross-provider reruns on a fixed dataset to catch behavior changes<\/li>\n<\/ul>\n<h3>Version everything<\/h3>\n<p>Attach these fields to every trace:<\/p>\n<ul>\n<li><code>model_provider<\/code>, <code>model_id<\/code>, <code>model_snapshot<\/code> if available<\/li>\n<li><code>prompt_hash<\/code> for each role block, not a single blob<\/li>\n<li><code>retrieval_index_version<\/code>, <code>embedding_model_version<\/code><\/li>\n<li><code>tool_version<\/code> and <code>tool_config_hash<\/code><\/li>\n<li><code>app_commit_sha<\/code> or build id, plus feature flag set<\/li>\n<\/ul>\n<p>If you cannot reconstruct a production run locally in under 5 minutes, you are not versioning enough.<\/p>\n<h2>Practical patterns that work<\/h2>\n<h3>Minimum viable observability in 2 weeks<\/h3>\n<ul>\n<li>Correlation id everywhere. Propagate on inbound, attach to all spans, logs, and downstream calls<\/li>\n<li>Trace the pipeline with OpenTelemetry and send spans to whatever you already use: Datadog, Honeycomb, Tempo. 
Add a link to payload storage<\/li>\n<li>Add token and cost meters: <code>input_tokens<\/code>, <code>output_tokens<\/code>, <code>estimated_cost_usd<\/code>. Compute cost in the app, not in dashboards<\/li>\n<li>Build a single dashboard per product path with these SLOs:\n<ul>\n<li>p95 latency by stage and end-to-end<\/li>\n<li>success rate, tool error rate, safety block rate<\/li>\n<li>groundedness score distribution for RAG<\/li>\n<li>cache hit rate, index freshness<\/li>\n<li>cost per successful outcome<\/li>\n<\/ul>\n<\/li>\n<li>Implement sampling with a policy file. Keep it in git. Changes go through PRs<\/li>\n<\/ul>\n<h3>Production hardening over 90 days<\/h3>\n<ul>\n<li>Data contracts. Define an event schema for AI runs and enforce it with tests. Break the build if required fields go missing<\/li>\n<li>Canary everything. Roll out new prompts or models to 5 percent of traffic with hard kill switches. Gate on quality budgets<\/li>\n<li>Shadow mode for tools. Before letting the agent write to production systems, run in read-only and compare suggested vs executed actions<\/li>\n<li>Backfill-friendly storage. Store all AI events in a columnar warehouse. Join traces with business events for true cost per value<\/li>\n<li>Automated drift detection. Trigger alerts when:\n<ul>\n<li>Top doc domains in retrieval change abruptly<\/li>\n<li>Judge-based quality drops more than a threshold<\/li>\n<li>Output token distribution shifts significantly<\/li>\n<li>Provider request error mix changes<\/li>\n<\/ul>\n<\/li>\n<li>Runbooks that engineers actually use. 
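<\/li>\n<\/ul>\n<p>The drift triggers above are cheap to prototype. A sketch using a population stability index over output-token histograms; the bucket edges and alert thresholds are common conventions, not fixed rules:<\/p>

```python
import math

def psi(expected: list[int], observed: list[int]) -> float:
    """Population stability index between two histograms (same buckets).

    Common rule of thumb: < 0.1 stable, 0.1-0.2 watch, > 0.2 alert.
    """
    e_total, o_total = sum(expected), sum(observed)
    score = 0.0
    for e, o in zip(expected, observed):
        # Small floor avoids log/division blowups on empty buckets.
        e_pct = max(e / e_total, 1e-6)
        o_pct = max(o / o_total, 1e-6)
        score += (o_pct - e_pct) * math.log(o_pct / e_pct)
    return score

# Output-token counts bucketed per day: last week's baseline vs today.
baseline = [500, 300, 150, 50]       # e.g. <128, <256, <512, >=512 tokens
today_ok = [480, 310, 160, 50]
today_bloated = [200, 250, 300, 250]

assert psi(baseline, today_ok) < 0.1        # stable
assert psi(baseline, today_bloated) > 0.2   # prompt bloat: alert
```

\n<ul>\n<li>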
Each alert must have: where to look, how to repro, how to roll back, known bad versions<\/li>\n<\/ul>\n<h2>Architecture reference that avoids surprises<\/h2>\n<ul>\n<li>Ingestion: app emits <code>ai_run_started<\/code> and <code>ai_run_finished<\/code> events to a message bus with correlation ids<\/li>\n<li>Tracing: OpenTelemetry spans per stage with just enough labels, no raw payloads<\/li>\n<li>Payloads: redact then store prompts, retrieved chunks, and completions in object storage keyed by <code>trace_id\/span_id<\/code><\/li>\n<li>Metrics: latency, token counts, cache hits pushed to your metrics backend with fixed label sets<\/li>\n<li>Warehouse: AI events joined with product outcomes, customer tier, cost rates. This is where you compute cost per successful action<\/li>\n<li>Evaluator service: consumes sampled runs nightly, runs judges and schema checks, writes scores back to warehouse and emits alerts<\/li>\n<li>Governance: secrets and PII policies baked into ingestion. Access to payloads behind just-in-time approval<\/li>\n<\/ul>\n<p>Trade-offs:<\/p>\n<ul>\n<li>More aggressive sampling (keeping fewer runs) lowers storage costs but increases time to detect regressions. Tune per segment<\/li>\n<li>Inline checks add a few ms. Worth it. Heavy eval stays async<\/li>\n<li>Storing pointers keeps tracing cheap, but complicates investigations. Build a one-click fetch in your internal tool<\/li>\n<li>Vendor SDKs give you convenience but hide request ids and retries. Wrap them so you still capture the guts<\/li>\n<\/ul>\n<h2>Failure modes I keep seeing<\/h2>\n<ul>\n<li>Prompt change increases tool verbosity, which explodes output tokens by 30 percent overnight. No alert because success rate looked fine<\/li>\n<li>Vector index silently lags because a nightly job failed. Retrieval returns old docs and groundedness degrades for weeks<\/li>\n<li>Provider tweaks function calling. Your agent loops subtly more often. 
Latency p95 climbs, cost per outcome spikes<\/li>\n<li>Cache key changed with a library update. Hit rate drops to near zero. Token burn climbs fast<\/li>\n<li>JSON extraction passes locally but fails in production due to non-breaking schema drift. No schema adherence metric, so it hides<\/li>\n<\/ul>\n<h2>Business impact you can actually measure<\/h2>\n<ul>\n<li>Cost: Teams that add token and cache metrics tied to outcomes usually cut spend 15 to 35 percent in a month. Most savings come from fixing cache keys, prompt bloat, and tool loop retries<\/li>\n<li>Performance: Stage-level latency breakdowns show you where to optimize. It is common to cut p95 by 20 percent by fixing retrieval and tool timeouts before touching the model<\/li>\n<li>Risk: Canary and quality gates reduce bad releases. You trade an extra hour of review for not shipping a 2-week regression<\/li>\n<li>Speed: With traceable runs and version lineage, reproducing incidents drops from days to minutes. That is real engineering time back<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Observability for AI is pipeline-first and semantics-aware. Plain request logs are not enough<\/li>\n<li>Version everything that can change behavior: prompts, models, indexes, tools, flags<\/li>\n<li>Measure cost per unit of value, not just per 1k tokens<\/li>\n<li>Add light online checks and heavy offline evals with sampling. Tie both into alerts<\/li>\n<li>Keep raw payloads out of your metrics stack. Store pointers with strict access<\/li>\n<li>Control cardinality up front or your telemetry bill will control you<\/li>\n<li>Build canary and rollback paths before you need them<\/li>\n<\/ul>\n<h2>If you need a hand<\/h2>\n<p>If your AI stack is showing weird quality dips, runaway costs, or you just cannot reproduce issues, this is fixable. I help teams design and implement practical AI observability that engineers will actually use, without blowing up your telemetry bill. 
If that sounds familiar, this is exactly the kind of thing I fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The uncomfortable truth: you are flying blind Most AI incidents are not outages. They are quiet quality regressions, silent cost blowups, and vendor drift that no one notices for weeks&#8230;. <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[17,22,21],"class_list":["post-33","post","type-post","status-publish","format-standard","hentry","category-mlops-llmops","tag-ai-cost","tag-ai-observability","tag-llmops"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/33","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=33"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/33\/revisions"}],"predecessor-version":[{"id":79,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/33\/revisions\/79"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=33"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=33"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=33"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}