The dashboard says 92% accuracy. Your users disagree.
If your eval sheet shows high scores but support tickets are spiking, you do not have a model problem. You have a measurement problem. I see this pattern in RAG assistants, helpdesk bots, summarizers, even code agents. Pretty charts hide ugly tails. Teams ship with confidence, then get blindsided by hallucinations in the exact segments that matter.
This post is not a tutorial. It is a field note on why your metrics lie and what to do about it.
Where the problem shows up and why
- RAG QA: Exact-match or ROUGE says you are great. Users see confident, wrong answers for queries outside your doc coverage.
- Summarization: BLEU looks fine. Execs complain the summary missed the only decision they cared about.
- Classifiers and routers: Macro F1 is solid. Finance-specific intents tank and send payroll cases to a generalist queue.
- Agents and tool use: Pass@k looks impressive. In production, the agent burns API quota retrying tools it should have abandoned.
Why this happens in real systems:
- Open-ended tasks do not have one correct string. Text-similarity metrics over-reward paraphrase and under-reward factuality.
- Average scores hide the tail. P95 quality is what users feel.
- LLM-as-judge is biased and unstable if you do not calibrate it. One prompt change moves your “groundedness” score by 10 points.
- Offline evals ignore retrieval coverage and tool reliability. The generator learns to hallucinate around missing context and still pass BLEU.
- Datasets are contaminated or too easy. Synthetic sets created by the same model you evaluate will flatter you.
Most teams get two things wrong: 1) evaluation must reflect your architecture, not just the final output, and 2) quality must be measured against business risk, not average n-gram overlap.
Technical deep dive: measure the system you actually built
Think in components and end-to-end
If your system is RAG + tools + generation, evaluate each stage and the whole journey.
Component metrics that matter:
- Retrieval
- Query understanding coverage: % of queries that map to known intents or doc segments
- Recall@k on answerable queries: did we fetch any chunk that contains evidence
- Negative controls: % of no-answer queries that correctly return no evidence
- Reranker precision@k and AUC on human-marked relevant pairs
- Tooling
- Tool success rate: % of calls that return parsable, semantically valid results
- Plan efficiency: average tool calls per successful task; early-stop correctness
- Backoff correctness: when a tool fails, % of times we escalate or say “cannot complete”
- Generation
- Groundedness: % claims supported by cited context
- Citation coverage: % of key facts with a citation to retrieved text
- Refusal accuracy: correct refusals on jailbreak and policy tests
- System
- Latency budget by stage, P95
- Cost per successful task, not per request
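To make the retrieval metrics concrete, here is a minimal sketch of recall@k and negative-control pass rate. The `RetrievalResult` shape is illustrative, not a library type; adapt it to whatever your traces actually record.

```python
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    query_id: str
    retrieved_chunk_ids: list  # top-k chunk IDs, in rank order, post-threshold
    gold_chunk_ids: set        # empty set marks a no-answer (negative control) query

def recall_at_k(results, k):
    """Recall@k on answerable queries: fraction where at least one
    gold chunk appears in the top-k retrieved chunks."""
    answerable = [r for r in results if r.gold_chunk_ids]
    hits = sum(1 for r in answerable
               if set(r.retrieved_chunk_ids[:k]) & r.gold_chunk_ids)
    return hits / len(answerable) if answerable else 0.0

def negative_control_pass_rate(results):
    """No-answer queries should surface no evidence: a pass is a
    negative control whose post-threshold retrieval set is empty."""
    controls = [r for r in results if not r.gold_chunk_ids]
    passed = sum(1 for r in controls if not r.retrieved_chunk_ids)
    return passed / len(controls) if controls else 1.0
```

Report both numbers together. High recall with a failing negative-control rate means the system fetches something for everything, which is exactly the setup for confident hallucination.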
End-to-end metrics that matter:
- Task success rate by segment and difficulty tier
- P95 quality score on a calibrated scale tied to business acceptance
- Abstention quality: correctness of “I don’t know” decisions
- Escalation rate and handoff quality to humans
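Per-segment task success is trivial to compute and easy to skip. A sketch, assuming each eval record reduces to a (segment, passed) pair:

```python
from collections import defaultdict

def success_by_segment(records):
    """records: iterable of (segment, passed) pairs. Returns per-segment
    task success rate so a weak segment cannot hide behind the aggregate."""
    totals = defaultdict(lambda: [0, 0])
    for segment, passed in records:
        totals[segment][0] += int(passed)
        totals[segment][1] += 1
    return {seg: wins / n for seg, (wins, n) in totals.items()}
```

The same shape works for abstention quality: feed it (segment, abstained_correctly) pairs instead.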
The trap of LLM-as-judge
I am not against LLM judges. I am against uncalibrated judges.
Common failure modes:
- Judge prompt anchors the outcome. Change one example and your win rates swing.
- Same-model judging. The model favors its own phrasing.
- Single-judge instability. Day to day variance masks regressions.
Fixes that work:
- Use pairwise evaluation with win/tie/loss, not absolute 1-10 scores.
- Judge ensembles: mix models and prompts, majority vote or calibrated weighted vote.
- Human anchor sets: 100-300 items periodically adjudicated by humans, used to compute judge drift and agreement (Cohen’s kappa or Krippendorff’s alpha).
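Cohen's kappa is a few lines of code, so there is no excuse not to track judge agreement against the human anchor set. A self-contained version for two raters over categorical labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters, e.g. an LLM judge
    versus the human anchor set. 1.0 is perfect, 0.0 is chance level."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Run it monthly, judge versus human panel; a drop in kappa is a judge regression even if your headline scores look stable.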
Datasets that do not lie
- Stratify by intent, domain, difficulty, and customer segment. Report per-bucket, not just aggregate.
- Temporal splits. If your corpus changes weekly, eval on queries that reference last week’s changes.
- Hold back a private set for release gating. Do not tune prompts on it.
- Include negative controls: unanswerable queries, adversarial phrasings, policy edge cases.
- Limit synthetic data from the same base model. If you must, cross-generate with a stronger or different family model and spot check.
Determinism and replay
- Fix temperature, top-p, and seeds for offline evals. Save outputs.
- Version everything: prompts, tools, indexes, corpora. Pin them in traces.
- Build a replay harness: given a trace, you can fully reproduce the call graph and re-score with new judges.
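One way to keep replays honest is to fingerprint everything that must be pinned: if two traces hash the same, a replay should reproduce the call exactly. A sketch, where the field names are assumptions about your trace schema:

```python
import hashlib
import json

def trace_fingerprint(trace):
    """Deterministic digest over everything a replay needs: prompt,
    model and sampling params, index/corpus versions, tool versions."""
    pinned = {
        "prompt": trace["prompt"],
        "model": trace["model"],
        "sampling": {"temperature": trace["temperature"],
                     "top_p": trace["top_p"],
                     "seed": trace["seed"]},
        "index_version": trace["index_version"],
        "corpus_version": trace["corpus_version"],
        "tool_versions": trace["tool_versions"],
    }
    blob = json.dumps(pinned, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Store the fingerprint on every trace; when a re-scored output differs from the logged one under the same fingerprint, you have found nondeterminism worth fixing.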
Practical solutions that survive production
Define a Quality SLO you can defend
- Create an acceptance function Q(x, y) calibrated by humans that returns pass or fail for your real task definition.
- Gate releases on P95(Q) by segment, not just the mean.
- Set a risk budget for false positives vs abstentions. In compliance-heavy tasks, a correct refusal is a success.
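Gating on P95(Q) is mechanical once Q is defined. One reading of “P95 quality”, the floor that roughly 95% of tasks meet or exceed, sketched per segment with nearest-rank percentiles:

```python
def p95_quality_floor(scores):
    """Score that roughly 95% of tasks meet or exceed: sort ascending
    and cut off the worst 5% (nearest-rank, good enough for gating)."""
    ordered = sorted(scores)
    return ordered[int(0.05 * len(ordered))]

def gate_release(scores_by_segment, slo):
    """Pass only if every segment clears the SLO at its quality floor."""
    failures = {seg: p95_quality_floor(s)
                for seg, s in scores_by_segment.items()
                if p95_quality_floor(s) < slo}
    return (not failures), failures
```

The mean of a segment can look identical in both branches of the test below; only the tail cut tells them apart.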
Build a scorecard, not a single number
Include at least these dimensions and show them by segment:
- Retrieval: recall@k, negative control pass rate
- Groundedness and citation coverage
- Task success and abstention accuracy
- Safety refusal accuracy
- P95 latency and cost per successful task
Weight the scorecard by business impact. A 2% improvement on Tier A queries may be worth more than 20% on the long tail.
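The weighted scorecard itself is a one-liner; the hard part is agreeing on the weights. A sketch where weights encode business impact (the numbers in the test are made up):

```python
def weighted_scorecard(metrics, weights):
    """Combine per-dimension scores (each 0-1) into one gateable number,
    weighted by business impact instead of equal-weighting everything."""
    assert set(metrics) == set(weights)
    total = sum(weights.values())
    return sum(metrics[k] * weights[k] for k in metrics) / total
```

Publish the weights next to the scores. A scorecard whose weighting is implicit is just a single number with extra steps.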
Make judges trustworthy
- Use pairwise head-to-head with Elo-like win rates for model or prompt comparisons.
- Calibrate judges monthly against a human panel. Fire or reweight prompts that drift.
- Keep a gold adjudication set with provenance. Never tune to it.
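Elo-style ratings over pairwise judge verdicts give you one comparable number per variant without trusting absolute scores. The standard update, applied to model or prompt variants; the K-factor of 32 and the 1000 starting rating are conventional choices, not requirements:

```python
def update_elo(ratings, a, b, outcome, k=32.0):
    """outcome: 1.0 if variant a wins the pairwise comparison,
    0.5 for a tie, 0.0 if b wins. Mutates and returns ratings."""
    ra = ratings.get(a, 1000.0)
    rb = ratings.get(b, 1000.0)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[a] = ra + k * (outcome - expected_a)
    ratings[b] = rb + k * ((1.0 - outcome) - (1.0 - expected_a))
    return ratings
```

Feed it the win/tie/loss verdicts from the judge ensemble, and shuffle comparison order so early matches do not anchor the ratings.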
Test retrieval like you mean it
- Build retrieval probe sets where the answer appears in exactly one chunk so you can measure true recall.
- Add doc dropout tests: remove a relevant document and verify the system abstains instead of hallucinating.
- Track hit position distributions. If most hits are at rank 15 and you only fetch 8, you have a configuration problem.
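Hit-position tracking is just a histogram over the rank of the first relevant chunk. A sketch, assuming your probe set yields (retrieved_ids, gold_ids) pairs:

```python
from collections import Counter

def hit_rank_histogram(results):
    """Rank of the first gold chunk per query ('miss' if absent). If most
    mass sits past your fetch depth k, raise k or fix the reranker."""
    hist = Counter()
    for retrieved, gold in results:
        rank = next((i + 1 for i, c in enumerate(retrieved) if c in gold), None)
        hist["miss" if rank is None else rank] += 1
    return hist
```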
Ship with guardrails and feedback loops
- Shadow or interleave new variants with 1-5% of traffic. Compare win rates online using pairwise judges plus human spot checks.
- Confidence-aware routing: if retrieval recall or judge consensus is low, escalate or ask a clarifying question.
- Log structured traces: prompt, retrieved chunks, tool calls, model versions, costs, latencies, final output. Make offline regression analysis boring and fast.
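Confidence-aware routing can start as a pair of thresholds. A sketch; the floor values are illustrative and should be tuned on your own traces:

```python
def route(recall_estimate, judge_votes,
          recall_floor=0.6, consensus_floor=0.7):
    """Decide between answering, clarifying, and escalating.
    judge_votes: e.g. {"pass": 3, "fail": 1} from the judge ensemble."""
    if recall_estimate < recall_floor:
        return "ask_clarifying_question"
    consensus = max(judge_votes.values()) / sum(judge_votes.values())
    if consensus < consensus_floor:
        return "escalate_to_human"
    return "answer"
```

The point is the ordering: weak retrieval means the question is underspecified, so clarify; weak judge consensus on good retrieval means the answer itself is risky, so escalate.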
Optimize on the cost-quality-latency frontier
- Run sweeps across temperature, tool timeouts, retrieval k, and model choices.
- Plot cost per successful task vs P95 latency vs P95 quality. Choose points on the Pareto frontier, not just the highest offline score.
- For spiky loads, precompute or cache high-traffic answers. Measure cache hit quality, not just hit rate.
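Picking the frontier from sweep results is simple dominance filtering. A sketch over (cost_per_success, p95_latency, p95_quality) tuples, where lower cost and latency are better and higher quality is better:

```python
def pareto_frontier(points):
    """Keep configs not dominated by another that is at least as cheap,
    at least as fast, and at least as good (and differs somewhere)."""
    def dominates(q, p):
        return q != p and q[0] <= p[0] and q[1] <= p[1] and q[2] >= p[2]
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Everything off the frontier is strictly wasted money, latency, or quality; the remaining points are the only ones worth debating.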
What this means for the business
- Hidden tail risk is expensive. I have seen a system with 93% average “helpfulness” generate a compliance incident from a 2% bucket of queries. The clean-up cost dwarfed a month of model spend.
- Hallucinations cost more than abstentions. If your abstention accuracy is low, you are buying goodwill with refunds.
- Bad retrieval silently burns compute. One audit showed recall@10 at 35%. The model hallucinated plausible answers, passed BLEU, and tripled support escalations.
- Reliable evals change roadmap priorities. Teams stop chasing a 0.5 BLEU bump and focus on fixing reranking or adding negative controls, which actually moves KPIs.
- At scale, nightly offline evals with calibrated judges are cheaper than reactive hotfixes. You catch regressions before your largest customer does.
Key takeaways
- Average scores are vanity metrics. Report P95 quality and segment-level results.
- Evaluate the pipeline, not just the final text. Retrieval and tools decide most outcomes.
- LLM-as-judge is useful only with calibration, ensembles, and human anchors.
- Build datasets that fight you back: adversarial, temporal, negative controls, and private holdouts.
- Gate releases on a business-defined Quality SLO with a risk budget for abstention vs error.
- Optimize for cost per successful task, not cost per request.
- Log everything needed to replay and re-score. Determinism beats debates.
If you need a second set of eyes
If your metrics look great but your users are not seeing it, you are likely measuring the wrong thing or measuring it the wrong way. I help teams rebuild evaluation from the system up, put guardrails in place, and get confident about shipping changes without surprises. If that is where you are stuck, we can dig into your traces and fix it fast.

