Your offline eval says 92% accuracy. Your users bail at the spinner. I have seen a 30% drop in chat engagement when time-to-first-token drifted from 500 ms to 1.8 s, with the same model and same prompts. In production, the right model is the one that hits your latency SLO while staying above your minimum acceptable quality. Everything else is a science project.
Where the accuracy-latency problem actually shows up
- RAG gone slow: You add re-rankers, expand k, and stuff longer context. Accuracy plateaus while p95 doubles.
- Agent orchestration: Tool-calling chains stack three model calls and a few HTTP round trips. Each is fast on its own; together they blow your UX budget.
- Structured outputs: JSON-constrained decoding to avoid post-processing. Great quality, huge token-level slowdown.
- Vendor roulette: Bursty traffic hits rate limits. Retries hide in SDKs. p50 looks fine, p95 is on fire.
Why it happens:
- Teams rarely set a hard time budget per request path. Without a budget, every “small” improvement ships and your p95 becomes a graveyard of good intentions.
- Quality is measured offline and averaged. Users live at p95. Accuracy gains that cost tail latency are usually a net loss.
- Context inflation. Fat prompts are the new N+1 query.
What most teams misunderstand:
- The frontier is not model-size-only. It is the combined effect of retrieval, compression, decoding strategy, batching, and caching.
- p50 latency is a vanity metric. Optimize to your abandonment threshold at p95.
- You do not need maximum accuracy. You need the minimum accuracy that still converts, with headroom for scale and bad days.
Technical deep dive: designing for the frontier
Set budgets before touching models
Pick explicit budgets by interaction type:
- Typeahead or autocomplete: TTFB under 60 ms, full under 120 ms
- Search with re-rank: TTFB under 150 ms, p95 under 300 ms
- Chat assistant: TTFB under 400–600 ms, full under 2–4 s
- Agent tool call: Each hop under 300 ms service time, 2–3 hops max on hot path
Put these budgets in code. Log them. Fail fast when breached.
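A minimal sketch of putting budgets in code, assuming a per-path table like the one above. `BUDGETS_MS`, `RequestBudget`, and `BudgetExceeded` are illustrative names, not a specific library:

```python
import time

# Hypothetical per-path budgets in milliseconds; tune to your own flows.
BUDGETS_MS = {
    "autocomplete": 120,
    "search_rerank": 300,
    "chat": 4000,
}

class BudgetExceeded(Exception):
    pass

class RequestBudget:
    """Tracks elapsed time against a path's budget and fails fast."""

    def __init__(self, path: str):
        self.path = path
        self.budget_ms = BUDGETS_MS[path]
        self.start = time.monotonic()

    def elapsed_ms(self) -> float:
        return (time.monotonic() - self.start) * 1000

    def remaining_ms(self) -> float:
        return self.budget_ms - self.elapsed_ms()

    def check(self, step: str) -> None:
        # Call at each hop: raise as soon as the budget is breached
        # instead of letting the request limp past the SLO.
        if self.remaining_ms() <= 0:
            raise BudgetExceeded(f"{self.path}: budget blown at step {step}")
```

Call `check()` before each expensive hop and log `elapsed_ms()` per step; the log is the raw material for the budget profiler described later.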
Retrieval is usually your first latency and accuracy lever
- Keep k small by default. Start with k=8. Only escalate k if uncertainty is high.
- Use asymmetric cascades: fast bi-encoder for recall, then optional cross-encoder re-rank on top 20. Not 200.
- Chunking matters more than people admit. Overlapping 20–30% with 300–600 token chunks tends to be a good balance. Giant chunks tank latency and often reduce accuracy by diluting relevance.
- Cache aggressively where data is stable: embedding vectors, ANN search results, re-ranker outputs. Invalidate with content versions, not TTL hand-waving.
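The asymmetric cascade above can be sketched as follows. `cheap_score` and `cross_encoder_score` are toy stand-ins for a real bi-encoder and cross-encoder, so the example runs without ML dependencies:

```python
def cheap_score(query: str, doc: str) -> float:
    # Stand-in for a fast bi-encoder: word overlap as a crude proxy.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for the expensive cross-encoder; a scaled cheap score
    # here, only so the example runs end to end.
    return 2.0 * cheap_score(query, doc)

def retrieve(query: str, corpus: list[str], k: int = 8,
             rerank_top: int = 20) -> list[str]:
    # Stage 1: fast recall over the whole corpus.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)
    # Stage 2: expensive re-rank only on the head, never the full list.
    head = candidates[:rerank_top]
    head.sort(key=lambda d: cross_encoder_score(query, d), reverse=True)
    return head[:k]
```

The shape is the point: the cross-encoder touches at most `rerank_top` documents regardless of corpus size, which is what keeps p95 bounded.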
Generation: where small switches pay big dividends
- Temperature and top-p affect latency mainly through output length: looser sampling tends to ramble, and every extra token extends tail time. Check your average output length before you start buying more GPUs.
- Streaming hides some latency. Optimize TTFB even if final time is longer. Users forgive ongoing generation more than blank screens.
- Constrained decoding is expensive. If you need JSON, consider partial constraints with a JSON validator fallback. It is often faster overall than strict token-level constraints.
- Long system prompts are budget killers. Move policy and formatting instructions into few-shot exemplars or tool contracts. Measure the token reduction. Every 100 prompt tokens matters.
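One way to make the token reduction measurable is a prompt builder with a hard token budget. This is a sketch under assumptions: `approx_tokens` is a crude word-count proxy you would replace with your real tokenizer (e.g. tiktoken), and chunks are assumed pre-ranked by relevance:

```python
def approx_tokens(text: str) -> int:
    # Crude proxy: ~1.3 tokens per whitespace word. Swap in your real
    # tokenizer for production numbers.
    return int(len(text.split()) * 1.3)

def build_prompt(system: str, chunks: list[str], question: str,
                 budget_tokens: int = 2000) -> str:
    parts = [system]
    used = approx_tokens(system) + approx_tokens(question)
    for chunk in chunks:  # assumed already ranked by relevance
        cost = approx_tokens(chunk)
        if used + cost > budget_tokens:
            break  # drop the tail instead of blowing the budget
        parts.append(chunk)
        used += cost
    parts.append(question)
    return "\n\n".join(parts)
```

Because the budget is explicit, a shrinking `budget_tokens` forces the context-inflation conversation instead of letting it happen silently.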
Model strategy
- Distill or fine-tune a smaller model for the 80% path. Escalate to a larger model only on uncertain or high-value queries.
- Speculative decoding helps. A small draft model proposes tokens, a larger model verifies. Works well when you own the stack or your vendor supports it.
- Quantization is not free. INT8 is often fine; INT4 can dent quality on long-context or math-heavy tasks. Always A/B on your own eval set, not a public benchmark.
- Early-exit layers on encoder models can cut latency with minimal hit to quality, especially for re-rankers.
Orchestration patterns that help you hit SLOs
- Parallelize independent steps. Start retrieval, intent classification, and metadata fetch together. Join with a 150–200 ms soft timeout.
- Cascades by uncertainty. Route to higher-cost paths only when the cheap model is unsure. Keep the fallback budget smaller than the original call.
- Asynchronous refinement. Return a good-enough answer quickly and refine in the background. For enterprise search, swap in a re-ranked list a few hundred ms later.
- Dynamic quality tiers by user value. Logged-in power users or revenue-critical flows get the slow path if needed. Everyone else gets the fast path.
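The parallel fan-out with a soft timeout might look like this with Python's asyncio. The fetch functions are stubs with illustrative delays; callers fall back to defaults for any key missing from the result:

```python
import asyncio

async def fetch_retrieval():
    await asyncio.sleep(0.05)
    return ["chunk-a", "chunk-b"]

async def classify_intent():
    await asyncio.sleep(0.02)
    return "search"

async def fetch_metadata():
    await asyncio.sleep(0.5)  # deliberately slow: will miss the timeout
    return {"tier": "free"}

async def gather_with_soft_timeout(timeout_s: float = 0.2) -> dict:
    tasks = {
        "retrieval": asyncio.create_task(fetch_retrieval()),
        "intent": asyncio.create_task(classify_intent()),
        "metadata": asyncio.create_task(fetch_metadata()),
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_s)
    for t in pending:
        t.cancel()  # stragglers do not get to hold the request hostage
    return {name: t.result() for name, t in tasks.items() if t in done}
```

The slow hop degrades gracefully (the caller proceeds without metadata) instead of dragging the whole join to its p95.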
Failure modes to watch
- p50 looks great because you silently retry. Your users see twice the time. Instrument retries and report client-visible latency, not server-side time.
- Batching gone wrong. Micro-batching helps GPU utilization but can explode p95 if queues back up. Use adaptive queues with age-based drop.
- Context drift. A growing prompt becomes a dependency you never profile again. Put a token budget guardrail in CI.
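A token-budget guardrail can be a plain test in CI. Everything here is illustrative: the prompt, the budget, and the crude `rough_token_count` proxy you would replace with a real tokenizer:

```python
# Fail the build if the system prompt grows past its token budget.

SYSTEM_PROMPT = (
    "You are a support assistant. Answer concisely. "
    "Cite sources when available."
)
PROMPT_TOKEN_BUDGET = 200  # hypothetical budget for this flow

def rough_token_count(text: str) -> int:
    return int(len(text.split()) * 1.3)  # crude word-to-token proxy

def test_system_prompt_within_budget():
    used = rough_token_count(SYSTEM_PROMPT)
    assert used <= PROMPT_TOKEN_BUDGET, (
        f"System prompt uses ~{used} tokens; budget is {PROMPT_TOKEN_BUDGET}"
    )
```

The test forces a deliberate decision: whoever fattens the prompt also has to raise the budget in the same diff, where a reviewer will see it.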
Practical fixes that work
- Build a latency budget profiler. Per request, log time spent in: auth, feature fetch, embedding cache hit/miss, ANN search, re-rank, prompt build, model queue, TTFB, tokens/sec. Ship this as a dashboard. It stops arguments.
- Add an uncertainty score and route. Options: entropy from logits, agreement between small and medium models, retrieval overlap. Tune the threshold until your escalation rate is 10–20%.
- Keep a narrow context. Prefer 4–8 highly relevant chunks plus a short synthesis step over stuffing 30 chunks. Add a re-ranker if needed, not more context.
- Introduce a two-pass decode for structured tasks. First pass generates minimal fields. Second pass fills details if time remains.
- Cache at the right layer. Prompt-level caching for common system instructions. Response caching for deterministic or idempotent answers. Memoize tool calls with stable inputs.
- Contract your vendors around p95 and cold start. Negotiate hard on rate limits, burst capacity, and tokens-per-minute. Your incident rate is a function of their tail behavior, not their averages.
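The entropy-based uncertainty score from the routing fix above can be sketched like this. The 1.0-nat threshold is a placeholder you would tune until escalation lands in the 10–20% range:

```python
import math

def token_entropy(probs: list[float]) -> float:
    # Shannon entropy of the next-token distribution, in nats:
    # H = -sum(p * ln p). Near 0 when the model is confident.
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(probs: list[float], threshold: float = 1.0) -> str:
    # Low entropy: the cheap model's answer stands.
    # High entropy: escalate to the expensive path.
    return "escalate" if token_entropy(probs) > threshold else "cheap"
```

In practice you would aggregate entropy over the first few generated tokens rather than one distribution, but the routing decision has the same shape.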
Business impact you can actually plan around
- Cost: Higher accuracy almost always means more tokens, more model calls, or bigger models. If your CAC depends on a 2 s response, every 500 ms you add will show up as lower conversion long before you see a line item on the cloud bill.
- Performance: Users remember the slow moments. Protect p95. A single slow hop in a 3-call chain is enough to tank trust and NPS.
- Scaling risk: Spiky traffic plus micro-batching equals p95 blowups. If your architecture only works when traffic is smooth, it will fail at launch events and Monday 9 AM spikes.
How to choose the right point on the curve
- Define minimum viable quality per flow. Not global. Define for chat, search, agent actions.
- Run a sweep. Vary model size, decoding params, context length, k, and re-ranker usage. Record accuracy on your evals with TTFB and p95.
- Plot your Pareto frontier. Pick the point that stays under your time budget with 20% headroom and meets quality. Ship that. Keep the next two candidates prewired for quick swap.
- Re-run monthly. Content changes. Traffic changes. Vendors change. Your frontier moves.
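Picking the point on the frontier can be automated once you have sweep results. A minimal sketch; the candidate numbers are made up for illustration, and `pick` keeps only candidates that meet quality and fit the budget with headroom:

```python
# Illustrative sweep results, not measurements.
CANDIDATES = [
    {"name": "small-distilled", "accuracy": 0.86, "p95_ms": 900},
    {"name": "medium",          "accuracy": 0.90, "p95_ms": 1600},
    {"name": "large",           "accuracy": 0.93, "p95_ms": 3800},
]

def pick(candidates, min_accuracy: float, budget_ms: float,
         headroom: float = 0.2):
    # Enforce the 20% headroom by shrinking the effective budget.
    limit = budget_ms * (1 - headroom)
    viable = [c for c in candidates
              if c["accuracy"] >= min_accuracy and c["p95_ms"] <= limit]
    # Ship the fastest viable point; the rest stay prewired for swaps.
    return sorted(viable, key=lambda c: c["p95_ms"])
```

Re-running this monthly is cheap once the sweep is scripted, which is the whole argument for scripting it.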
Key takeaways
- P95 is the real product metric. Optimize to the abandonment threshold, not p50.
- Fat prompts are latency debt. Reduce system tokens before upgrading hardware.
- Retrieval quality beats longer context. Cascade with re-rankers instead of stuffing.
- Use uncertainty routing and tiered models. Pay for accuracy only when it matters.
- Instrument end-to-end. You cannot fix what you cannot see, and retries lie.
If you are stuck
If you are fighting p95 while chasing quality, or your RAG accuracy flattened after you added more context, I have been in that hole. This is exactly the kind of thing I help teams fix when systems start breaking at scale. Happy to look at traces and budgets and give you a blunt read on where to regain speed without tanking quality.

