{"id":34,"date":"2025-04-21T10:35:22","date_gmt":"2025-04-21T10:35:22","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/llm-latency-in-production-what-actually-works\/"},"modified":"2026-04-09T23:26:34","modified_gmt":"2026-04-09T23:26:34","slug":"llm-latency-in-production-what-actually-works","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/04\/21\/llm-latency-in-production-what-actually-works\/","title":{"rendered":"LLM Latency In Production: What Actually Works"},"content":{"rendered":"<h2>The spinner is lying to you<\/h2>\n<p>If your LLM app shows a typing effect in under 300 ms but p95 completes at 6 to 10 seconds, users feel the lag. I see this pattern in chat tools, copilots, and internal RAG portals. Teams ship something that looks fast, then watch adoption flatten because the tail is ugly and unpredictable. The common reaction is to buy a bigger model or switch providers. That rarely fixes it.<\/p>\n<p>The gap between first token and final token is where most products lose trust.<\/p>\n<h2>Where latency really comes from<\/h2>\n<p>This shows up anywhere you chain steps: retrieval, tools, safety checks, function calling, or multi-turn agents. The model is rarely the only culprit. The time budget is being burned by everything around it.<\/p>\n<p>Why it happens in real systems:<\/p>\n<ul>\n<li>Prompt bloat and long histories that explode context prep time and slow tokens per second<\/li>\n<li>Cold starts on servers, embedding models, or vector indexes<\/li>\n<li>Serialization, network hops, DNS, and TLS in the hot path<\/li>\n<li>Sequential tools that could be parallelized but are not<\/li>\n<li>Oversized models used for trivial steps like classification or guardrails<\/li>\n<li>No tail mitigation. 
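A quick way to see which of these is burning the budget is to time every stage explicitly rather than only the model call. A minimal sketch; the stage names and the sleeps standing in for real work are illustrative, not from any specific stack:

```python
import time
from contextlib import contextmanager

# Accumulates wall-clock time per pipeline stage so traces show
# where the latency budget is actually being spent.
timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

with stage("retrieval"):
    time.sleep(0.01)   # stand-in for a vector search call
with stage("prompt_build"):
    time.sleep(0.005)  # stand-in for prompt assembly and policy checks

total_ms = sum(timings.values()) * 1000
```

Logging `timings` next to the request id is what makes per-stage p95 attribution possible later.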
One slow shard or region ruins your p95<\/li>\n<\/ul>\n<p>What most teams misunderstand:<\/p>\n<ul>\n<li>They optimize model tokens per second and ignore pre and post processing<\/li>\n<li>They treat latency as a single metric instead of a budget per stage<\/li>\n<li>They rely on provider marketing numbers that do not hold at load or during peak hours<\/li>\n<li>They assume streaming hides everything. It does not. Users feel p95 and cancel rates reflect it<\/li>\n<\/ul>\n<h2>Technical deep dive: end to end, not model-only<\/h2>\n<p>Think in a latency budget. Back into it based on an interaction target. Example budget for 2.5 second p95 chat turn:<\/p>\n<ul>\n<li>150 ms network and headers total<\/li>\n<li>300 ms retrieval and ranking<\/li>\n<li>150 ms prompt build and policy checks<\/li>\n<li>1,600 ms model to final token<\/li>\n<li>300 ms tools or post processing<\/li>\n<\/ul>\n<p>Simple formula to keep yourself honest:<\/p>\n<p>T_total = T_network + T_retrieval + T_prompt + T_model + T_tools + T_post<\/p>\n<p>And for the model path:<\/p>\n<p>T_model \u2248 T_startup + tokens_generated \u00f7 tokens_per_second<\/p>\n<p>Architecture trade offs you will feel:<\/p>\n<ul>\n<li>Streaming vs throughput. Continuous batching improves server throughput but can add tens of milliseconds to first token<\/li>\n<li>Bigger model vs accuracy. A 70B can add seconds with marginal acceptance gain over a tuned 7B to 13B for many tasks<\/li>\n<li>Quantization and TensorRT improve latency but can change output distribution. You might need more guardrails after<\/li>\n<li>Managed API vs self-host. APIs are easy and good p50, but p99 can drift when multi-tenant queues surge<\/li>\n<\/ul>\n<p>Common failure modes:<\/p>\n<ul>\n<li>Head of line blocking on a single GPU because one huge context request is hogging memory<\/li>\n<li>Agent loops. Unbounded function calling or reflection steps that run the clock out<\/li>\n<li>Index fan out. 
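The model-path formula is worth encoding directly, because it makes token caps and tokens-per-second regressions easy to reason about in review. A sketch of T_model; the startup and throughput numbers are placeholders:

```python
def model_time_s(startup_s: float, tokens_generated: int, tokens_per_second: float) -> float:
    """Estimate model latency: T_model ~= T_startup + tokens_generated / tokens_per_second."""
    return startup_s + tokens_generated / tokens_per_second

# A 512-token cap at 250 tok/s with 200 ms startup costs ~2.25 s,
# which already blows a 1.6 s model budget: cap tokens or raise throughput.
estimate = model_time_s(0.2, 512, 250)
```

Run the same arithmetic against your p95 tokens-per-second, not the marketing number, before signing off on a budget.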
Hitting 4 vector stores and a SQL join in series instead of in parallel<\/li>\n<li>Cold autoscaling. New pods have to load tokenizers, weights, faiss indexes. Your p95 gets punished for minutes<\/li>\n<li>TLS and DNS churn. No connection pooling. No session resumption. You pay handshake tax repeatedly<\/li>\n<\/ul>\n<h2>Practical fixes that cut p95 without wrecking quality<\/h2>\n<p>These are the moves I use before touching model weights.<\/p>\n<h3>1) Set a hard budget and enforce it in code<\/h3>\n<ul>\n<li>Define p50, p95, p99 targets and allocate to stages<\/li>\n<li>Fail fast on budget overruns. Return partial with a fallback template instead of spinning for 8 seconds<\/li>\n<li>Add per stage timing in traces. You cannot fix what you cannot see<\/li>\n<\/ul>\n<h3>2) Pick the smallest viable model for each job<\/h3>\n<ul>\n<li>Main chat or generation model can be medium. Everything else should be tiny<\/li>\n<li>Use compact models for classification, safety, and extraction. 1 to 3 ms on CPU beats 300 ms on a giant remote model<\/li>\n<li>Cap max_tokens output. Most products never need more than 256 to 512 tokens per turn<\/li>\n<\/ul>\n<h3>3) Kill prompt bloat<\/h3>\n<ul>\n<li>System prompts longer than 1k tokens are a latency tax. Shrink and version them<\/li>\n<li>Compress history with model summaries. Keep only the last 2 to 4 raw turns. Summarize the rest with a checksum to avoid cache misses<\/li>\n<li>Use constrained decoding or JSON grammars to avoid meandering completions<\/li>\n<\/ul>\n<h3>4) Stream like you mean it<\/h3>\n<ul>\n<li>Stream from the model and to the client. Do not buffer server side to stitch tool results unless you must<\/li>\n<li>Split content and metadata channels. Users see progress while tools complete<\/li>\n<li>Always show a deterministic prefix quickly. 
Title, citations in progress, that sort of thing<\/li>\n<\/ul>\n<h3>5) Parallelize retrieval and tools<\/h3>\n<ul>\n<li>Fire vector search, keyword filters, and profile lookups in parallel with timeouts<\/li>\n<li>Use fan out with budget. Race two providers with a 150 ms hedge if p95 matters more than cost. Make requests idempotent<\/li>\n<li>If tools are slow, return a partial answer with placeholders and a follow up patch event<\/li>\n<\/ul>\n<h3>6) Cache everything that is safe to cache<\/h3>\n<ul>\n<li>Output cache keyed by normalized prompt with TTL. Works shockingly well for common tasks and templates<\/li>\n<li>Embedding cache. Use a stable normalizer and avoid accidental cache busting from whitespace<\/li>\n<li>Static RAG snippets. Pre-embed and pre-rank hot documents nightly<\/li>\n<\/ul>\n<h3>7) Put compute and data in the same place<\/h3>\n<ul>\n<li>Keep the LLM runtime, vector DB, and app in one region. Cross region RAG calls kill latency<\/li>\n<li>Use HTTP keep alive, HTTP 2 or gRPC, and connection pools. Enable TLS session resumption<\/li>\n<li>If you control the runtime, enable DNS caching and low TTLs only where needed<\/li>\n<\/ul>\n<h3>8) Choose a serving stack that respects interactivity<\/h3>\n<ul>\n<li>For self host: vLLM or TensorRT LLM with continuous batching and KV cache reuse<\/li>\n<li>Limit max batch delay to keep first token under 200 ms at p50<\/li>\n<li>Pin tokenizer to CPU and avoid reloading at request time<\/li>\n<li>Warm pools. Never scale to zero for interactive traffic<\/li>\n<\/ul>\n<h3>9) Use provider features that actually help<\/h3>\n<ul>\n<li>Speculative decoding and logit bias only if they do not destabilize outputs<\/li>\n<li>Function calling with parallel tool calls if available. Set hard timeouts per tool<\/li>\n<li>If a provider gives per token streaming speed data, watch it. Token rate sag is an early signal of throttling<\/li>\n<\/ul>\n<h3>10) Tail mitigation<\/h3>\n<ul>\n<li>Hedged requests. 
Send a duplicate to a second server after a small delay when p95 is bad. Expect 5 to 10 percent cost for a 30 to 50 percent p99 improvement<\/li>\n<li>Circuit breakers around flaky tools. Fall back to a cached or simpler path<\/li>\n<li>Per tenant quotas to prevent one noisy user from burning your GPU queue<\/li>\n<\/ul>\n<h3>11) Measure the right things<\/h3>\n<ul>\n<li>Track first token, time to usable answer, time to final, and dropped interactions<\/li>\n<li>Attribute cost and latency per stage in traces. Put budgets in logs so failures show which stage violated it<\/li>\n<li>Run soak tests at realistic concurrency. Latency curves at N=1 are fiction<\/li>\n<\/ul>\n<h3>12) Trim agents before they trim your SLOs<\/h3>\n<ul>\n<li>Cap max tool calls and reflection steps per turn<\/li>\n<li>Separate planning from execution. Run planning on a faster, smaller model<\/li>\n<li>For most business apps, two tool steps per turn is plenty. More often hurts UX<\/li>\n<\/ul>\n<h2>Real trade offs and numbers<\/h2>\n<ul>\n<li>Moving retrieval and policy checks to smaller local models commonly saves 200 to 600 ms p95 with no quality loss<\/li>\n<li>Prompt diet plus max_tokens hard caps often reduces model time by 25 to 40 percent<\/li>\n<li>Parallelizing two critical tools with a 1 second budget cut our p95 from 5.2 s to 2.8 s at a client, cost up 12 percent. They kept it because conversions jumped 18 percent<\/li>\n<li>Self hosting a 13B with vLLM on A100 40G, continuous batching, and KV cache gave us sub 200 ms first token and 1.2 to 1.8 s to 300 tokens at p50. p99 was sensitive to one long context request. We fixed it by setting a hard context max and routing long docs to an offline path<\/li>\n<\/ul>\n<h2>Business impact you can forecast<\/h2>\n<ul>\n<li>Latency is conversion. In assistive tools, every 1 second p95 improvement can lift task completion and retention. 
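The circuit breaker around a flaky tool reduces to a small piece of state: count consecutive failures, open after a threshold, and route straight to the fallback while open. A minimal sketch; the threshold is illustrative, and a production breaker would add a reset timeout with half-open probing:

```python
class CircuitBreaker:
    """Open after max_failures consecutive failures; while open,
    skip the tool entirely and serve the fallback (e.g. a cached answer)."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, tool, fallback):
        if self.failures >= self.max_failures:  # breaker is open
            return fallback()
        try:
            result = tool()
            self.failures = 0                   # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            return fallback()
```

The payoff is that an open breaker never touches the flaky tool, so its latency stops appearing in your p95 at all.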
I have seen 10 to 20 percent lifts from getting under 3 seconds p95<\/li>\n<li>Cost goes down when you right-size models and add caching, but can go up with hedging. Put a number on both. It is usually worth paying a small premium to kill p99 flakiness<\/li>\n<li>Scaling risk is tail risk. If you cannot keep p95 under budget at 5x concurrency in a load test, you will miss your SLOs in production. Plan capacity with head-of-line blocking and multi-tenancy in mind<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Treat latency as a budget across stages, not a single number<\/li>\n<li>Shrink prompts and cap tokens before chasing exotic optimizations<\/li>\n<li>Parallelize retrieval and tools with budgets and timeouts<\/li>\n<li>Cache aggressively and keep compute close to data<\/li>\n<li>Pick the smallest model that works per step. Use tiny models for side tasks<\/li>\n<li>Stream early and often, but do not hide a broken p95 behind typing effects<\/li>\n<li>Invest in tail mitigation. Hedging and circuit breakers pay for themselves<\/li>\n<li>Measure first token, usable token, and final token separately in traces<\/li>\n<\/ul>\n<h2>If you need a hand<\/h2>\n<p>If your app looks fast but your p95 keeps creeping up, or you are juggling providers and still seeing stalls, I can help. This is the work I do for teams when interactive latency and reliability start slipping at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The spinner is lying to you If your LLM app shows a typing effect in under 300 ms but p95 completes at 6 to 10 seconds, users feel the lag&#8230;. 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[17,20,24],"class_list":["post-34","post","type-post","status-publish","format-standard","hentry","category-genai-production","tag-ai-cost","tag-ai-infra","tag-ai-latency"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/34","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=34"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/34\/revisions"}],"predecessor-version":[{"id":84,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/34\/revisions\/84"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=34"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=34"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=34"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}