{"id":92,"date":"2025-08-12T10:37:21","date_gmt":"2025-08-12T10:37:21","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/08\/12\/why-your-llm-response-time-is-inconsistent\/"},"modified":"2025-08-12T10:37:21","modified_gmt":"2025-08-12T10:37:21","slug":"why-your-llm-response-time-is-inconsistent","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/08\/12\/why-your-llm-response-time-is-inconsistent\/","title":{"rendered":"Why your LLM response time is inconsistent"},"content":{"rendered":"<h2>The real reason your LLM is fast at 11 am and painful at 3 pm<\/h2>\n<p>You ship a chat feature. Median comes back in 800 ms in staging. In prod, users see 6 to 12 second stalls at random times. Dashboards say CPU looks fine. The provider status page is green. Your PM asks why it only happens during peak hours and sometimes only on long questions. You are not alone.<\/p>\n<p>I have seen this pattern across startups and public companies. Inconsistent latency is not one bug. It is usually five boring ones interacting. The good news is you can make the tail predictable without overpaying.<\/p>\n<h2>Where the problem shows up<\/h2>\n<ul>\n<li>Interactive chat and assistants where p95 matters more than median<\/li>\n<li>RAG workflows with multiple retrieval calls and tool hops<\/li>\n<li>Agents that chain model calls and silently explode tail latency<\/li>\n<li>Batch summarization where one slow shard holds the whole job hostage<\/li>\n<\/ul>\n<h2>Why it happens, and what most teams miss<\/h2>\n<p>Most teams think the model is slow. Sometimes it is. More often, the pipeline around the model creates variance. A typical path:<\/p>\n<p>Client -> API gateway -> auth -> rate limiter -> prompt assembly -> tool calls or retrieval -> model router -> batcher -> GPU prefill -> GPU decode -> stream -> safety -> post-process -> logging<\/p>\n<p>Variance accumulates at each hop. 
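<\/p>

<p>To see where the variance lives, each hop needs its own timing. A minimal sketch of per-request, per-hop instrumentation (the stage names and sleep calls are illustrative stand-ins, not a real API):<\/p>

```python
import time
from contextlib import contextmanager

class HopTimer:
    # Collects one duration per named pipeline stage for a single request.
    def __init__(self):
        self.spans = {}

    @contextmanager
    def span(self, hop_name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans[hop_name] = (time.perf_counter() - start) * 1000  # ms

timer = HopTimer()
with timer.span('prompt_assembly'):
    time.sleep(0.01)  # stand-in for real work
with timer.span('model_call'):
    time.sleep(0.02)

# Emit one structured record per request so you can slice p95 by hop later.
print({hop: round(ms, 1) for hop, ms in timer.spans.items()})
```

<p>Logged per request, this kind of record is what lets you tell queue wait from compute instead of blaming the model.<\/p>

<p>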
Key blind spots I keep seeing:<\/p>\n<ul>\n<li>No separation of time-to-first-byte vs time-to-last-token<\/li>\n<li>Queueing inside your own gateway or the provider&#8217;s batcher<\/li>\n<li>Long context and long outputs multiplied by dynamic batching<\/li>\n<li>Retry storms that turn a small blip into a 10x tail<\/li>\n<li>RAG fan-out. Ten 200 ms calls that sometimes become one 2.5 s outlier<\/li>\n<li>Logging and analytics done synchronously in the hot path<\/li>\n<\/ul>\n<p>On top of that, language and formatting change tokenization. A 1,000 character input in English might be 250 tokens. In German or with code blocks it can be 400 to 600 tokens. That directly hits prefill time. Output length does the same on decode.<\/p>\n<h2>Technical deep dive<\/h2>\n<h3>Where the time actually goes<\/h3>\n<p>Think of service time as prefill + decode + everything else.<\/p>\n<ul>\n<li>Prefill scales roughly linearly with input tokens. Long contexts and many attached files blow this up.<\/li>\n<li>Decode scales with output tokens and depends on scheduler batching and model architecture. Speculative decoding helps median, but if verification fails often, it can spike p95.<\/li>\n<li>Everything else includes network, TLS, auth, vector DB, tool APIs, safety filters, and your own post-processing.<\/li>\n<\/ul>\n<p>Your tail is a queueing problem. With variable service times, even small utilization increases explode p95 and p99. Little&#8217;s Law and M\/G\/1 queues are not optional here. If utilization crosses ~60 to 70 percent with high variance in job size, expect pain. Dynamic batching in the provider makes this worse. 
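<\/p>

<p>You can sanity-check the queueing claim with the M\/G\/1 mean-wait formula (Pollaczek-Khinchine). The numbers below are illustrative, but the shape is the point: holding mean service time fixed, raising either utilization or job-size variance multiplies the average wait, and the tail is worse than the mean.<\/p>

```python
def mg1_mean_wait(arrival_rate, mean_service, service_variance):
    # Pollaczek-Khinchine: mean wait = lambda * E[S^2] / (2 * (1 - rho)).
    rho = arrival_rate * mean_service  # utilization
    assert rho < 1, 'queue is unstable at or above 100% utilization'
    second_moment = service_variance + mean_service ** 2  # E[S^2]
    return arrival_rate * second_moment / (2 * (1 - rho))

# Same 1 s mean service time; a low-variance mix vs one with long contexts.
for variance in (0.1, 25.0):
    for rho in (0.5, 0.7, 0.9):
        wait = mg1_mean_wait(rho, 1.0, variance)
        print(f'variance={variance:>4} rho={rho} mean_wait={wait:.2f} s')
```

<p>At 90 percent utilization the high-variance mix waits over 100 s on average versus about 5 s for the low-variance one. Same hardware, same mean job size; the mix alone produces the cliff.<\/p>

<p>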
You get higher throughput but the slowest request in a batch drags everyone.<\/p>\n<h3>Provider-side realities you rarely see documented<\/h3>\n<ul>\n<li>Dynamic batching windows: a 5 to 15 ms window adds a random wait before your job even touches a GPU<\/li>\n<li>Shared tenancy: your request can co-schedule with whales that have 100k token contexts<\/li>\n<li>MoE routing variance: certain inputs trigger more experts and add unpredictable compute<\/li>\n<li>KV cache behavior: long contexts that don&#8217;t fit in the cache lead to evictions and re-materialization<\/li>\n<li>Safety and content filters: sometimes run on separate models, sometimes on CPU, sometimes inline<\/li>\n<\/ul>\n<h3>RAG and tool chains are latency amplifiers<\/h3>\n<ul>\n<li>Vector DB cold reads, index compaction, or segment merges will occasionally turn 50 ms into 800 ms<\/li>\n<li>Object storage and PDF parsing can have tail spikes in the seconds when done on-demand<\/li>\n<li>Calling multiple tools concurrently creates a race, and you always wait for the slowest if you do not design for early stop<\/li>\n<\/ul>\n<h3>Failure modes I see repeatedly<\/h3>\n<ul>\n<li>Client and server both retry on timeouts with similar backoff. You get synchronized retry storms<\/li>\n<li>Logging to your data warehouse inline. 
A backpressure event adds seconds to every request<\/li>\n<li>Auto-scaling tied to CPU or pod count, not queue depth or tokens in-flight, so it reacts too late<\/li>\n<li>NAT gateways become egress bottlenecks for provider calls, especially with HTTP\/1.1 and no keep-alive<\/li>\n<li>Timeouts longer than user patience, so the system keeps working while the user already bounced<\/li>\n<\/ul>\n<h2>Practical fixes that actually work<\/h2>\n<h3>Measure the right things first<\/h3>\n<p>Instrument end-to-end with these fields, per request:<\/p>\n<ul>\n<li>TTFB and TTLB<\/li>\n<li>Input tokens, output tokens, tokens per second<\/li>\n<li>Prefill time vs decode time (providers expose this in headers or usage APIs, otherwise estimate)<\/li>\n<li>Context length, model name, temperature, top_p, stop reason<\/li>\n<li>Number of tool calls, RAG latency breakdown, cache hit rate<\/li>\n<li>Queue wait time at your gateway and at the model router<\/li>\n<\/ul>\n<p>If you cannot see queue time versus compute time, you are tuning in the dark.<\/p>\n<h3>Shape the workload, not just the infrastructure<\/h3>\n<ul>\n<li>Cap max output tokens dynamically based on user intent. Most chat turns are fine with 256 or less. Long-form is an explicit mode<\/li>\n<li>Bucket traffic by context length. Route long contexts to a dedicated pool to protect short requests<\/li>\n<li>Disable retries on the client if the server already retries. Use jittered exponential backoff and a global cap<\/li>\n<li>Use streaming for interactive UX, but back it with a hard TTLB cutoff. If you see no token in 700 ms, fail fast or downgrade<\/li>\n<li>Make decoding more deterministic for latency critical paths. Lower temperature and avoid pathological sampling configs<\/li>\n<\/ul>\n<h3>Control batching and concurrency<\/h3>\n<ul>\n<li>For self-hosted inference, separate pools for interactive and batch. Keep interactive batch windows under 5 ms<\/li>\n<li>Concurrency control at the gateway with a queue size limit. 
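<\/li>
<\/ul>

<p>A minimal sketch of that admission check, using a non-blocking semaphore as the queue bound (the class and the limit are illustrative, not any specific gateway&#8217;s API):<\/p>

```python
import threading

class AdmissionController:
    # Bounds in-flight requests; a full queue is rejected immediately
    # instead of silently growing the wait.
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        # Non-blocking acquire: False means shed the request now and
        # serve a fast 429, a cached answer, or a fallback model.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

gate = AdmissionController(max_in_flight=2)
admitted = [gate.try_admit() for _ in range(3)]
print(admitted)  # [True, True, False]: the third request is shed, not queued
for was_admitted in admitted:
    if was_admitted:
        gate.release()
```

<p>The point is that rejection is instant and cheap. The alternative is an unbounded queue where every request pays for the backlog ahead of it.<\/p>

<ul>
<li>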
When the queue is full, serve a fast failure or a cached answer<\/li>\n<li>Auto-scale on queue depth and tokens in-flight, not CPU utilization. Warm pools to avoid cold starts<\/li>\n<li>If you use a provider, ask for priority lanes or reserved capacity for your critical path<\/li>\n<\/ul>\n<h3>RAG-specific hardening<\/h3>\n<ul>\n<li>Put strict time budgets on retrieval and tool calls. If retrieval exceeds 200 ms, degrade the context instead of waiting<\/li>\n<li>Cache aggressively. Query result caches and chunk caches remove 50 to 150 ms variance per call<\/li>\n<li>Keep top_k small. Over-retrieval adds latency and hurts answer quality anyway<\/li>\n<li>Precompute embeddings for hot content, and avoid on-demand OCR or PDF parsing in user flow<\/li>\n<li>Co-locate the vector DB and app layer in the same region with VPC peering to avoid random egress hops<\/li>\n<\/ul>\n<h3>Networking and IO basics that save seconds at p99<\/h3>\n<ul>\n<li>Use HTTP\/2 or gRPC with keep-alives. Disable Nagle if your client library lets you<\/li>\n<li>Reuse TLS sessions. Avoid re-resolving DNS on every call<\/li>\n<li>Do not sync-write logs in the hot path. Buffer and batch to a sidecar or async worker<\/li>\n<li>Cap request size. Large payloads on mobile networks create wild TTFB swings<\/li>\n<\/ul>\n<h3>Guard your pipeline from itself<\/h3>\n<ul>\n<li>Circuit breakers around tool chains and external APIs<\/li>\n<li>Timeout hierarchy: shorter upstream, longer downstream, never the other way around<\/li>\n<li>Stop sequences and early stop for long-winded outputs<\/li>\n<li>Speculative decoding only if you measure real p95 improvement. It can produce great medians with worse tails if mis-tuned<\/li>\n<\/ul>\n<h3>When the provider is the problem<\/h3>\n<ul>\n<li>Run A\/B across providers or across regions of the same provider. 
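<\/li>
<\/ul>

<p>A quick way to make that comparison concrete from your own logs. The two distributions below are synthetic and illustrative, but the pattern is the one to look for: a provider with the better median and the worse tail.<\/p>

```python
import random

def percentile(samples, pct):
    # Nearest-rank percentile; good enough for dashboards, no numpy needed.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(pct / 100 * len(ordered)))
    return ordered[idx]

random.seed(7)
# Provider A: tight latency distribution around 900 ms.
provider_a = [random.gauss(900, 80) for _ in range(2000)]
# Provider B: faster median, but ~8% of requests hit a multi-second tail.
provider_b = [random.gauss(6000, 1500) if random.random() < 0.08
              else random.gauss(700, 60) for _ in range(2000)]

for name, samples in (('A', provider_a), ('B', provider_b)):
    print(name, 'p50=%.0f ms' % percentile(samples, 50),
          'p95=%.0f ms' % percentile(samples, 95))
```

<p>A 1 tps synthetic test would crown B; under real traffic its p95 is the one your users feel.<\/p>

<ul>
<li>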
Latency variance is not uniform<\/li>\n<li>Collect per-model, per-region p95 and p99 under your traffic shape, not synthetic 1 tps tests<\/li>\n<li>Prefer models with stable decode throughput over slightly smarter ones for interactive UX<\/li>\n<li>Ask for headers or usage fields that expose batch size, queue time, and cache hits. Some vendors will enable it if you push<\/li>\n<\/ul>\n<h2>Architecture patterns that keep tails tight<\/h2>\n<ul>\n<li>Split the product into two lanes: interactive and background. Interactive gets strict SLOs, small prompts, small outputs, strong limits<\/li>\n<li>Use a broker or lightweight router that tracks in-flight tokens and applies admission control<\/li>\n<li>For agents, preplan steps with a small model, then execute with a larger one. This avoids open-ended tool loops<\/li>\n<li>Introduce an answer-first pattern: stream a short summary quickly, then continue in the background if needed<\/li>\n<li>Keep a fallback small model that can answer baseline queries under 400 ms when the main model is congested<\/li>\n<\/ul>\n<h2>Trade-offs you should accept<\/h2>\n<ul>\n<li>Slightly worse median throughput in exchange for a much better p95. Your users do not care about your median<\/li>\n<li>More strict output caps in chat. Long answers should be explicit modes, not the default<\/li>\n<li>Extra observability cost. Storing token-level metrics pays back quickly in debugging time<\/li>\n<li>Paying for reserved capacity on your critical path. Cheaper than losing trust during peak traffic<\/li>\n<\/ul>\n<h2>Business impact when you ignore this<\/h2>\n<ul>\n<li>Conversions drop and support tickets climb. Users remember the bad seconds, not your best<\/li>\n<li>Retry storms waste tokens. I have seen 20 to 30 percent token cost inflation from retries and long outputs alone<\/li>\n<li>Over-scaling to mask tail variance burns money. You add pods, p95 stays bad, the bill goes up<\/li>\n<li>Roadmaps slow down. 
Teams fear adding features that touch the model because performance is fragile<\/li>\n<\/ul>\n<p>If you are selling to enterprises, they will measure your p95. A single bad quarter of latency can kill expansion deals.<\/p>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Separate prefill, decode, and queueing time. Optimize each with different levers<\/li>\n<li>Shape requests with caps, bucketing, and time budgets. Do not just scale hardware<\/li>\n<li>Instrument token-level metrics and queue depth. Without them, everything looks like the model<\/li>\n<li>Control batching and concurrency. Interactive traffic needs small windows and admission control<\/li>\n<li>RAG and tools need strict timeouts and caches. Fan-out without budgets is p99 chaos<\/li>\n<li>Streaming helps UX but does not excuse slow TTFB. Protect first token time aggressively<\/li>\n<li>Test providers under load that matches your reality. Green status pages do not mean stable tails<\/li>\n<\/ul>\n<h2>If this sounds familiar<\/h2>\n<p>If your dashboards look fine but your users complain about random 8 second stalls, you likely have a mix of queueing, batching, and RAG variance. I help teams instrument, shape load, and redesign the hot path so p95 gets boring again. If you want a second set of eyes on your pipeline, reach out.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The real reason your LLM is fast at 11 am and painful at 3 pm You ship a chat feature. Median comes back in 800 ms in staging. 
In prod,&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[17,24,22],"class_list":["post-92","post","type-post","status-publish","format-standard","hentry","category-genai-production","tag-ai-cost","tag-ai-latency","tag-ai-observability"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/92","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=92"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/92\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=92"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=92"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=92"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}