{"id":104,"date":"2025-07-14T10:22:35","date_gmt":"2025-07-14T10:22:35","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/designing-low-latency-ai-systems-real-time\/"},"modified":"2025-07-14T10:22:35","modified_gmt":"2025-07-14T10:22:35","slug":"designing-low-latency-ai-systems-real-time","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/07\/14\/designing-low-latency-ai-systems-real-time\/","title":{"rendered":"Designing low latency AI for real time: what actually works"},"content":{"rendered":"<h2>The real problem with \u201creal time\u201d AI<\/h2>\n<p>Your p50 looks fine. Your users don\u2019t care. They feel the p95.<\/p>\n<p>I\u2019ve walked into teams with a neat demo, then watched the production curve: 250 ms median, 1.8 s p95, random 5\u20137 s spikes. Voice agents turn awkward. Autocomplete drops keystrokes. Checkout flows hang while a model \u201cthinks.\u201d Product swears it\u2019s a model problem. Infra blames the network. It\u2019s actually a system problem.<\/p>\n<p>Real time means you budget every millisecond from the client\u2019s first byte to the last useful token. If you don\u2019t design for tail latency, the system will design it for you.<\/p>\n<h2>Where latency pain shows up (and why)<\/h2>\n<ul>\n<li>Voice assistants: you need sub-120 ms time-to-first-byte and sub-300 ms for a turn to feel natural. Anything slower and users start talking over the bot.<\/li>\n<li>Code completion: &lt;100 ms from keystroke to suggestion. Otherwise the IDE feels sticky.<\/li>\n<li>Customer chat or search: TTFB under 200 ms, partial content streaming, total under 1\u20132 s. If not, CSAT drops and humans pick up the slack.<\/li>\n<li>Inline decisions (risk, routing, personalization): strict 50\u2013150 ms budgets inside request chains. Your LLM is not the main event here.<\/li>\n<\/ul>\n<p>Why this happens in real systems:<\/p>\n<ul>\n<li>LLMs have two latencies: prefill (prompt ingestion) and decode (tokens\/sec). 
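As a back-of-envelope sketch of that split (the function names and the example numbers are illustrative assumptions, not benchmarks):\n<pre><code>def ttfb_ms(prompt_tokens, prefill_ms_per_1k, overhead_ms=0.0):\n    # time-to-first-byte is dominated by prefill on long prompts\n    return overhead_ms + prompt_tokens * prefill_ms_per_1k \/ 1000.0\n\ndef completion_ms(prompt_tokens, output_tokens, prefill_ms_per_1k,\n                  tokens_per_sec, overhead_ms=0.0):\n    # decode adds a flat per-token cost on top of TTFB\n    return (ttfb_ms(prompt_tokens, prefill_ms_per_1k, overhead_ms)\n            + output_tokens * 1000.0 \/ tokens_per_sec)\n\n# e.g. a 2k-token prompt and a 200-token answer at 15 ms\/1k prefill, 60 tok\/s:\n# prefill alone is 30 ms of TTFB; decode adds another ~3.3 s\n<\/code><\/pre>\n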
Long prompts kill TTFB. Slow decode kills completion time.<\/li>\n<li>RAG adds a second stack: vector search, rerankers, feature fetches, and often a tool call.<\/li>\n<li>Model servers batch work for throughput. Under load, queueing and scheduling inflate tail latency.<\/li>\n<li>Network reality: TLS handshakes, cold TCP, cross-AZ hops, NAT gateways, noisy neighbors.<\/li>\n<li>Sprawl: you ship an \u201corchestrated\u201d chain of 7 services, grab-bag SDKs, and think tracing will fix it. It won\u2019t.<\/li>\n<\/ul>\n<p>What teams commonly misunderstand:<\/p>\n<ul>\n<li>Optimizing averages instead of p95\/p99.<\/li>\n<li>Believing streaming alone fixes latency. It hides it. It doesn\u2019t reduce TTFB.<\/li>\n<li>Thinking a bigger model magically \u201cpays for itself.\u201d Smaller, closer, and warmed beats bigger and far.<\/li>\n<li>Assuming vector DBs are free. Poor index choices and chatty clients can add 50\u2013200 ms easily.<\/li>\n<\/ul>\n<h2>A practical latency model you can budget against<\/h2>\n<p>For a typical LLM app with RAG and streaming:<\/p>\n<p>Total latency \u2248 client RTT + gateway + retrieval + rerank + tool calls + model prefill + first-token delay + decode time<\/p>\n<ul>\n<li>Time-to-first-byte (TTFB) \u2248 client RTT + gateway + retrieval + rerank + model prefill + first-token delay<\/li>\n<li>Completion time adds decode for N tokens at tokens\/sec<\/li>\n<\/ul>\n<p>If you don\u2019t explicitly budget each term, something random will.<\/p>\n<h2>Architecture decisions that actually move the needle<\/h2>\n<h3>1) Co-location and transport<\/h3>\n<ul>\n<li>Keep the whole hot path in one AZ. Model server, vector DB, feature store, and orchestrator should not cross-AZ. Cross-region for DR only.<\/li>\n<li>Use HTTP\/2 or gRPC with persistent connections. Preconnect from edge or service mesh. 
Kill cold handshakes in the hot path.<\/li>\n<li>If you must go browser \u2192 server \u2192 model, make the server \u2192 model link long-lived and pinned.<\/li>\n<\/ul>\n<h3>2) Model server behavior<\/h3>\n<ul>\n<li>Choose engines that optimize for interactive latency: vLLM (continuous batching, KV cache management) and TensorRT-LLM are good defaults.<\/li>\n<li>Continuous batching helps throughput but can hurt tails if you allow large max batch delays. Cap batch delay to 5\u201310 ms for interactive traffic.<\/li>\n<li>Keep prompts short. Every extra 1k tokens adds tens to hundreds of ms of prefill, depending on hardware.<\/li>\n<li>Quantization: INT4\/8 can speed up decode on L4\/A10G and reduce cost. Prefill sometimes slows slightly; measure your prompt length and hit rate.<\/li>\n<li>Speculative decoding: small draft model + accept mechanism can net 1.2\u20131.6x decode throughput. Worth it for heavy decode workloads; less useful when TTFB dominates.<\/li>\n<li>Exploit prefix\/KV caching: stable system prompts and instruction preambles turn into real TTFB wins. Keep templates deterministic to raise hit rate.<\/li>\n<\/ul>\n<p>Rough, directionally accurate numbers I\u2019ve seen (don\u2019t cargo-cult):<\/p>\n<ul>\n<li>7B model on L4 INT4: 35\u201370 tokens\/sec, prefill ~10\u201325 ms per 1k tokens<\/li>\n<li>7B on A100 FP16: 70\u2013140 tokens\/sec, prefill ~6\u201312 ms per 1k tokens<\/li>\n<li>13B roughly halves tokens\/sec vs 7B on same gear<\/li>\n<\/ul>\n<h3>3) Retrieval that doesn\u2019t eat your budget<\/h3>\n<ul>\n<li>Use approximate indexes that fit your recall target: HNSW or IVF-PQ. For interactive chat, K=20\u201350 is usually enough if your chunking is sane.<\/li>\n<li>Rerankers are sneaky. A cross-encoder at 3\u20135 ms per pair becomes 100\u2013250 ms at K=50. 
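The arithmetic, with assumed per-pair costs:\n<pre><code>MS_CROSS = 4.0   # assumed cross-encoder cost per pair, ms\nMS_CHEAP = 0.05  # assumed lightweight-filter cost per pair, ms\n\nsingle_tier = 50 * MS_CROSS                  # 200 ms: cross-encode all 50 pairs\ntwo_tier = 200 * MS_CHEAP + 20 * MS_CROSS    # 90 ms: cheap filter to 20, then cross-encode\n<\/code><\/pre>\n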
Options:\n<ul>\n<li>Smaller reranker (MiniLM variants) and cap pairs to 20.<\/li>\n<li>Rerank in two tiers: cheap filter to 200, medium model to 20.<\/li>\n<li>Cache rerank results per (user, topic) where possible.<\/li>\n<\/ul>\n<\/li>\n<li>Keep vector DB and embedding model close to the LLM. One hop.<\/li>\n<li>Batch retrieval calls per request. Don\u2019t n+1 your way through metadata fetches.<\/li>\n<\/ul>\n<h3>4) Tooling and orchestration<\/h3>\n<ul>\n<li>Inline tool calls inside a user request need SLAs. If the tool can\u2019t respond in &lt;80 ms at p95, make it optional with a fast fallback.<\/li>\n<li>Hedged requests for flaky dependencies: send a second request after 30\u201350 ms to a replica. Cap at one hedge to control cost blowups.<\/li>\n<li>Admission control by class: interactive traffic in a priority queue separate from batch jobs. If you share GPUs, use MIG or isolated pools.<\/li>\n<li>Circuit breakers short-circuit slow branches and send a degraded answer with an apology line. Users prefer fast and slightly less complete.<\/li>\n<\/ul>\n<h3>5) Prompt and output shaping<\/h3>\n<ul>\n<li>Shorten system prompts and suppress preambles. \u201cAnswer directly, no lead-in\u201d removes 50\u2013150 useless tokens and improves decode speed.<\/li>\n<li>Stream immediately. Don\u2019t wait for sentences. Use SSE or gRPC streaming. Clients should render partial tokens.<\/li>\n<li>For voice, do partial ASR and TTS streaming. You can start speaking while the LLM finishes the tail.<\/li>\n<\/ul>\n<h3>6) Caching where it counts<\/h3>\n<ul>\n<li>Prompt prefix cache: exact-match KV cache. Works great when your instruction and policy prompts are stable.<\/li>\n<li>Retrieval cache: hash queries + user segment \u2192 top-k doc IDs. TTL in minutes to hours depending on domain.<\/li>\n<li>Semantic response cache: embed the normalized query and ANN-match previous answers within domain bounds. 
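A minimal sketch, assuming you already have an embedding for the normalized query (the linear scan stands in for a real ANN index, and the threshold is a tunable assumption):\n<pre><code>import math\n\ndef cosine(a, b):\n    dot = sum(x * y for x, y in zip(a, b))\n    return dot \/ (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))\n\ndef cache_lookup(query_vec, cache, threshold=0.92):\n    # cache: list of (embedding, cached_answer) pairs\n    best = max(cache, key=lambda entry: cosine(query_vec, entry[0]), default=None)\n    if best is not None and cosine(query_vec, best[0]) >= threshold:\n        return best[1]\n    return None  # miss: fall through to retrieval + model\n<\/code><\/pre>\n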
Good hit rates for support and search.<\/li>\n<li>Don\u2019t cache personally identifiable content unless you\u2019ve segmented properly.<\/li>\n<\/ul>\n<h3>7) Observability for latency, not vibes<\/h3>\n<p>Track per-stage spans:<\/p>\n<ul>\n<li>Client RTT, gateway time<\/li>\n<li>Retrieval time, rerank pairs and ms<\/li>\n<li>Model queue delay, prefill time, first-token delay, tokens\/sec decode<\/li>\n<li>Tool calls with p50\/p95\/p99<\/li>\n<\/ul>\n<p>Set SLOs at p95, not p50. If you can\u2019t see queueing vs compute time inside the model server, fix that first.<\/p>\n<h2>Latency budgets that work in practice<\/h2>\n<p>Here are concrete budgets I\u2019ve used to hit real-time UX targets.<\/p>\n<h3>Voice assistant (turn-taking)<\/h3>\n<ul>\n<li>ASR: on-device or edge, partial hypotheses within 50\u201380 ms<\/li>\n<li>NLU\/LLM: TTFB &lt;120 ms; prefer a 1\u20137B model, prompt under 800 tokens<\/li>\n<li>Tool calls: only if p95 &lt;60 ms, otherwise defer to next turn<\/li>\n<li>TTS: start streaming within 80\u2013120 ms; don\u2019t wait for full text<\/li>\n<\/ul>\n<p>Design note: keep ASR\/LLM\/TTS in one region, low jitter network. 
If you can\u2019t, push ASR\/TTS to edge and keep the LLM central with short prompts.<\/p>\n<h3>Code completion<\/h3>\n<ul>\n<li>Pre-trigger on pause and before newline to hide latency<\/li>\n<li>1\u20137B model, quantized, context under 1k tokens<\/li>\n<li>Total budget &lt;100 ms for top-1; stream top-3 in 150\u2013180 ms<\/li>\n<li>Cache by file, project, and prefix length; deduplicate boilerplate prompts<\/li>\n<\/ul>\n<h3>Support chat with RAG<\/h3>\n<ul>\n<li>Retrieval + rerank budget: \u226470 ms p95<\/li>\n<li>Model TTFB \u2264150 ms; start streaming immediately<\/li>\n<li>Total useful answer visible &lt;600\u2013900 ms, complete in 1.5\u20132.5 s<\/li>\n<li>Use a small reranker and keep K small; precompute FAQs and high-traffic intents<\/li>\n<\/ul>\n<h3>Inline decisions (risk, policy)<\/h3>\n<ul>\n<li>Don\u2019t use an LLM if a classifier works. 10\u201320 ms p95 on CPU beats everything<\/li>\n<li>If you must, distill to a small 1\u20133B model, prompt under 300 tokens<\/li>\n<li>Synchronous budget: 50\u2013150 ms p95 end-to-end<\/li>\n<\/ul>\n<h2>Common failure modes I still see<\/h2>\n<ul>\n<li>Multi-hop tool chains that add 400\u2013800 ms and fail half the time<\/li>\n<li>Shared GPU pools where batch jobs starve interactive traffic<\/li>\n<li>Vector DB in a different AZ because \u201cit\u2019s cheaper\u201d<\/li>\n<li>Third-party model endpoints across regions with no hedging or preconnect<\/li>\n<li>Rerankers run on CPU inside a container with no CPU pinning; tails go wild<\/li>\n<\/ul>\n<h2>Concrete fixes in priority order<\/h2>\n<p>1) Put everything hot in one AZ and pin connections. Measure again.<br \/>\n2) Shorten prompts. If your system prompt is longer than the user\u2019s question, you\u2019re burning TTFB.<br \/>\n3) Cap batch delay in the model server. Interactive pool with 5\u201310 ms max delay. 
Keep batch sizes reasonable.<br \/>\n4) Adopt a latency budget per use case and enforce it in code with timeouts and fallbacks.<br \/>\n5) Replace heavy rerankers with lighter ones and reduce K.<br \/>\n6) Add KV prefix caching and reuse stable templates.<br \/>\n7) Add QoS queues. Separate batch vs interactive. MIG if sharing GPUs.<br \/>\n8) Turn on speculative decoding if your workload is decode-heavy and your infra can afford the draft model.<br \/>\n9) Add hedged requests for flaky dependencies only. One hedge max.<br \/>\n10) Stream everything to the client and shape output to be useful early.<\/p>\n<h2>Hardware and model choices that respect latency<\/h2>\n<ul>\n<li>Hardware: L4s are cost efficient for 7\u201313B quantized and interactive loads. A100\/H100 help with heavier decode and larger prompts. If tails matter more than cost, overprovision a bit and isolate tenants.<\/li>\n<li>MIG or node-level isolation beats perfect utilization. You\u2019re buying lower p95s.<\/li>\n<li>Models: prefer small, fine-tuned models for real time. Use a quality ladder: serve small model fast, escalate to bigger model when confidence is low, optionally async.<\/li>\n<\/ul>\n<h2>Business impact<\/h2>\n<ul>\n<li>Latency is conversion. Cutting p95 from 1.8 s to 700 ms in a support chat reduced human handoffs by 12\u201320% on one team, which paid for dedicated GPUs.<\/li>\n<li>Tail latency inflates cost. Timeouts trigger retries, duplicate tokens, and angry users who re-ask the same thing.<\/li>\n<li>Hardware spend vs engineering time: a single region-local L4 cluster with good scheduling can outperform third-party endpoints riddled with cross-region hops, at a lower effective cost per successful interaction.<\/li>\n<li>Caching is a margin lever. I\u2019ve seen 15\u201335% token reduction with stable prompts, retrieval caching, and semantic caches, while also shaving TTFB.<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Budget latency per stage. 
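One concrete way to make the per-stage budget enforceable is a shared deadline passed down the hot path, so each stage times out against whatever remains (a sketch, not a framework):\n<pre><code>import time\n\nclass Budget:\n    def __init__(self, total_ms):\n        self.deadline = time.monotonic() + total_ms \/ 1000.0\n\n    def remaining_ms(self):\n        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)\n\n    def timeout_ms(self, stage_cap_ms):\n        # a stage never gets more than its own cap or what is left overall\n        return min(stage_cap_ms, self.remaining_ms())\n<\/code><\/pre>\n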
TTFB is a first-class metric.<\/li>\n<li>Put the hot path in one AZ and pin long-lived connections.<\/li>\n<li>Keep prompts short. KV cache and prefix reuse are your friends.<\/li>\n<li>RAG can be fast, but only if retrieval + rerank stays under ~70 ms.<\/li>\n<li>Separate interactive from batch with QoS and, if needed, MIG.<\/li>\n<li>Stream early, shape answers to be useful in the first 200 ms.<\/li>\n<li>Measure p95\/p99 per stage, not just end-to-end.<\/li>\n<li>Small, tuned models win most real-time use cases. Escalate only when needed.<\/li>\n<\/ul>\n<h2>If this sounds familiar<\/h2>\n<p>If you\u2019re staring at decent medians but angry users, or your stack became a Jenga tower of calls and caches, I can help. I work with teams to set latency budgets, fix tail behavior, and redesign hot paths so the system feels instant without blowing up costs. This is exactly the kind of thing I help teams fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The real problem with \u201creal time\u201d AI Your p50 looks fine. Your users don\u2019t care. They feel the p95. 
I\u2019ve walked into teams with a neat demo, then watched the&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[20,24,15],"class_list":["post-104","post","type-post","status-publish","format-standard","hentry","category-genai-production","tag-ai-infra","tag-ai-latency","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/104","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=104"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/104\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=104"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=104"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=104"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}