Designing low-latency AI for real time: what actually works

The real problem with “real time” AI

Your p50 looks fine. Your users don’t care. They feel the p95.

I’ve walked into teams with a neat demo, then watched the production curve: 250 ms median, 1.8 s p95, random 5–7 s spikes. Voice agents turn awkward. Autocomplete drops keystrokes. Checkout flows hang while a model “thinks.” Product swears it’s a model problem. Infra blames the network. It’s actually a system problem.

Real time means you budget every millisecond from the client’s first byte to the last useful token. If you don’t design for tail latency, the system will design it for you.

Where latency pain shows up (and why)

  • Voice assistants: you need sub-120 ms time-to-first-byte and sub-300 ms for a turn to feel natural. Anything slower and users start talking over the bot.
  • Code completion: <100 ms from keystroke to suggestion. Otherwise the IDE feels sticky.
  • Customer chat or search: TTFB under 200 ms, partial content streaming, total under 1–2 s. If not, CSAT drops and humans pick up the slack.
  • Inline decisions (risk, routing, personalization): strict 50–150 ms budgets inside request chains. Your LLM is not the main event here.

Why this happens in real systems:

  • LLMs have two latencies: prefill (prompt ingestion) and decode (tokens/sec). Long prompts kill TTFB. Slow decode kills completion time.
  • RAG adds a second stack: vector search, rerankers, feature fetches, and often a tool call.
  • Model servers batch work for throughput. Under load, queueing and scheduling inflate tail latency.
  • Network reality: TLS handshakes, cold TCP, cross-AZ hops, NAT gateways, noisy neighbors.
  • Sprawl: you ship an “orchestrated” chain of 7 services and a grab-bag of SDKs, and think tracing will fix it. It won’t.

What teams commonly misunderstand:

  • Optimizing averages instead of p95/p99.
  • Believing streaming alone fixes latency. It hides it. It doesn’t reduce TTFB.
  • Thinking a bigger model magically “pays for itself.” Smaller, closer, and warmed beats bigger and far.
  • Assuming vector DBs are free. Poor index choices and chatty clients can add 50–200 ms easily.

A practical latency model you can budget against

For a typical LLM app with RAG and streaming:

Total latency ≈ client RTT + gateway + retrieval + rerank + tool calls + model prefill + first-token delay + decode time

  • Time-to-first-byte (TTFB) ≈ client RTT + gateway + retrieval + rerank + model prefill + first-token delay
  • Completion time adds decode for N tokens at tokens/sec

If you don’t explicitly budget each term, something random will.
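The additive model above is trivial to encode, which makes it easy to sanity-check a proposed budget before you build anything. A minimal sketch (all numbers below are illustrative placeholders, not measurements):

```python
# Hedged sketch: the additive latency model from above, as two helpers.

def ttfb_ms(client_rtt, gateway, retrieval, rerank, prefill, first_token):
    """Time-to-first-byte: everything that happens before the first streamed token."""
    return client_rtt + gateway + retrieval + rerank + prefill + first_token

def completion_ms(ttfb, n_tokens, tokens_per_sec):
    """Total time: TTFB plus decode for n_tokens at tokens_per_sec."""
    return ttfb + 1000.0 * n_tokens / tokens_per_sec

# Example budget: 20 ms RTT, 5 ms gateway, 40 ms retrieval, 25 ms rerank,
# 30 ms prefill, 10 ms first-token delay, then 200 tokens at 80 tok/s.
first_byte = ttfb_ms(20, 5, 40, 25, 30, 10)   # 130 ms
total = completion_ms(first_byte, 200, 80)    # 130 + 2500 = 2630 ms
```

Run this against your SLO before writing code: if the placeholder numbers already blow the budget, no amount of tuning downstream will save you.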

Architecture decisions that actually move the needle

1) Co-location and transport

  • Keep the whole hot path in one AZ. Model server, vector DB, feature store, and orchestrator should not talk across AZs. Go cross-region for DR only.
  • Use HTTP/2 or gRPC with persistent connections. Preconnect from edge or service mesh. Kill cold handshakes in the hot path.
  • If you must go browser → server → model, make the server → model link long-lived and pinned.

2) Model server behavior

  • Choose engines that optimize for interactive latency: vLLM (continuous batching, KV cache management) and TensorRT-LLM are good defaults.
  • Continuous batching helps throughput but can hurt tails if you allow large max batch delays. Cap batch delay to 5–10 ms for interactive traffic.
  • Keep prompts short. Every extra 1k tokens adds tens to hundreds of ms of prefill, depending on hardware.
  • Quantization: INT4/8 can speed up decode on L4/A10G and reduce cost. Prefill sometimes slows slightly; measure your prompt length and hit rate.
  • Speculative decoding: small draft model + accept mechanism can net 1.2–1.6x decode throughput. Worth it for heavy decode workloads; less useful when TTFB dominates.
  • Exploit prefix/KV caching: stable system prompts and instruction preambles turn into real TTFB wins. Keep templates deterministic to raise hit rate.
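The batch-delay cap is the detail teams most often miss. Here is a generic sketch of the idea, not any specific engine's API; `MAX_DELAY_S` and `MAX_BATCH` are illustrative knobs you would tune per hardware:

```python
# Hedged sketch: cap how long a request waits for batch-mates.
# The first request defines the deadline; we stop collecting at the cap.
import queue
import time

MAX_DELAY_S = 0.008   # 8 ms cap for interactive traffic (assumption)
MAX_BATCH = 8         # illustrative batch-size ceiling

def collect_batch(q: "queue.Queue") -> list:
    """Block for one request, then add more only until the delay cap expires."""
    batch = [q.get()]
    deadline = time.monotonic() + MAX_DELAY_S
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The point of the cap: under light load a request waits at most 8 ms for company; under heavy load the batch fills instantly and the cap never triggers. Production engines implement this inside their schedulers, but the knob is the same.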

Rough, directionally accurate numbers I’ve seen (don’t cargo-cult):

  • 7B model on L4 INT4: 35–70 tokens/sec, prefill ~10–25 ms per 1k tokens
  • 7B on A100 FP16: 70–140 tokens/sec, prefill ~6–12 ms per 1k tokens
  • 13B roughly halves tokens/sec vs 7B on same gear

3) Retrieval that doesn’t eat your budget

  • Use approximate indexes that fit your recall target: HNSW or IVF-PQ. For interactive chat, K=20–50 is usually enough if your chunking is sane.
  • Rerankers are sneaky. A cross-encoder at 3–5 ms per pair becomes 100–250 ms at K=50. Options:
    • Smaller reranker (MiniLM variants) and cap pairs to 20.
    • Rerank in two tiers: cheap filter to 200, medium model to 20.
    • Cache rerank results per (user, topic) where possible.
  • Keep vector DB and embedding model close to the LLM. One hop.
  • Batch retrieval calls per request. Don’t n+1 your way through metadata fetches.
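The two-tier rerank option is worth spelling out, since it is the usual way to keep cross-encoder cost bounded. A minimal sketch; `cheap_score` and `strong_score` are stand-ins for a real bi-encoder and cross-encoder (assumptions, not a specific library):

```python
# Hedged sketch of two-tier reranking: a cheap scorer filters candidates,
# and the expensive scorer runs only on the survivors.

def two_tier_rerank(query, candidates, cheap_score, strong_score,
                    tier1_k=200, final_k=20):
    # Tier 1: cheap filter (e.g. BM25 or bi-encoder dot product)
    tier1 = sorted(candidates, key=lambda d: cheap_score(query, d),
                   reverse=True)[:tier1_k]
    # Tier 2: pay the cross-encoder price only for tier1_k pairs, not all
    return sorted(tier1, key=lambda d: strong_score(query, d),
                  reverse=True)[:final_k]
```

At 3–5 ms per cross-encoder pair, the difference between scoring all candidates and scoring only 20 survivors is exactly the 100–250 ms this section warns about.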

4) Tooling and orchestration

  • Inline tool calls inside a user request need SLAs. If the tool can’t respond in <80 ms at p95, make it optional with a fast fallback.
  • Hedged requests for flaky dependencies: send a second request after 30–50 ms to a replica. Cap at one hedge to control cost blowups.
  • Admission control by class: interactive traffic in a priority queue separate from batch jobs. If you share GPUs, use MIG or isolated pools.
  • Circuit breakers short-circuit slow branches and send a degraded answer with an apology line. Users prefer fast and slightly less complete.
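The hedged-request pattern fits in a few lines of asyncio. This is a sketch under assumptions: `call_primary` and `call_replica` are hypothetical zero-argument coroutine factories for the same request against two endpoints, and the single hedge is enforced by construction:

```python
# Hedged sketch: fire one backup request if the primary is slow,
# take whichever answers first, cancel the loser. One hedge max.
import asyncio

async def hedged(call_primary, call_replica, hedge_delay=0.04):
    tasks = [asyncio.create_task(call_primary())]
    done, pending = await asyncio.wait(tasks, timeout=hedge_delay)
    if not done:  # primary missed the hedge deadline: launch exactly one hedge
        tasks.append(asyncio.create_task(call_replica()))
        done, pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()  # don't pay for the loser
    return done.pop().result()
```

Note the asymmetry: a fast primary costs you nothing extra, and a slow primary costs you one duplicate request, not a retry storm.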

5) Prompt and output shaping

  • Shorten system prompts and suppress preambles. “Answer directly, no lead-in” removes 50–150 useless tokens and improves decode speed.
  • Stream immediately. Don’t wait for sentences. Use SSE or gRPC streaming. Clients should render partial tokens.
  • For voice, do partial ASR and TTS streaming. You can start speaking while the LLM finishes the tail.
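"Stream immediately" concretely means: frame each token the moment it arrives, never buffer up to sentence boundaries. A framework-agnostic sketch of SSE framing (the `[DONE]` sentinel is a common convention, not a standard):

```python
# Hedged sketch: wrap any token iterator as Server-Sent Events frames.
# Any WSGI/ASGI streaming response can iterate this generator directly.

def sse_stream(token_iter):
    """Yield each token as its own 'data:' frame; no sentence buffering."""
    for tok in token_iter:
        yield f"data: {tok}\n\n"   # one frame per token, flushed immediately
    yield "data: [DONE]\n\n"       # conventional end-of-stream sentinel
```

The client renders partials as frames arrive, so perceived latency tracks TTFB rather than completion time.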

6) Caching where it counts

  • Prompt prefix cache: exact-match KV cache. Works great when your instruction and policy prompts are stable.
  • Retrieval cache: hash queries + user segment → top-k doc IDs. TTL in minutes to hours depending on domain.
  • Semantic response cache: embed the normalized query and ANN-match previous answers within domain bounds. Good hit rates for support and search.
  • Don’t cache personally identifiable content unless you’ve segmented properly.
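The semantic response cache is the least obvious of these, so here is a minimal sketch. `embed` is a stand-in for a real embedding model, the 0.92 cosine threshold is an illustrative starting point, and a production version would use an ANN index instead of a linear scan:

```python
# Hedged sketch of a semantic response cache: normalize, embed,
# and return a cached answer if a prior query is close enough.
import math

class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (vector, answer); use an ANN index at scale

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        v = self.embed(query.strip().lower())  # normalize before embedding
        for vec, answer in self.entries:
            if self._cos(v, vec) >= self.threshold:
                return answer
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query.strip().lower()), answer))
```

Scope the cache per user segment (per the PII caveat above) and tune the threshold against your domain: too loose and you serve wrong answers fast, which is worse than slow.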

7) Observability for latency, not vibes

Track per-stage spans:

  • Client RTT, gateway time
  • Retrieval time, rerank pairs and ms
  • Model queue delay, prefill time, first-token delay, tokens/sec decode
  • Tool calls with p50/p95/p99

Set SLOs at p95, not p50. If you can’t see queueing vs compute time inside the model server, fix that first.
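If your tracing setup can't answer these questions yet, even a hand-rolled per-stage timer beats end-to-end-only numbers. A minimal sketch (stage names are illustrative):

```python
# Hedged sketch: a context manager that records per-stage wall time,
# so retrieval vs prefill vs decode show up as separate numbers.
import time
from contextlib import contextmanager

@contextmanager
def span(name, timings):
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name] = (time.monotonic() - start) * 1000.0  # ms

timings = {}
with span("retrieval", timings):
    time.sleep(0.01)   # stand-in for a vector search
with span("prefill", timings):
    pass               # stand-in for prompt ingestion
```

Aggregate these per-stage numbers into p50/p95/p99 histograms; the stage whose p95 grows fastest under load is almost always queueing, not compute.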

Latency budgets that work in practice

Here are concrete budgets I’ve used to hit real-time UX targets.

Voice assistant (turn-taking)

  • ASR: on-device or edge, partial hypotheses within 50–80 ms
  • NLU/LLM: TTFB <120 ms; prefer a 1–7B model, prompt under 800 tokens
  • Tool calls: only if p95 <60 ms, otherwise defer to next turn
  • TTS: start streaming within 80–120 ms; don’t wait for full text

Design note: keep ASR/LLM/TTS in one region on a low-jitter network. If you can’t, push ASR/TTS to the edge and keep the LLM central with short prompts.

Code completion

  • Pre-trigger on pause and before newline to hide latency
  • 1–7B model, quantized, context under 1k tokens
  • Total budget <100 ms for top-1; stream top-3 in 150–180 ms
  • Cache by file, project, and prefix length; deduplicate boilerplate prompts

Support chat with RAG

  • Retrieval + rerank budget: ≤70 ms p95
  • Model TTFB ≤150 ms; start streaming immediately
  • Total useful answer visible <600–900 ms, complete in 1.5–2.5 s
  • Use a small reranker and keep K small; precompute FAQs and high-traffic intents

Inline decisions (risk, policy)

  • Don’t use an LLM if a classifier works. 10–20 ms p95 on CPU beats everything
  • If you must, distill to a small 1–3B model, prompt under 300 tokens
  • Synchronous budget: 50–150 ms p95 end-to-end

Common failure modes I still see

  • Multi-hop tool chains that add 400–800 ms and fail half the time
  • Shared GPU pools where batch jobs starve interactive traffic
  • Vector DB in a different AZ because “it’s cheaper”
  • 3rd-party model endpoints across regions with no hedging or preconnect
  • Rerankers run on CPU inside a container with no CPU pinning; tails go wild

Concrete fixes in priority order

1) Put everything hot in one AZ and pin connections. Measure again.
2) Shorten prompts. If your system prompt is longer than the user’s question, you’re burning TTFB.
3) Cap batch delay in the model server. Interactive pool with 5–10 ms max delay. Keep batch sizes reasonable.
4) Adopt a latency budget per use case and enforce it in code with timeouts and fallbacks.
5) Replace heavy rerankers with lighter ones and reduce K.
6) Add KV prefix caching and reuse stable templates.
7) Add QoS queues. Separate batch vs interactive. MIG if sharing GPUs.
8) Turn on speculative decoding if your workload is decode-heavy and your infra can afford the draft model.
9) Add hedged requests for flaky dependencies only. One hedge max.
10) Stream everything to the client and shape output to be useful early.
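Fix 4 ("enforce it in code with timeouts and fallbacks") deserves a concrete shape. A sketch under assumptions: the budget and the degraded fallback are per-use-case values you define, and `asyncio.wait_for` does the enforcement:

```python
# Hedged sketch of fix 4: run a stage under an explicit budget and
# degrade gracefully instead of letting a slow branch blow the SLO.
import asyncio

async def with_budget(coro, budget_s, fallback):
    """Await coro; if it blows its budget, cancel it and return the fallback."""
    try:
        return await asyncio.wait_for(coro, timeout=budget_s)
    except asyncio.TimeoutError:
        return fallback
```

Wrap every optional stage (reranker, tool call, enrichment) this way; the budget then lives in code where it fails fast, not in a wiki where it fails silently.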

Hardware and model choices that respect latency

  • Hardware: L4s are cost efficient for 7–13B quantized and interactive loads. A100/H100 help with heavier decode and larger prompts. If tails matter more than cost, overprovision a bit and isolate tenants.
  • MIG or node-level isolation beats perfect utilization. You’re buying lower p95s.
  • Models: prefer small, fine-tuned models for real time. Use a quality ladder: serve small model fast, escalate to bigger model when confidence is low, optionally async.

Business impact

  • Latency is conversion. Cutting p95 from 1.8 s to 700 ms in a support chat reduced human handoffs by 12–20% on one team, which paid for dedicated GPUs.
  • Tail latency inflates cost. Timeouts trigger retries, duplicate tokens, and angry users who re-ask the same thing.
  • Hardware spend vs engineering time: a single region-local L4 cluster with good scheduling can outperform third-party endpoints riddled with cross-region hops, at a lower effective cost per successful interaction.
  • Caching is a margin lever. I’ve seen 15–35% token reduction with stable prompts, retrieval caching, and semantic caches, while also shaving TTFB.

Key takeaways

  • Budget latency per stage. TTFB is a first-class metric.
  • Put the hot path in one AZ and pin long-lived connections.
  • Keep prompts short. KV cache and prefix reuse are your friends.
  • RAG can be fast, but only if retrieval + rerank stays under ~70 ms.
  • Separate interactive from batch with QoS and, if needed, MIG.
  • Stream early, shape answers to be useful in the first 200 ms.
  • Measure p95/p99 per stage, not just end-to-end.
  • Small, tuned models win most real-time use cases. Escalate only when needed.

If this sounds familiar

If you’re staring at decent medians but angry users, or your stack has become a Jenga tower of calls and caches, I can help. I work with teams to set latency budgets, fix tail behavior, and redesign hot paths so the system feels instant without blowing up costs.