{"id":95,"date":"2025-07-18T14:23:45","date_gmt":"2025-07-18T14:23:45","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/18\/streaming-vs-batching-llm-systems\/"},"modified":"2025-07-18T14:23:45","modified_gmt":"2025-07-18T14:23:45","slug":"streaming-vs-batching-llm-systems","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/07\/18\/streaming-vs-batching-llm-systems\/","title":{"rendered":"Streaming vs batching in LLM systems: how I decide in production"},"content":{"rendered":"<h2>The painful truth about streaming vs batching<\/h2>\n<p>If your chat UI feels snappy in the demo but falls apart under real traffic, you probably picked the wrong side in the streaming vs batching debate. I\u2019ve watched teams light up beautiful token-by-token streams and then get crushed by P95s, GPU underutilization, and random disconnects through their CDN. I\u2019ve also seen teams batch aggressively, brag about throughput, and then watch CSAT fall off a cliff because the first token arrived 1.5 seconds late.<\/p>\n<p>Both camps are often right, just at the wrong layer of the stack.<\/p>\n<h2>Where this shows up and what teams miss<\/h2>\n<ul>\n<li>Interactive chat and copilots where perception is everything<\/li>\n<li>Tool-calling and agents where streaming is mostly noise until the tool result lands<\/li>\n<li>Long-form generation and offline jobs where throughput beats UX<\/li>\n<li>RAG pipelines with big contexts where prefill dominates and streaming doesn\u2019t save you<\/li>\n<\/ul>\n<p>Why this happens in real systems:<\/p>\n<ul>\n<li>LLM serving has two phases with different bottlenecks: prefill vs decode. Prefill is compute-bound work that processes the whole prompt in parallel and builds the KV cache. Decode is memory-bandwidth-bound and batch friendly: each step re-reads weights and cache, so batching amortizes those reads.<\/li>\n<li>Providers already micro-batch requests behind the API. Your client-side streaming might fight their scheduler.<\/li>\n<li>Network paths buffer by default. 
ALBs, CDNs, and reverse proxies quietly defeat \u201creal-time\u201d unless you tune them.<\/li>\n<li>Humans care about time to first useful token, not time to first byte. Those are not the same metric.<\/li>\n<\/ul>\n<p>What most teams misunderstand:<\/p>\n<ul>\n<li>Streaming is not free. It reduces effective batch size during decode and increases tail latencies under load.<\/li>\n<li>Batching is not always cheaper. If it tanks engagement or increases abandonment, you pay more for retries and longer sessions.<\/li>\n<li>Token-by-token streaming is usually a vanity metric. 30-60 ms chunked streaming feels the same to humans and is friendlier to GPUs.<\/li>\n<\/ul>\n<h2>Technical deep dive<\/h2>\n<h3>What actually runs on the GPU<\/h3>\n<p>Every generation request has two phases:<\/p>\n<ul>\n<li>Prefill: process the full prompt and build the KV cache. Attention cost scales roughly with the square of context length; FlashAttention-style kernels avoid materializing the full attention matrix and paged attention tames KV memory, but long prompts still hurt.<\/li>\n<li>Decode: generate tokens step by step using the cache. Highly batchable. Throughput can grow almost linearly with batch size until you hit memory or scheduler limits.<\/li>\n<\/ul>\n<p>Metrics that matter:<\/p>\n<ul>\n<li>TTFT: time to first token. Dominated by prefill and scheduler delays.<\/li>\n<li>TTFMT: time to first meaningful token. A better proxy for perceived UX.<\/li>\n<li>TTLT: time to last token. Dominated by decode and batch scheduling.<\/li>\n<li>Tokens\/sec: measure at batch size 1 and at the batch sizes you actually run.<\/li>\n<\/ul>\n<h3>Schedulers and dynamic batching<\/h3>\n<p>Self-hosted stacks like vLLM, TGI, and TensorRT-LLM do continuous (dynamic) batching \u2013 they coalesce decode steps across requests. 
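A minimal sketch of that coalescing loop, to make the mechanics concrete. All names here are illustrative, not any real serving API: each step runs one batched decode over every active request, finished requests free their slot immediately, and waiting requests join mid-flight.

```python
# Toy sketch of continuous (dynamic) batching. Illustrative only:
# every decode step coalesces all active requests into one batch and
# admits waiting requests up to a cap, so a finished short request
# frees its slot without waiting for the long ones.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one batched forward pass: one token per active request.
    for req in batch:
        req.generated.append(f"tok{len(req.generated)}")

def serve(requests, max_batch=4):
    waiting = deque(requests)
    active, finished, steps = [], [], 0
    while waiting or active:
        # Admit new requests whenever there is batch capacity.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        decode_step(active)  # one shared step for the whole batch
        steps += 1
        still_running = []
        for req in active:
            if len(req.generated) >= req.max_new_tokens:
                finished.append(req)  # slot frees up immediately
            else:
                still_running.append(req)
        active = still_running
    return finished, steps

done, steps = serve([Request(i, max_new_tokens=n)
                     for i, n in enumerate([3, 5, 2, 4, 6, 1])])
```

With six requests of mixed lengths and a batch cap of 4, these requests finish in 8 shared decode steps instead of the 21 a one-at-a-time loop would take, which is where the near-linear throughput gain comes from.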
Speculative decoding helps but changes the balance less than people expect for long contexts.<\/p>\n<p>Key scheduler behaviors:<\/p>\n<ul>\n<li>Batch growth increases throughput but raises TTFT due to queueing and straggler effects.<\/li>\n<li>Mixed-length requests create head-of-line issues. A single long output can stall a batch unless you enable interleaving.<\/li>\n<li>KV cache size drives memory pressure. Higher concurrency with big contexts forces evictions, which kill TTFT.<\/li>\n<\/ul>\n<h3>How streaming changes the queue<\/h3>\n<ul>\n<li>Streaming early and flushing per token keeps requests \u201chot\u201d on the GPU with small per-step batches.<\/li>\n<li>Under bursty traffic, micro-batches shrink and you lose 20-40 percent throughput compared to non-streamed requests or chunked streaming.<\/li>\n<li>More open connections mean more CPU, scheduler state, and egress. Backpressure bugs show up fast.<\/li>\n<\/ul>\n<h3>When batching wins hard<\/h3>\n<ul>\n<li>Long outputs and offline jobs. If you generate 2k+ tokens, batch decode at 8-16 can deliver 2-3x throughput vs streaming.<\/li>\n<li>Homogeneous workloads. Similar prompt sizes and output caps reduce stragglers and boost GPU occupancy.<\/li>\n<li>Internal pipelines. No user waiting, just finish fast and cheap.<\/li>\n<\/ul>\n<h3>Failure modes I keep seeing<\/h3>\n<ul>\n<li>SSE buffering through proxies. Nginx, ALB, and Cloudflare often buffer chunks. Fix with proxy_buffering off, X-Accel-Buffering: no, flush timers, and TCP_NODELAY.<\/li>\n<li>Idle timeouts. Many CDNs cut SSE at 100-120 seconds. Long generations die mid-stream.<\/li>\n<li>CPU tokenization as bottleneck. A tokenizer stuck on a single core caps throughput long before the GPU gets busy.<\/li>\n<li>Cancellation leaks. Users close the tab, but your server keeps generating for 5 seconds because the provider ignores the cancel or your code doesn\u2019t propagate it.<\/li>\n<li>Logprobs and JSON schema slowdowns. They drop decode TPS. 
Teams turn them on for debugging and never turn them off.<\/li>\n<li>RAG with giant contexts. TTFT spikes to 1-3 seconds. Streaming does not help much because the pain is prefill, not decode.<\/li>\n<\/ul>\n<h2>Practical designs that work<\/h2>\n<h3>Quick decision rules<\/h3>\n<ul>\n<li>Interactive chat, output &lt; 300 tokens: stream, but chunk at 30-60 ms, not per token.<\/li>\n<li>Tool calling or multi-step agents: stream status events, not raw tokens, until tool results return. Then chunk-stream.<\/li>\n<li>Long-form writing, reports, batch doc processing: do not stream. Batch for throughput and predictability.<\/li>\n<li>Big-context RAG: invest in prefill reduction before arguing about streaming. Trim context, cache system prompts, use reranking, or use shorter passages.<\/li>\n<\/ul>\n<h3>Implementation patterns<\/h3>\n<p>1) Chunked streaming<br \/>\n&#8211; Flush every 30-60 ms or every 16-32 tokens, whichever comes first.<br \/>\n&#8211; Users perceive it as real-time. GPUs keep healthier batch sizes. This is my default.<\/p>\n<p>2) Non-streamed with progress<br \/>\n&#8211; Send progress events: retrieved docs, validated schema, tool steps, percent complete.<br \/>\n&#8211; Great for 1k+ token outputs or heavy tool chains where text streaming is noise.<\/p>\n<p>3) Hybrid endpoints<br \/>\n&#8211; For a \u201c\/chat\u201d endpoint, stream. For \u201c\/generate_report\u201d, batch. Route at the product layer; one size does not fit all.<\/p>\n<p>4) Admission control and fairness<br \/>\n&#8211; Cap concurrency per tenant. Reject or queue when KV cache crosses safe thresholds.<br \/>\n&#8211; Split pools: prefill pool and decode pool. This keeps TTFT consistent under bursts.<br \/>\n&#8211; Enforce max output tokens server-side. Overlong generations are silent killers.<\/p>\n<p>5) Self-hosting specifics<br \/>\n&#8211; Use vLLM or TensorRT-LLM with interleaved decoding and paged attention.<br \/>\n&#8211; Warm KV caches for shared system prompts. 
Consider prompt caching across sessions.<br \/>\n&#8211; Pin tokenizer threads and profile CPU. Many \u201cGPU problems\u201d are tokenization.<\/p>\n<p>6) Provider usage specifics<br \/>\n&#8211; Providers already micro-batch. Streaming mainly affects UX and network, not compute cost, unless you lower throughput by forcing tiny batches via options.<br \/>\n&#8211; Turn off unnecessary features in prod: logprobs, detailed reasoning tokens, verbose tool traces.<br \/>\n&#8211; Implement real cancellation. Close the HTTP\/2 stream and call cancel on the SDK. Verify with traces that tokens stop quickly.<\/p>\n<p>7) Network hygiene<br \/>\n&#8211; For SSE: disable proxy buffering, set small flush intervals, and add a heartbeat every 10-15 seconds.<br \/>\n&#8211; Watch idle timeouts at the CDN and ALB. Extend or bypass them for long generations.<br \/>\n&#8211; If your traffic is mobile-heavy, consider WebSockets. SSE can be flaky on captive portals and some browsers.<\/p>\n<h3>Observability that saves hours<\/h3>\n<p>Instrument per request:<\/p>\n<ul>\n<li>TTFT, TTFMT, TTLT<\/li>\n<li>Prefill tokens, decode tokens, output tokens<\/li>\n<li>Decode tokens\/sec, both effective end-to-end and as seen at the scheduler<\/li>\n<li>Batch size per decode step, occupancy, straggler rate<\/li>\n<li>Cancellation count and time to cancel<\/li>\n<\/ul>\n<p>Alert on:<\/p>\n<ul>\n<li>TTFT P95 drift<\/li>\n<li>Batch size collapsing under load<\/li>\n<li>KV cache OOM or eviction spikes<\/li>\n<li>CDN idle close rates<\/li>\n<\/ul>\n<h2>Business impact you can plan around<\/h2>\n<p>Numbers I see repeatedly on A100\/H100 class GPUs with 7B-70B models:<\/p>\n<ul>\n<li>Token-by-token streaming vs 30-60 ms chunked: same perceived UX, 10-25 percent better throughput with chunking.<\/li>\n<li>Non-streamed batch decode of 8-16 vs streamed: 1.5-3x throughput increase on long outputs, but TTFT up by 200-800 ms.<\/li>\n<li>Big contexts dominate costs. 
Halving prompt tokens often improves TTFT more than any streaming tweak and reduces GPU memory pressure, enabling larger batches.<\/li>\n<\/ul>\n<p>Costs show up in three buckets:<\/p>\n<ul>\n<li>Compute: smaller effective batches waste GPU. Streaming everywhere can raise unit cost by double-digit percentages under load.<\/li>\n<li>Network and infra: open streams increase egress and connection count. Your ALB and CDN bills rise. Debug time rises too.<\/li>\n<li>Product: better TTFT raises engagement and completion rates. If CSAT and retention drive revenue, spend a bit more on compute to protect the first token.<\/li>\n<\/ul>\n<p>Scaling risks:<\/p>\n<ul>\n<li>Streaming at peak without admission control collapses batch size and spikes P95. Users see jittery streams that restart.<\/li>\n<li>Batching without UX signals kills trust. Users think it is broken and retry, doubling load.<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Stream for UX, batch for throughput. Do not let either policy bleed across all endpoints.<\/li>\n<li>Chunk your streams. 30-60 ms flushes or 16-32 token chunks feel real-time to humans and protect your GPU.<\/li>\n<li>Fix prefill first in RAG. Streaming does nothing if TTFT is bound by context processing.<\/li>\n<li>Separate pools or priorities for prefill and decode. It stabilizes TTFT under load.<\/li>\n<li>Implement hard limits and cancellation. Unbounded outputs and ignored cancels quietly burn money.<\/li>\n<li>Measure TTFMT, not just TTFT. Perception wins.<\/li>\n<li>Expect provider micro-batching. Your knobs may not behave the way they do self-hosted.<\/li>\n<\/ul>\n<h2>If you\u2019re wrestling with this<\/h2>\n<p>If you\u2019re trading off UX and throughput, or your P95s go sideways during traffic spikes, this is exactly the kind of thing I help teams fix when systems start breaking at scale. 
Happy to review your serving stack, scheduler settings, and product flow, and give you a pragmatic plan that survives real users.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The painful truth about streaming vs batching If your chat UI feels snappy in the demo but falls apart under real traffic, you probably picked the wrong side in the&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[17,24,15],"class_list":["post-95","post","type-post","status-publish","format-standard","hentry","category-genai-production","tag-ai-cost","tag-ai-latency","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/95","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=95"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/95\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=95"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=95"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=95"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}