Batching vs Streaming Tradeoffs
LLM serving has two terms that get confused constantly: batching and streaming. They sit at different layers and solve different problems. Batching is how the inference server packs requests onto the GPU to use it efficiently. Streaming is how the server delivers tokens to the client over the network as they're produced. You can have either, both, or neither. The throughput-vs-latency tradeoff happens almost entirely on the batching side; streaming changes only the wire-level delivery of tokens already generated.
Three batching strategies
| Strategy | How it works | Throughput | Per-request latency |
|---|---|---|---|
| Static batching | Wait for a fixed batch size, run the whole batch, then return | High when batch fills quickly | Poor: early arrivals wait for the batch to fill, and every request waits for the longest one in the batch to finish |
| Dynamic batching | Wait up to N milliseconds for a batch to fill, then run | Tunable | Bounded queueing delay (the timeout) |
| Continuous batching | Requests join the running batch at each decode step; finished requests leave | Highest | Lowest — no waiting for batch boundaries |
Static batching is mostly historical. Dynamic batching is still common for shorter-context services. Continuous batching is the modern default — vLLM, TGI, and most production inference servers implement it.
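As a rough illustration of the dynamic-batching rule in the table, the sketch below groups request arrival times into batches. The function name `dynamic_batches` and its parameters are hypothetical, and it models only the grouping decision, not GPU execution.

```python
def dynamic_batches(arrivals_ms, max_batch, max_wait_ms):
    """Group sorted arrival times (ms) into batches.

    A batch closes when it holds max_batch requests, or when a new arrival
    falls outside the wait window opened by the batch's first request.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        # The open batch timed out before this request arrived: dispatch it.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch it without waiting for the timeout.
        if len(current) == max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


print(dynamic_batches([0, 1, 2, 3, 50, 51, 120], max_batch=4, max_wait_ms=10))
# [[0, 1, 2, 3], [50, 51], [120]]
```

The first batch fills and runs immediately; the other two are dispatched by the timeout, so no request queues for longer than `max_wait_ms` before its batch starts.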
Why batching improves throughput
During decode, generating one new token requires reading the model weights and the request's KV cache from GPU memory. The compute is tiny; the memory read dominates. Batching amortizes the weight read across many concurrent token generations: the same weights brought into the compute units feed N parallel decodes instead of one. Each request's KV cache still has to be read separately, which is one reason the gain is not linear.
The throughput gain is sublinear: going from batch 1 to batch 32 might give 6-10x throughput, not 32x, because some operations still scale per-request. But for a service trying to maximize throughput per dollar, batching is the single largest lever.
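A back-of-the-envelope model makes the sublinear scaling concrete. The sketch below assumes decode is purely memory-bound, that the weights are read once per step, and that each request adds its own KV-cache read. The sizes (14 GB of weights, 1.5 GB of KV cache per request, 2000 GB/s of bandwidth) are illustrative assumptions, not measurements.

```python
def decode_step_seconds(batch, weight_gb, kv_gb_per_request, bandwidth_gb_s):
    """Memory-bound estimate of one decode step for the whole batch."""
    # Weights are read once per step; KV cache is read once per request.
    gb_read = weight_gb + batch * kv_gb_per_request
    return gb_read / bandwidth_gb_s


def tokens_per_second(batch, weight_gb, kv_gb_per_request, bandwidth_gb_s):
    """Each step emits one token per request in the batch."""
    step = decode_step_seconds(batch, weight_gb, kv_gb_per_request, bandwidth_gb_s)
    return batch / step


# Illustrative, assumed hardware and model numbers.
HW = dict(weight_gb=14.0, kv_gb_per_request=1.5, bandwidth_gb_s=2000.0)

speedup = tokens_per_second(32, **HW) / tokens_per_second(1, **HW)
print(f"batch 32 vs batch 1: {speedup:.1f}x")
```

Under these assumptions the batch of 32 comes out about 8x faster than batch 1, not 32x, because the per-request KV-cache reads grow with the batch while only the weight read is shared.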
Continuous batching in detail
A continuous-batching server runs the GPU in a tight loop:
- At iteration t, the active set is some collection of requests in various states (some still in prefill, some in decode, some about to finish).
- The server schedules one forward pass that handles all active work. Each request advances by one token-equivalent of compute.
- Requests that finished generation are removed from the active set; new requests waiting in queue join.
- Loop.
The effect: GPU utilization stays high (the batch is usually close to full), per-request latency stays low (no waiting for batch boundaries), and the server can absorb bursty workloads without dropping requests.
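The loop above can be sketched as a small simulation. This is a toy model under stated assumptions: prefill is ignored, every active request produces exactly one token per iteration, each request generates at least one token, and the names `run_continuous_batching` and `max_active` are hypothetical.

```python
from collections import deque


def run_continuous_batching(requests, max_active):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate), in arrival order.
    Returns {request_id: iteration at which the request finished}.
    """
    waiting = deque(requests)
    active = {}        # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Admit queued requests into free slots before the next forward pass.
        while waiting and len(active) < max_active:
            rid, n_tokens = waiting.popleft()
            active[rid] = n_tokens
        step += 1
        # One forward pass: every active request advances by one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                # Finished requests leave immediately, freeing their slot.
                del active[rid]
                finished_at[rid] = step
    return finished_at


print(run_continuous_batching([("a", 2), ("b", 5), ("c", 3)], max_active=2))
# {'a': 2, 'b': 5, 'c': 5}
```

Request `c` takes over the slot that `a` frees after iteration 2 and finishes at iteration 5. With a static batch of two, `c` would wait for `b` to finish and complete at iteration 8.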
The batch-size sweet spot
Bigger batches improve throughput but consume more GPU memory (more KV caches in flight). The practical batch size for a given model on a given GPU is bounded by memory:
max_batch_size ≈ available_kv_cache_memory / per_request_kv_cache_size
For long-context workloads, per-request KV cache is large and max batch size is small. For short-context workloads, batch sizes of 32-128 are common. PagedAttention and similar techniques allow more efficient memory usage by allocating KV cache in small pages, increasing achievable batch size for variable-length workloads.
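The memory bound can be worked through numerically. The sketch below uses the standard per-token KV-cache size for a transformer (one key and one value vector per layer per KV head). The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) and the 40 GiB budget are assumptions chosen for illustration, and the calculation reserves the full context length per request, which paged allocation would avoid.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes stored for one token of one request."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_batch_size(available_kv_bytes, context_tokens, **model_shape):
    """Upper bound on concurrent requests when each reserves a full context."""
    per_request = context_tokens * kv_bytes_per_token(**model_shape)
    return available_kv_bytes // per_request


# Assumed, illustrative model shape and memory budget.
SHAPE = dict(num_layers=32, num_kv_heads=32, head_dim=128)
BUDGET = 40 * 2**30  # 40 GiB left for KV cache after loading weights

print(max_batch_size(BUDGET, context_tokens=4096, **SHAPE))  # 20
print(max_batch_size(BUDGET, context_tokens=512, **SHAPE))   # 160
```

With these numbers each token costs 512 KiB of cache, so an 8x longer context cuts the achievable batch size by 8x.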
Streaming on the wire
Once a token is generated, it can be sent to the client immediately or accumulated until the full response is ready. Streaming choices:
- SSE (Server-Sent Events) — unidirectional, HTTP-based, the de facto LLM streaming standard. Works through CDNs and proxies.
- WebSocket — bidirectional, useful when client needs to send cancel signals or function-call results mid-stream.
- gRPC streaming — for service-to-service communication; not generally exposed to browsers.
- Chunked HTTP without SSE framing — simplest but no event-level structure.
For full coverage, see streaming LLM responses.
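To show what SSE framing looks like on the wire, here is a minimal sketch that turns a token sequence into SSE messages. Each message is a `data:` line terminated by a blank line. The JSON payload shape and the final `[DONE]` sentinel are assumptions borrowed from common LLM API practice, not part of the SSE format itself.

```python
import json


def sse_event(payload):
    """Frame one JSON-serializable payload as a Server-Sent Events message."""
    # json.dumps escapes newlines, so the payload always fits on one data: line.
    return f"data: {json.dumps(payload)}\n\n"


def stream_tokens(tokens):
    """Yield one SSE message per token, then an end-of-stream sentinel."""
    for index, token in enumerate(tokens):
        yield sse_event({"index": index, "token": token})
    # Assumed convention: a literal [DONE] marker tells the client to stop reading.
    yield "data: [DONE]\n\n"


for message in stream_tokens(["Hello", ",", " world"]):
    print(message, end="")
```

A web framework would write each yielded string to the response body with `Content-Type: text/event-stream`; that part is left out here to keep the sketch framework-neutral.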
Combining batching and streaming
The two coexist trivially. The server runs continuous batching internally. For each request in the batch, as new tokens are produced, the server writes them to that request's HTTP response stream. From the client's perspective, the response streams smoothly; from the server's perspective, GPU time is fully utilized across many concurrent streams.
This is the standard production setup: continuous batching on the GPU, SSE streaming on the wire, one HTTP request per user inference call.
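A minimal asyncio sketch of that arrangement follows, assuming one shared loop that produces a token per active request per iteration and one queue per response stream. `batch_loop` and `stream_response` are hypothetical stand-ins for the inference engine and the HTTP writers, and the tokens are supplied up front rather than generated.

```python
import asyncio


async def batch_loop(requests, queues):
    """Stand-in for the GPU loop: each iteration emits one token per active request."""
    remaining = {rid: list(tokens) for rid, tokens in requests.items()}
    while remaining:
        for rid in list(remaining):
            await queues[rid].put(remaining[rid].pop(0))
            if not remaining[rid]:
                await queues[rid].put(None)  # end-of-stream marker
                del remaining[rid]
        # Yield control so the stream writers can drain their queues.
        await asyncio.sleep(0)


async def stream_response(rid, queue, received):
    """Stand-in for one HTTP response stream: forwards tokens as they arrive."""
    while (token := await queue.get()) is not None:
        received[rid].append(token)


async def serve(requests):
    queues = {rid: asyncio.Queue() for rid in requests}
    received = {rid: [] for rid in requests}
    await asyncio.gather(
        batch_loop(requests, queues),
        *(stream_response(rid, queues[rid], received) for rid in requests),
    )
    return received


print(asyncio.run(serve({"a": ["x", "y"], "b": ["1"]})))
# {'a': ['x', 'y'], 'b': ['1']}
```

The point of the design is the decoupling: the batch loop never waits on a slow client beyond handing a token to that request's queue, and each stream sees only its own tokens, in order.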
How batching affects user-perceived latency
TPOT under batching is somewhat higher than TPOT for a single request running alone, because each concurrent request adds a small amount of overhead to every decode step. For a workload with 100ms TPOT solo, batched TPOT might be 110-130ms at a moderate batch size. The user sees:
- TTFT roughly unchanged (prefill is mostly per-request).
- TPOT slightly higher than solo (the batching tax).
- Far higher concurrent capacity on the same hardware: the same server can handle 50-100x more concurrent users.
For most workloads this trade is correct. Latency-critical use cases (real-time voice agents, sub-100ms turnaround) sometimes pay extra for lower batch sizes or dedicated capacity.
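The trade can be put in numbers with the usual decomposition of response time into TTFT plus one TPOT per subsequent token. The TPOT values below come from the range above; the 300ms TTFT and 200-token output are assumptions for illustration.

```python
def response_time_ms(ttft_ms, tpot_ms, output_tokens):
    """End-to-end generation time for one request."""
    # The first token arrives after TTFT; each later token adds one TPOT.
    return ttft_ms + (output_tokens - 1) * tpot_ms


solo = response_time_ms(ttft_ms=300, tpot_ms=100, output_tokens=200)
batched = response_time_ms(ttft_ms=300, tpot_ms=120, output_tokens=200)

print(solo, batched)                          # 20200 24180
print(f"slowdown: {batched / solo:.2f}x")     # slowdown: 1.20x
```

Under these assumptions each user waits about 20% longer for a full response, in exchange for the server carrying many such streams at once.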
Speculative decoding and batching interaction
Speculative decoding generates draft tokens with a small model that the large model verifies in parallel. It interacts with batching:
- Each large-model iteration verifies multiple draft tokens for each request in the batch.
- If draft acceptance rate is high, effective TPOT drops significantly.
- If the batch is already throughput-saturated, speculative decoding provides less additional speedup because the large model is already busy.
It is most useful for low-batch-size, latency-sensitive serving where the GPU has spare capacity.
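To see how acceptance rate drives the speedup, the sketch below computes the expected number of tokens produced per large-model iteration. It assumes each draft token is accepted independently with the same probability, which is a simplification, and it ignores the cost of running the draft model. The function name and parameters are hypothetical.

```python
def expected_tokens_per_iteration(acceptance_rate, draft_tokens):
    """Expected tokens emitted per large-model verification pass.

    Assumes each draft token is accepted independently with probability
    acceptance_rate; the pass always yields at least one token.
    """
    a, k = acceptance_rate, draft_tokens
    if a == 1.0:
        return float(k + 1)
    # Geometric series 1 + a + a^2 + ... + a^k.
    return (1 - a ** (k + 1)) / (1 - a)


for rate in (0.5, 0.8, 0.9):
    print(rate, round(expected_tokens_per_iteration(rate, draft_tokens=4), 2))
```

At an acceptance rate of 0.8 with four draft tokens, each verification pass yields a little over three tokens on average, so effective TPOT drops by roughly that factor when the GPU has spare compute to do the verification.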
Frequently Asked Questions
What is continuous batching?
Continuous batching (also called in-flight or iteration-level batching) lets new requests join an in-progress batch at any decode step rather than waiting for the batch to finish. Each GPU iteration processes whichever requests are currently active. Throughput approaches the hardware's ceiling while per-request latency stays close to its single-request value. It is the dominant batching strategy in modern inference servers like vLLM and TGI.
How is batching different from streaming?
They are independent concerns. Batching is server-side: how the inference engine groups requests on the GPU to amortize work. Streaming is wire-side: how the server delivers the generated tokens to the client over the network. A server can batch many requests internally and stream each one's tokens to its respective client. The two terms are sometimes confused because both relate to throughput, but they operate at different layers.
Why does batching improve throughput?
During decode, each token requires reading the model weights and the request's KV cache from GPU memory. The compute per token is small. Batching multiple requests means one read of the weights serves multiple parallel token generations, so that bandwidth cost is amortized across the batch. Throughput scales sublinearly with batch size (you don't get 32x throughput from a batch of 32) but the improvement over batch-of-one is large.
What is the latency cost of batching?
With static or dynamic batching, individual requests may wait for a batch to fill or for an in-progress batch to complete before being scheduled. This adds queueing latency that did not exist with batch-of-one serving. Continuous batching eliminates most of this cost by allowing requests to join immediately. The remaining cost is per-token latency increase from GPU contention, which is small relative to single-request decode time.
Does streaming reduce time to first token?
No — TTFT is determined by prefill time, which happens before any token is available to stream. Streaming only changes how the already-generated tokens are delivered. Streaming reduces perceived latency for long outputs by letting the user see tokens as they're produced rather than waiting for the full response, but the first token always lands at the same server-side moment regardless of streaming.