LLM Inference Latency Components
LLM inference looks like a single black box from the API side, but inside it splits cleanly into two phases with very different performance profiles. Prefill processes the prompt; decode generates the response. Each is bottlenecked by something different — compute for prefill, memory bandwidth for decode — and they scale with different inputs. Understanding the split is the foundation for every other engineering decision around LLM serving: caching, batching, hardware choice, prompt design.
The two-phase model
| Phase | Input | Output | Bottleneck | Scales with |
|---|---|---|---|---|
| Prefill | All prompt tokens at once | Initial KV cache + first token logits | Compute (FLOPs) | Prompt length |
| Decode | One token at a time + cached KVs | One new token per step | Memory bandwidth | Output length |
Time to first token (TTFT) is essentially prefill time plus network overhead. Time per output token (TPOT) is decode time per step. Total response latency is roughly TTFT + (output_tokens × TPOT).
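The two metrics above can be measured from the client side using nothing but token arrival timestamps. The following is a minimal sketch under that assumption; the function names and the sample timestamps are illustrative and do not come from any particular SDK.

```python
def ttft_and_tpot(request_sent_s, token_arrival_s):
    """Return (TTFT, mean TPOT) in seconds from client-side timestamps."""
    if not token_arrival_s:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_arrival_s[0] - request_sent_s
    # TPOT: average gap between consecutive tokens after the first one.
    steps = len(token_arrival_s) - 1
    tpot = (token_arrival_s[-1] - token_arrival_s[0]) / steps if steps else 0.0
    return ttft, tpot


def estimated_total_latency(ttft_s, tpot_s, output_tokens):
    """Apply the rough model: TTFT + (output_tokens x TPOT)."""
    return ttft_s + output_tokens * tpot_s


# Illustrative timestamps (seconds): request sent at 0.0, four tokens received.
ttft, tpot = ttft_and_tpot(0.0, [0.50, 0.52, 0.54, 0.56])
print(f"TTFT={ttft:.2f}s TPOT={tpot * 1000:.0f}ms")
print(f"estimate for 100 tokens: {estimated_total_latency(ttft, tpot, 100):.2f}s")
```

Measured this way, TTFT includes network and queueing time as well as prefill, so it is an upper bound on server-side prefill time.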
Prefill: compute-bound
The prompt enters as a sequence of tokens. The model runs every attention layer over the full prompt in parallel — matrix multiplications that saturate the GPU's tensor cores. Output: the key and value tensors for every token at every layer (the KV cache), plus logits for the first output token.
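Latency scaling: for short prompts, prefill cost grows roughly linearly with prompt length, because the weight matrix multiplications dominate and are limited by matmul throughput. For very long prompts, attention's quadratic cost in sequence length takes over and the curve bends upward. Sliding-window and sparse attention patterns reduce the quadratic term itself; FlashAttention keeps attention exact but cuts its memory traffic, so in practice the curve stays close to linear for longer.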
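The linear and quadratic regimes can be made concrete with a common back-of-the-envelope FLOP count. This is a rough sketch, not a profiler: it assumes about 2 FLOPs per parameter per token for the weight matmuls and about 4 × d_model FLOPs per token pair per layer for attention, and the 7B-class shape (32 layers, d_model 4096) is a hypothetical example.

```python
def prefill_flops(n_params, n_layers, d_model, prompt_len):
    """Rough prefill FLOPs split into (linear-layer term, attention term)."""
    # Weight matmuls: about 2 FLOPs (multiply + add) per parameter per token.
    linear = 2 * n_params * prompt_len
    # QK^T plus attention-weighted V: about 4 * d_model FLOPs per token pair
    # per layer, ignoring the savings from causal masking.
    attention = 4 * n_layers * d_model * prompt_len ** 2
    return linear, attention


def crossover_length(n_params, n_layers, d_model):
    """Prompt length at which the two terms above are equal."""
    return n_params / (2 * n_layers * d_model)


# Hypothetical 7B-class shape; real models differ in detail.
for n in (1_000, 10_000, 100_000):
    linear, attention = prefill_flops(7e9, 32, 4096, n)
    print(n, f"attention share = {attention / (linear + attention):.1%}")
print("crossover near", round(crossover_length(7e9, 32, 4096)), "tokens")
```

Under these assumptions the attention term only overtakes the linear term at prompt lengths in the tens of thousands of tokens, which matches the description of a curve that starts linear and bends upward later.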
Decode: memory-bandwidth-bound
Each decode step generates one token. A single new query vector attends to the full KV cache, then the feed-forward layers compute the next token's logits. The arithmetic per step is small, a handful of matrix-vector multiplications, but both the model weights and the entire KV cache must be read from GPU memory to perform it.
For a 70B-parameter model with 32K context, the KV cache can be 10+ GB. Reading 10 GB per token at, say, 3 TB/s of HBM bandwidth takes about 3.3 ms. That is a floor on per-token decode latency from the KV cache alone on that hardware, regardless of how fast the compute units are. Reading the model weights adds to it, although that cost is shared across all requests in a batch.
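The bandwidth floor is plain division, and writing it out makes the assumptions visible. The sketch below reuses the 10 GB and 3 TB/s figures from the text; the weight term assumes 16-bit weights read from a single device, which is an illustrative assumption rather than a description of any specific deployment.

```python
def bandwidth_floor_ms(bytes_read_per_step, bandwidth_bytes_per_s):
    """Minimum time in ms to stream the given bytes once at the given bandwidth."""
    return bytes_read_per_step / bandwidth_bytes_per_s * 1e3


HBM_BANDWIDTH = 3e12  # 3 TB/s, the figure used in the text

# KV cache read: 10 GB per decode step for one long-context request.
kv_floor_ms = bandwidth_floor_ms(10e9, HBM_BANDWIDTH)

# Weight read: 70e9 parameters * 2 bytes each (assumption: 16-bit weights,
# one device; sharding across devices raises the aggregate bandwidth).
weights_floor_ms = bandwidth_floor_ms(70e9 * 2, HBM_BANDWIDTH)

print(f"KV cache floor: {kv_floor_ms:.1f} ms per token")
print(f"Weights floor:  {weights_floor_ms:.1f} ms per step")
```

The weight read is paid once per step and shared by every request in the batch, while the KV cache read is paid per request, which is one reason long contexts are expensive to serve even at large batch sizes.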
The KV cache in detail
The KV cache stores intermediate attention state so each new token can attend to all prior tokens without recomputing. Its size:
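KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes_per_value × batch_size
Here kv_heads is the number of key/value heads, which is smaller than the number of query heads in models that use grouped-query attention. For a typical 7B model at 16-bit precision, that works out to roughly 100 to 500 MB per 1000 tokens, depending on whether grouped-query attention is used. For a 70B model with grouped-query attention, it is a few hundred megabytes per 1000 tokens. Long contexts in large models can produce KV caches measured in tens of gigabytes, which is why context length is often gated by GPU memory, not by model capability.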
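The size formula translates directly into a small calculator. In this sketch the parameter kv_heads plays the role of the heads term in the formula (the number of key/value heads), and the two model shapes are hypothetical configurations chosen for illustration, not the specification of any named model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """KV cache size in bytes; the leading 2 covers the K and V tensors."""
    return (2 * layers * kv_heads * head_dim * seq_len
            * bytes_per_value * batch_size)


# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head_dim 128, 16-bit.
small = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1000)
print(f"{small / 1e6:.0f} MB per 1000 tokens")

# Hypothetical 70B-class shape with grouped-query attention:
# 80 layers, 8 KV heads, head_dim 128, 32K context, 16-bit.
large = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32 * 1024)
print(f"{large / 2**30:.0f} GiB at 32K context")
```

Because every factor multiplies, halving bytes_per_value (for example with an 8-bit KV cache) or reducing kv_heads shrinks the cache by the same ratio.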
Why TTFT and TPOT diverge across requests
- A 100-token prompt with 50-token response: TTFT small, TPOT × output dominates.
- A 10,000-token prompt with 50-token response: TTFT huge, TPOT × output is a tail.
- A 100-token prompt with 5000-token response: TTFT small, decode time dominates.
- A 10,000-token prompt with 5000-token response: both significant.
The same model serving the same workload can have wildly different latency depending on which combination dominates. Optimization strategies differ accordingly — long prompts benefit from prompt caching; long outputs benefit from speculative decoding or smaller models.
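The four combinations above can be compared numerically with a toy model. The per-token costs below are made-up round numbers (5,000 prompt tokens per second for prefill, 20 ms per output token), chosen only to show how the split shifts; they are not measurements.

```python
PREFILL_S_PER_TOKEN = 0.0002  # hypothetical: 5,000 prompt tokens per second
TPOT_S = 0.02                 # hypothetical: 50 output tokens per second


def latency_split(prompt_tokens, output_tokens):
    """Return (prefill seconds, decode seconds, prefill share of the total)."""
    prefill = prompt_tokens * PREFILL_S_PER_TOKEN
    decode = output_tokens * TPOT_S
    return prefill, decode, prefill / (prefill + decode)


for prompt, output in [(100, 50), (10_000, 50), (100, 5_000), (10_000, 5_000)]:
    prefill, decode, share = latency_split(prompt, output)
    print(f"prompt={prompt:>6} output={output:>5} "
          f"prefill={prefill:6.2f}s decode={decode:7.2f}s "
          f"prefill share={share:.1%}")
```

With these assumed rates, only the long-prompt, short-output case is prefill-dominated, which is the case where prompt caching would pay off most.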
Prompt caching: bypassing prefill
If the same prompt prefix appears in many requests (system message, retrieved documents, few-shot examples), the KV cache for that prefix can be reused across calls. The server stores the KV cache after the first prefill, recognizes a matching prefix on subsequent requests, and starts decode from the cached state. The cost of prefill drops to essentially zero for the cached portion.
For workloads with long, repeated prefixes, this is often the largest single latency optimization available; see the guide on how prompt caching works.
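The store-and-match behavior described in this section can be sketched as a toy lookup keyed by token prefix. This is a simplified illustration, not how any particular server implements it: the PrefixCache class is hypothetical, the KV state is an opaque placeholder, and production systems typically use more elaborate schemes than an exact-prefix dictionary.

```python
class PrefixCache:
    """Toy prefix cache: maps an exact token prefix to its stored KV state."""

    def __init__(self):
        self._store = {}  # tuple of token ids -> opaque KV state

    def put(self, tokens, kv_state):
        self._store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched length, KV state) for the longest cached prefix."""
        tokens = tuple(tokens)
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(tokens[:end])
            if hit is not None:
                return end, hit
        return 0, None


def tokens_to_prefill(cache, tokens):
    """Number of prompt tokens that still need a prefill pass."""
    cached_len, _ = cache.longest_prefix(tokens)
    return len(tokens) - cached_len


cache = PrefixCache()
cache.put([1, 2, 3], "kv-for-system-prompt")  # placeholder KV state
print(tokens_to_prefill(cache, [1, 2, 3, 4, 5]))  # shared prefix: 2 tokens left
print(tokens_to_prefill(cache, [9, 1, 2]))        # no shared prefix: all 3
```

The match must start at the first token: a prompt that contains the cached tokens later in the sequence gets no benefit, which is why stable content such as system messages belongs at the front of the prompt.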
Speculative decoding: amortizing decode
If a small draft model can predict several tokens that the large model would have generated, the large model can verify all of them in one forward pass instead of running one pass per token. Average effective TPOT can drop by a factor of roughly 2 to 5, depending on how often the draft tokens are accepted and how cheap the draft model is to run.
Network-side this changes nothing — the API still emits one token at a time on the wire. Server-side it lets the same hardware serve more requests at the same TPOT or the same request faster.
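How much speculative decoding helps depends mostly on the draft acceptance rate. A standard simplification treats each draft token as accepted independently with the same probability, which gives a closed form for the expected tokens produced per verification pass. The sketch below uses that simplification; the step times and the 0.8 acceptance rate are illustrative assumptions.

```python
def expected_tokens_per_verify(accept_rate, draft_len):
    """Expected tokens emitted per target-model pass.

    Assumes each draft token is accepted independently with probability
    accept_rate, and that the target model always contributes one token.
    """
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def effective_tpot_ms(target_step_ms, draft_step_ms, accept_rate, draft_len):
    """Average ms per emitted token, counting draft and verification cost."""
    cycle_ms = target_step_ms + draft_len * draft_step_ms
    return cycle_ms / expected_tokens_per_verify(accept_rate, draft_len)


# Illustrative numbers: 30 ms target step, 3 ms draft step, 4 draft tokens.
print(f"{expected_tokens_per_verify(0.8, 4):.2f} tokens per verification pass")
print(f"{effective_tpot_ms(30, 3, 0.8, 4):.1f} ms effective TPOT vs 30 ms baseline")
```

With an acceptance rate of zero the function returns one token per pass, so the draft work is pure overhead and effective TPOT ends up worse than the baseline.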
Continuous (in-flight) batching
Multiple requests at different stages can share GPU passes. A request still in prefill and another in decode can go through the same forward pass, and a finished request's slot is handed to a waiting request immediately instead of when the whole batch completes. The KV caches are kept separate per request but the compute is amortized. This is what gives modern inference servers their throughput; it does not directly reduce single-request latency, but it dramatically improves the throughput/latency curve.
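The throughput gain can be seen in a toy step counter that compares static batching (a batch runs until its longest request finishes) with continuous batching (a free slot is refilled on the next step). This is a simplified simulation that models decode steps only and ignores prefill and memory limits; the function names and request lengths are made up for illustration.

```python
def static_batching_steps(output_lens, batch_size):
    """Total decode steps when each batch waits for its longest request."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Total decode steps when freed slots are refilled before every step.

    Assumes every request needs at least one decode step.
    """
    waiting = list(output_lens)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Each in-flight request emits one token; finished ones leave.
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [2, 10, 2, 10]  # requested output lengths, in tokens
print("static:    ", static_batching_steps(lens, batch_size=2))
print("continuous:", continuous_batching_steps(lens, batch_size=2))
```

The long requests take the same number of steps either way, which matches the point that single-request latency is not directly reduced; the saving comes from short requests no longer holding slots idle.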
Where network latency fits
Inference latency is the time at the server. From the user's perspective, total latency includes network round trips:
- TLS handshake to the API endpoint: ~50-200 ms one-time.
- First-byte network latency from server to client: ~20-100 ms per RTT.
- Streaming chunks: each token-or-batch carries network framing overhead.
For short outputs, network latency can be a significant fraction of total response time. For long streamed outputs, the network is a thin layer over server-side inference time.
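The short-versus-long contrast is easy to quantify. The sketch below computes the network's share of total response time for one request; the 100 ms handshake and 50 ms round trip fall inside the ranges listed above, and the server times are illustrative assumptions.

```python
def network_share(server_s, rtt_s, tls_handshake_s=0.0):
    """Fraction of total response time spent on the network."""
    network_s = tls_handshake_s + rtt_s
    return network_s / (network_s + server_s)


# Short answer on a fresh connection: 0.3 s of inference.
print(f"short output: {network_share(0.3, rtt_s=0.05, tls_handshake_s=0.1):.0%}")

# Long streamed answer on the same connection: 60 s of inference.
print(f"long output:  {network_share(60.0, rtt_s=0.05, tls_handshake_s=0.1):.1%}")
```

Reusing an existing connection removes the handshake term, which matters most for the short-output case.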
Frequently Asked Questions
What is prefill in LLM inference?
Prefill is the first phase of LLM inference where the model processes the entire input prompt in parallel and builds the key-value (KV) cache. It is compute-bound — the GPU's tensor cores run hot — and its latency scales roughly linearly with prompt length. Time to first token (TTFT) is dominated by prefill time.
What is decode in LLM inference?
Decode is the autoregressive phase where the model generates output tokens one at a time. Each token requires reading the entire KV cache from memory, which makes decode memory-bandwidth-bound rather than compute-bound. Decode time scales linearly with the number of output tokens, with each token taking roughly the same time (time per output token, TPOT).
What is the KV cache?
The key-value cache stores intermediate attention computations for tokens already processed, so that each new generated token can attend to prior context without recomputing earlier attention. The KV cache grows linearly with sequence length and is the main reason long contexts use more GPU memory. Prefill builds the KV cache; decode reads from it.
Why is decode memory-bandwidth-bound?
During decode, the model generates one token at a time, but for each token the entire KV cache (often gigabytes for long contexts) must be read from GPU memory to compute attention. The computation per token is small relative to the memory read. The bottleneck is bandwidth, not arithmetic throughput, which is why high-bandwidth memory architectures dominate inference hardware.
How can prefill latency be reduced?
Several mechanisms: prompt caching reuses the KV cache from previously seen identical prompt prefixes, so the cached portion skips prefill; models with optimized attention (sliding window, sparse) reduce the per-token prefill cost on long prompts; and chunked prefill processes the prompt in smaller pieces, which mainly helps a server interleave prefill with other requests' decode steps. Speculative decoding, by contrast, speeds up decode rather than prefill.