LLM API Latency Explained
LLM latency does not behave like a normal API. A single "response time" number hides the most important property: the first byte takes a long time, but subsequent bytes arrive quickly. Engineers who treat an LLM call as a black-box round trip end up with poorly tuned applications. The correct model is two numbers: TTFT (time to first token) and TPOT (time per output token). Together they determine total response time, but they have completely different optimization levers.
The two-number latency model
Total response time for an LLM call is approximately:
total_time ≈ TTFT + (output_tokens × TPOT)
Both terms matter, but they break down differently:
- TTFT (Time To First Token): The latency until the first token appears. Dominated by prefill (the model processing the entire input prompt to build its internal state), plus network and queueing.
- TPOT (Time Per Output Token, also called Inter-Token Latency / ITL): The average gap between successive tokens after the first. Dominated by the model's generation speed, which depends on model size, hardware, and batch position.
A response with 500 output tokens at 800 ms TTFT and 25 ms TPOT takes:
800 ms + (500 × 25 ms) = 800 ms + 12,500 ms = 13.3 seconds total
Two completely different engineering moves change each term. Optimizing for total time without distinguishing them leads to wrong decisions.
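The arithmetic above can be captured in a small helper. This is a minimal sketch: the function name `estimate_total_ms` and its parameters are illustrative and not part of any provider SDK.

```python
def estimate_total_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate total response time: TTFT + (output_tokens x TPOT)."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_ms + output_tokens * tpot_ms


# The worked example from the text: 800 ms TTFT, 25 ms TPOT, 500 output tokens.
total_ms = estimate_total_ms(ttft_ms=800, tpot_ms=25, output_tokens=500)
print(f"{total_ms / 1000:.1f} s")  # prints "13.3 s"
```

Keeping the two inputs separate makes it easy to see which term dominates for a given workload: short answers are mostly TTFT, long answers are mostly TPOT.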
What goes into TTFT
| Component | Typical contribution | How to reduce |
|---|---|---|
| Client-to-API network round-trip | 30-150 ms | Use the API region closest to your origin; use HTTP/2 connection reuse |
| API gateway routing & auth | 10-50 ms | Reuse API keys; provider-side, not your control |
| Queueing on inference server | 0-2000 ms (varies wildly) | Use lower-traffic regions; provider-side scaling |
| Prefill (input tokens × per-token prefill cost) | 50-3000 ms | Use shorter prompts; use prompt caching |
| First-token generation | 10-50 ms | Use smaller model; provider-side |
For frontier models (Claude Opus, GPT-4o, Gemini Ultra) in 2026 with typical 1-5K token prompts, TTFT lands at 500-1500 ms in normal conditions and 2-5 seconds under load. Smaller models (Claude Haiku, GPT-4o-mini, Gemini Flash) come in at 200-800 ms. The biggest single variable is prompt length.
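Summing the component ranges from the table gives a rough envelope for TTFT. The sketch below only adds up the table's own numbers; the names `TTFT_COMPONENTS`, `ttft_range_ms`, and `largest_contributors` are illustrative. Treat the worst-case sum as pessimistic, since the components rarely all peak at once.

```python
# (component, low_ms, high_ms) copied from the table above.
TTFT_COMPONENTS = [
    ("Client-to-API network round-trip", 30, 150),
    ("API gateway routing & auth", 10, 50),
    ("Queueing on inference server", 0, 2000),
    ("Prefill", 50, 3000),
    ("First-token generation", 10, 50),
]


def ttft_range_ms(components):
    """Return (best_case, worst_case) TTFT by summing each component's range."""
    low = sum(lo for _, lo, _ in components)
    high = sum(hi for _, _, hi in components)
    return low, high


def largest_contributors(components, top_n=2):
    """Rank components by their worst-case contribution."""
    ranked = sorted(components, key=lambda c: c[2], reverse=True)
    return [name for name, _, _ in ranked[:top_n]]


print(ttft_range_ms(TTFT_COMPONENTS))         # (100, 5250)
print(largest_contributors(TTFT_COMPONENTS))  # ['Prefill', 'Queueing on inference server']
```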
Why prompt length matters so much for TTFT
Prefill is the operation where the model reads the entire input prompt and computes hidden states for every token in it. This is a parallel operation across input tokens, but it still scales roughly linearly with input length. A 100-token prompt prefills in tens of milliseconds; a 50,000-token prompt prefills in seconds. The relationship is roughly:
prefill_time ≈ input_tokens × per_token_prefill_cost
Per-token prefill cost varies by model — frontier 70B+ parameter models cost more per input token than smaller models, but all of them are linear in input length. This is the hidden latency tax of long context windows.
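If you have TTFT measurements at several prompt lengths, the linear relationship can be estimated with a least-squares fit: the slope approximates the per-token prefill cost and the intercept approximates the fixed overhead (network, routing, queueing). This is a sketch under the assumption that TTFT really is close to linear in your range; the function name and the sample numbers are made up for illustration and are not benchmark data.

```python
def fit_prefill_cost(samples):
    """Least-squares fit of ttft_ms = overhead_ms + input_tokens * per_token_ms.

    samples: list of (input_tokens, ttft_ms) pairs from your own measurements.
    Returns (per_token_ms, overhead_ms).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must cover more than one prompt length")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    per_token_ms = cov_xy / var_x
    overhead_ms = mean_y - per_token_ms * mean_x
    return per_token_ms, overhead_ms


# Made-up measurements for illustration only.
samples = [(1_000, 300.0), (5_000, 500.0), (20_000, 1_250.0)]
per_token_ms, overhead_ms = fit_prefill_cost(samples)
print(f"{per_token_ms:.3f} ms/token, {overhead_ms:.0f} ms overhead")
```

In practice, fit on medians per prompt-length bucket rather than single calls, because queueing noise can easily swamp the prefill signal.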
What goes into TPOT
After prefill, the model generates output tokens one at a time. Each output step is an autoregressive decode — read the previous tokens, compute the next token, append. TPOT is the average time per decode step.
TPOT depends on:
- Model size. Larger models have more parameters to apply per step. A 70B model has roughly 2-3x the TPOT of a 7B model on the same hardware.
- Hardware. H100 GPUs have higher memory bandwidth than A100, which is the bottleneck for autoregressive decode. Modern inference is memory-bandwidth-bound, not compute-bound.
- Batch position. Inference servers batch requests. Decoding 16 sequences in a batch shares fixed costs, so each individual sequence's TPOT is similar to single-sequence inference but the throughput is much higher.
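- Optimizations. Speculative decoding and KV cache compression can substantially reduce TPOT. Continuous batching mainly raises throughput and cuts queueing delay, which helps TPOT indirectly when the server is under load.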
Typical 2026 TPOT numbers for hosted APIs:
- Frontier models: 25-50 ms/token (20-40 tokens/sec).
- Mid-tier models: 15-30 ms/token (33-66 tokens/sec).
- Small/fast models: 8-15 ms/token (66-125 tokens/sec).
- Specialized fast variants (Groq, Cerebras, etc): 2-5 ms/token (200-500 tokens/sec).
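TPOT and tokens per second are reciprocals of each other, which is how the paired figures in the list above relate. A small conversion sketch (the function names are illustrative):

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Convert time per output token (ms) to generation rate (tokens/sec)."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


def tpot_ms_from_rate(tokens_per_sec: float) -> float:
    """Convert generation rate (tokens/sec) back to time per output token (ms)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return 1000.0 / tokens_per_sec


print(tokens_per_second(25))   # 40.0
print(tokens_per_second(50))   # 20.0
print(tpot_ms_from_rate(200))  # 5.0
```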
How streaming changes user-perceived latency
Without streaming, the user waits for the entire response before seeing anything — perceived latency equals total time. With streaming (Server-Sent Events, see SSE vs WebSocket), tokens appear as soon as they are generated.
For a 500-token response that takes 13.3 seconds total:
- Non-streaming: User stares at a spinner for 13.3 seconds.
- Streaming: First token appears at 800 ms (TTFT). User sees output filling in at ~40 tokens/sec. They can start reading immediately.
The user-perceived latency with streaming is effectively just TTFT, because the user can begin consuming the response while it generates. This is why TTFT optimization disproportionately matters for user-facing applications and TPOT optimization matters more for batch / backend workloads.
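To make the difference concrete, the sketch below estimates when the first few tokens of a response become visible with and without streaming. It assumes the first token arrives at TTFT and each later token arrives one TPOT after the previous one; the function name and parameters are illustrative.

```python
def time_until_visible_ms(n_tokens, ttft_ms, tpot_ms, output_tokens, streaming):
    """When the first n_tokens of the response become visible to the user."""
    if not 1 <= n_tokens <= output_tokens:
        raise ValueError("n_tokens must be between 1 and output_tokens")
    if streaming:
        # Token 1 arrives at TTFT; each later token adds one TPOT gap.
        return ttft_ms + (n_tokens - 1) * tpot_ms
    # Without streaming nothing is shown until the whole response is done.
    return ttft_ms + output_tokens * tpot_ms


# First 20 tokens (roughly one sentence) of the 500-token example.
print(time_until_visible_ms(20, 800, 25, 500, streaming=True))   # 1275
print(time_until_visible_ms(20, 800, 25, 500, streaming=False))  # 13300
```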
The latency variance problem
LLM API latency is not a stable number. Even for the same prompt to the same model, response times vary by 2-5x between calls. The reasons:
- Provider-side queueing. When the provider's GPU pool is busy, your request waits before being included in a batch.
- Batch composition. Decode steps batch active sequences. Your request's TPOT depends on what else is in the batch when it executes.
- Region-level capacity. Some regions hit capacity during peak hours; provider routing decisions move requests to less-loaded regions, adding latency.
- Network path variability. Standard internet routing variance applies. A few hundred ms of public-internet jitter is normal.
Production systems must design for P95 or P99, not P50. A user-facing application that targets 1-second TTFT at P50 will hit 3-5 seconds at P99 — and P99 is what gets observed by the noisy customer who complains.
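Percentiles are straightforward to compute from raw samples. The sketch below uses the nearest-rank method, one of several common percentile definitions; the sample values are invented to show how a small number of slow calls dominates the tail.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented TTFT samples in ms: mostly fast, with two slow outliers.
ttft_samples = [700] * 10 + [900] * 8 + [2500, 4200]
print(percentile(ttft_samples, 50))  # 700
print(percentile(ttft_samples, 95))  # 2500
print(percentile(ttft_samples, 99))  # 4200
```

With only 20 samples the P99 is simply the slowest call, so collect far more data before trusting tail percentiles.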
How to measure correctly
Treat TTFT and TPOT as separate metrics:
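import time

start = time.monotonic()  # start the clock before the request is sent
first_token_time = None
tokens_received = 0
stream = client.messages.stream(...)
for chunk in stream:
    if first_token_time is None:
        first_token_time = time.monotonic()  # first chunk marks the end of TTFT
    tokens_received += chunk.token_count  # token_count stands in for however your SDK reports tokens per chunk
    last_token_time = time.monotonic()
TTFT = first_token_time - start
TPOT = (last_token_time - first_token_time) / max(tokens_received - 1, 1)  # N tokens have N - 1 gaps
total = last_token_time - start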
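The same loop can be packaged as a function that is testable without calling an API. This is a sketch, not any SDK's interface: `measure_stream` and `LatencyResult` are hypothetical names, the stream is modelled as an iterable of per-chunk token counts, and the clock is injectable so a scripted one can stand in for real time.

```python
import time
from typing import Callable, Iterable, NamedTuple


class LatencyResult(NamedTuple):
    ttft: float         # seconds from start to first chunk
    tpot: float         # average seconds per inter-token gap
    total: float        # seconds from start to last chunk
    output_tokens: int  # tokens actually received


def measure_stream(
    chunks: Iterable[int],
    clock: Callable[[], float] = time.monotonic,
) -> LatencyResult:
    """Consume per-chunk token counts and compute TTFT, TPOT, and total time."""
    start = clock()  # call this function before the request is issued
    first = last = None
    tokens = 0
    for token_count in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        tokens += token_count
    if first is None:
        raise ValueError("stream produced no chunks")
    tpot = (last - first) / max(tokens - 1, 1)
    return LatencyResult(first - start, tpot, last - start, tokens)


# Scripted clock: request at t=0, chunks arrive at 0.8 s, 0.825 s, 0.85 s.
ticks = iter([0.0, 0.8, 0.825, 0.85])
result = measure_stream([1, 1, 1], clock=lambda: next(ticks))
print(result.output_tokens)  # 3
```

Pass a lazy iterator (for example a generator wrapping the real stream) so that the request is not sent before the clock starts.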
Track:
- TTFT at P50, P95, P99.
- TPOT at P50, P95, P99.
- Total response time at the same percentiles.
- Output token count (the number actually generated, not requested).
- Tag every record with model, region, prompt-length bucket, and cache hit/miss.
P50 and P95 tell you typical behavior. P99 tells you what your worst users see. Capacity planning should target the P99 line.
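One way to make the tagging concrete is a small record type plus a grouping key. Everything here is an assumption for illustration: the field names, the `BUCKET_EDGES` thresholds, and the helper names are hypothetical, so adapt them to your own schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyRecord:
    model: str
    region: str
    input_tokens: int
    cache_hit: bool
    ttft_ms: float
    tpot_ms: float
    output_tokens: int  # tokens actually generated, not the requested maximum


# Hypothetical bucket boundaries, in input tokens.
BUCKET_EDGES = (1_000, 5_000, 20_000)


def prompt_length_bucket(input_tokens: int) -> str:
    """Map a prompt length to a coarse bucket label such as '<5000'."""
    for edge in BUCKET_EDGES:
        if input_tokens < edge:
            return f"<{edge}"
    return f">={BUCKET_EDGES[-1]}"


def group_by_tags(records):
    """Group records by (model, region, prompt-length bucket, cache hit)."""
    groups = defaultdict(list)
    for r in records:
        key = (r.model, r.region, prompt_length_bucket(r.input_tokens), r.cache_hit)
        groups[key].append(r)
    return dict(groups)
```

Computing percentiles per group rather than globally keeps a slow region or a cache-miss population from hiding inside an otherwise healthy aggregate.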
Reducing TTFT
- Use prompt caching. Cache hits skip prefill for the cached prefix, so only the uncached remainder of the prompt is processed. TTFT drops from seconds to ~100 ms for the cacheable part. See prompt caching explained.
- Shorten the prompt. Every input token costs prefill time. Trim system prompts; trim retrieved context; use shorter examples.
- Pick a closer region. Network RTT adds up. Use the API region geographically closest to your servers.
- Use connection reuse. Keeping a persistent connection open (HTTP keep-alive or a shared HTTP/2 connection) saves the TCP and TLS handshakes (~50-150 ms) on every request after the first.
- Pick a smaller model for TTFT-sensitive paths. Smaller models prefill faster. Use a small model for first-impression latency; switch to larger models for follow-ups if needed.
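The first two levers can be compared with a rough estimate that treats cached prefix tokens as costing no prefill time. That is a simplifying assumption (real cache lookups are not free), and the function name and the example costs are illustrative, not measured.

```python
def estimate_ttft_ms(input_tokens, cached_tokens, per_token_prefill_ms, fixed_overhead_ms):
    """Rough TTFT: fixed overhead plus prefill over the uncached tokens only."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    return fixed_overhead_ms + uncached_tokens * per_token_prefill_ms


# Illustrative numbers: 20,000-token prompt, 0.05 ms/token prefill, 250 ms overhead.
no_cache = estimate_ttft_ms(20_000, 0, 0.05, 250)
with_cache = estimate_ttft_ms(20_000, 18_000, 0.05, 250)  # 18,000-token cached prefix
shorter = estimate_ttft_ms(10_000, 0, 0.05, 250)          # prompt trimmed by half
print(f"{no_cache:.0f} ms, {with_cache:.0f} ms, {shorter:.0f} ms")
```

Under these assumptions, caching a large shared prefix helps more than halving the prompt, but only for requests that actually hit the cache.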
Reducing TPOT
- Use a faster-tier provider. Groq, Cerebras, SambaNova, and other specialized inference hardware achieve 5-10x lower TPOT (5-10x more tokens per second) than standard cloud APIs.
- Pick a smaller model. If quality requirements allow, smaller models stream faster.
- Reduce output length. Capped output tokens, terse system prompts, and structured-output schemas all reduce total tokens emitted.
- Speculative decoding (provider-side) can roughly halve TPOT for models that support it.
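When there is a hard latency budget, the total-time formula can be inverted to find how many output tokens fit. A minimal sketch, with an illustrative function name and a made-up 5-second budget:

```python
import math


def max_output_tokens_for_budget(budget_ms, ttft_ms, tpot_ms):
    """Largest output length whose estimated total time fits within budget_ms."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    remaining_ms = budget_ms - ttft_ms
    if remaining_ms <= 0:
        return 0  # TTFT alone already uses up the budget
    return math.floor(remaining_ms / tpot_ms)


# 5-second budget with 800 ms TTFT.
print(max_output_tokens_for_budget(5000, 800, 25))    # 168
print(max_output_tokens_for_budget(5000, 800, 12.5))  # 336
```

Halving TPOT doubles the output length that fits in the same budget, which is why output caps and faster decode are interchangeable levers for total time.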
Frequently Asked Questions
What is TTFT in LLM APIs?
TTFT stands for Time To First Token: the latency from sending a request to receiving the first token of the response. It is dominated by two components: prefill time (the model processing the entire input prompt to compute its hidden state) and routing time (queueing, load balancing, network round-trip). For frontier models with large prompts, TTFT is typically 500-3000 ms. Prompt caching dramatically reduces TTFT when applicable, because cache hits skip prefill for the cached prefix.
What is TPOT and how does it differ from TTFT?
TPOT is Time Per Output Token (sometimes called inter-token latency or ITL): the average time between successive tokens in the streamed response. While TTFT is the upfront cost, TPOT determines how fast text appears. For frontier models in 2026, TPOT ranges from 25-50 ms per token depending on model size and infrastructure; smaller and specialized models are faster. A 500-token response at 25 ms TPOT takes 12.5 seconds to fully stream after the first token arrives, regardless of TTFT.
Does prompt length affect TTFT?
Yes, significantly. TTFT scales roughly linearly with input token count because the model must perform a prefill pass over the entire prompt before generating any output. A 100-token prompt prefills in tens of milliseconds; a 50,000-token prompt prefills in seconds. This is why long context windows have a hidden latency tax. Prompt caching mitigates this for repeated prompt prefixes — cached prefill is served from memory instead of recomputed.
Why does my LLM latency vary so much across requests?
LLM inference servers batch requests for throughput. When the server is busy, your request waits in a queue before being included in a batch; when the server is idle, your request starts processing immediately. P50 latency reflects unloaded conditions; P99 latency reflects queue depth. Provider-side capacity, traffic patterns, and your concurrent request rate all affect this. Provider SLAs typically guarantee P50 or P95, not P99 — design for the worst case.
How do I measure LLM latency correctly?
Measure both TTFT and TPOT separately, and track them at P50, P95, and P99. For TTFT, start the timer when you send the request and stop when the first token arrives. For TPOT, divide the time from first to last token by the number of output tokens minus one, since N tokens have N - 1 gaps between them. Record output tokens generated (not requested) to handle truncation correctly. Tag every measurement with model, region, and approximate prompt length so you can correlate latency variance to specific causes.