LLM Tokens, Bytes, and Bandwidth
LLMs bill in tokens; networks transmit bytes. The two are related but not identical, and the difference matters for cost forecasting, bandwidth planning, and multilingual applications. A 2000-token response is approximately 8 KB of model output, but the network typically sends 30-50 KB of JSON events to deliver it, and closer to 200 KB if every token ships as its own event. This guide unpacks the token-to-byte translation, why some languages cost 2-3x more per word than others, and the bandwidth math for production LLM workloads.
What a token actually is
Tokens are the model's vocabulary units. They are not words or characters — they are byte sequences produced by a learned tokenizer that compresses common text patterns into shorter codes. Modern LLMs use byte-pair encoding (BPE) variants, which build a vocabulary by repeatedly merging the most common pair of adjacent byte sequences in training data.
The result: common English words ("the", "and", "of") become single tokens; rare words get split into multiple tokens; arbitrary byte sequences (random JSON, base64 strings, non-Latin scripts) fragment into many small tokens.
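The merge loop described above can be sketched in a few lines. This is a toy trainer under stated assumptions (a four-word corpus, two merges, the hypothetical helper names `most_common_pair`, `merge_pair`, and `train_toy_bpe`), not the tokenizer any provider actually ships:

```python
from collections import Counter

def most_common_pair(seqs):
    """Count adjacent symbol pairs across all sequences; return the most frequent one."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def train_toy_bpe(words, num_merges):
    """Start from single bytes and apply `num_merges` greedy merges."""
    seqs = [[bytes([b]) for b in w.encode("utf-8")] for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs

# Toy corpus (assumption): "the" is a frequent prefix, "xq" is rare.
merges, seqs = train_toy_bpe(["the", "then", "there", "xq"], num_merges=2)
for word, seq in zip(["the", "then", "there", "xq"], seqs):
    print(word, "->", len(seq), "tokens:", seq)
```

After two merges the frequent sequence "the" has collapsed into a single symbol, while the rare "xq" is still two separate bytes. That is the same effect, at toy scale, that makes common words cheap and unusual byte sequences expensive.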
For Anthropic's Claude and OpenAI's GPT-4 family as of 2026, the rules of thumb for English text are:
- 1 token ≈ 4 characters
- 1 token ≈ 0.75 English words
- 1000 tokens ≈ 750 words ≈ 4000 characters ≈ a page of text
Bytes per token by content type
| Content type | Bytes per token (avg) | Notes |
|---|---|---|
| English prose | 3.5-4.0 | The reference baseline; tokenizers optimized for this |
| Python / JavaScript code | 3.0-3.5 | Lots of short syntax tokens (parens, operators, punctuation) |
| JSON / structured data | 2.5-3.0 | Punctuation-heavy; many single-character tokens |
| Chinese (Mandarin) | 2.0-3.0 | Each Chinese character is 3 bytes UTF-8 but is often 1-2 tokens |
| Japanese / Korean | 2.5-3.5 | Mixed scripts; less efficient than Latin |
| Arabic / Hebrew | 2.0-3.0 | Letters are 2 bytes each in UTF-8; words often split into several short tokens |
| Base64 data | 2.0-2.5 | Pseudo-random bytes; poor BPE compression |
| Random binary | 1.5-2.0 | Worst case; nearly byte-per-byte tokenization |
This is where the multilingual cost gap comes from. Tokenizers were trained predominantly on English text, so English text tokenizes efficiently — one token per 4 bytes typically. Non-English text often takes 2-3x more tokens for the same semantic content. Since LLM pricing is per token, the same information costs more to process in those languages.
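The byte side of this gap is easy to check without any tokenizer. The sketch below measures UTF-8 bytes per character for a few sample strings (the strings and the helper name `utf8_bytes_per_char` are illustrative assumptions), then derives the bytes-per-token range implied when one Chinese character costs 1-2 tokens, as the table notes:

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)

# Illustrative sample strings (assumptions, not benchmark data).
samples = {
    "English": "bandwidth",
    "Chinese": "网络带宽",
    "Arabic": "شبكة",
}

for script, sample in samples.items():
    print(f"{script}: {utf8_bytes_per_char(sample):.1f} bytes per character")

# If one Chinese character takes 1-2 tokens, each token covers this many bytes.
chinese_bytes = utf8_bytes_per_char(samples["Chinese"])
chinese_bytes_per_token = (chinese_bytes / 2, chinese_bytes / 1)
print("Chinese bytes per token (2 tokens/char, 1 token/char):", chinese_bytes_per_token)
```

Actual token counts still have to come from the model's tokenizer; this only shows why fewer bytes per token goes hand in hand with more tokens for the same content.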
The wire format: what bytes actually flow
When an LLM streams a response, each token (or small group of tokens) becomes a JSON event. The JSON overhead is substantial — often more bytes than the token content itself.
Example single token event from Anthropic SSE:
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
The actual content is 5 bytes ("Hello"). The event with its SSE framing is roughly 100-120 bytes. Stream a 2000-token response one token at a time and you transmit 200 KB or more of JSON to deliver 8 KB of actual text, an overhead of about 25x.
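The size of the event shown above can be measured directly. This sketch rebuilds the same frame with compact JSON and counts its bytes; the helper name `sse_frame` is hypothetical, and the count covers only the SSE text, not HTTP chunking, TLS, or TCP framing:

```python
import json

def sse_frame(text: str) -> bytes:
    """Build one content_block_delta SSE frame carrying `text`, as in the example above."""
    payload = {
        "type": "content_block_delta",
        "index": 0,
        "delta": {"type": "text_delta", "text": text},
    }
    # Compact separators match the whitespace-free JSON shown in the example.
    data = json.dumps(payload, separators=(",", ":"))
    # An SSE event ends with a blank line, hence the double newline.
    return f"event: content_block_delta\ndata: {data}\n\n".encode("utf-8")

frame = sse_frame("Hello")
content_bytes = len("Hello".encode("utf-8"))
print(f"frame: {len(frame)} bytes, content: {content_bytes} bytes, "
      f"ratio: {len(frame) / content_bytes:.0f}x")
```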
OpenAI's format is leaner per event (~70 bytes) and tends to batch 2-4 tokens per chunk, reducing overhead. But the pattern remains: streaming adds 5-10x bandwidth overhead vs the underlying content.
Why providers batch tokens into chunks
If providers sent every token as its own SSE event, bandwidth and per-event CPU cost would be high. To reduce this, most providers batch 2-4 tokens per chunk. The trade-off:
- Smaller chunks (1 token each): smoother visual streaming, finer-grained progress visibility, but higher overhead.
- Larger chunks (5-10 tokens each): less overhead, but choppy visual streaming — tokens appear in bursts.
Most production APIs land at 2-3 tokens per chunk as the visual compromise. For self-hosted gateways with very high concurrency, increasing this batch size reduces server-side load substantially.
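To see how batch size changes bytes on the wire, here is a rough model. It assumes about 95 bytes of framing per event (the roughly 100-byte event above minus its text) and 4 bytes of text per token; both numbers and the function name `streamed_bytes` are assumptions to adjust for your provider:

```python
import math

def streamed_bytes(n_tokens, tokens_per_event,
                   framing_bytes_per_event=95, bytes_per_token=4):
    """Total streamed bytes: per-event framing plus the token text itself."""
    events = math.ceil(n_tokens / tokens_per_event)
    return events * framing_bytes_per_event + n_tokens * bytes_per_token

for batch in (1, 3, 10):
    total = streamed_bytes(2000, batch)
    print(f"{batch:>2} tokens/event: {total / 1000:.0f} KB on the wire")
```

Under these assumptions a 2000-token response falls from about 198 KB at one token per event to about 71 KB at three and 27 KB at ten, which is the saving a high-concurrency gateway trades against visual smoothness.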
Bandwidth planning for production workloads
Compute bandwidth requirements as:
bytes_per_request = (input_tokens × bytes_per_token_input)
                  + (output_tokens × bytes_per_token_output_with_overhead)
bytes_per_token_input ≈ 4 bytes (English); 2-3 bytes (non-Latin scripts, raw UTF-8)
bytes_per_token_output_with_overhead ≈ 80-150 bytes (streaming format)
bandwidth = bytes_per_request × requests_per_second × 8 bits/byte
Example: a chat app with 100 concurrent users, average 1000-token prompts, 500-token responses, 1 request per minute per user:
upload_bytes/request = 1000 × 4 = 4000 bytes
download_bytes/request = 500 × 100 = 50,000 bytes (streamed)
requests/sec = 100 / 60 = 1.67/sec
upload_bandwidth = 4000 × 1.67 = 6.7 KB/sec = 54 Kbps
download_bandwidth = 50,000 × 1.67 = 83 KB/sec = 670 Kbps
Modest. Even a fairly large LLM application typically uses under 100 Mbps of total bandwidth. The bottleneck for LLM applications is rarely bandwidth — it is the token meter (cost) and the inference server's GPU capacity (latency).
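The same arithmetic as a small function, so the inputs can be swapped for your own workload. The function and parameter names are illustrative assumptions; the defaults are the 4 bytes per input token and 100 bytes per streamed output token used in the example above:

```python
def bandwidth_kbps(users, requests_per_user_per_min, input_tokens, output_tokens,
                   bytes_per_input_token=4, bytes_per_output_token=100):
    """Return (upload_kbps, download_kbps) for a steady request rate."""
    requests_per_sec = users * requests_per_user_per_min / 60
    upload_bytes = input_tokens * bytes_per_input_token
    download_bytes = output_tokens * bytes_per_output_token
    # bytes/request x requests/sec x 8 bits/byte, expressed in kilobits per second
    upload = upload_bytes * requests_per_sec * 8 / 1000
    download = download_bytes * requests_per_sec * 8 / 1000
    return upload, download

up, down = bandwidth_kbps(users=100, requests_per_user_per_min=1,
                          input_tokens=1000, output_tokens=500)
print(f"upload: {up:.0f} Kbps, download: {down:.0f} Kbps")
```

Without the intermediate rounding, the result lands at about 53 Kbps up and 667 Kbps down, in line with the figures worked out by hand above.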
Compression: when it matters
Most LLM API requests are small enough that compression overhead does not pay off. The exceptions:
- Long context uploads (10K+ tokens). Gzip a 40 KB prompt down to ~10 KB. Worthwhile if upload bandwidth is constrained or you are billed for ingress.
- Self-hosted gateway proxying. Inter-region or hybrid setups paying egress costs benefit from compression on long requests.
- Mobile clients. Mobile uploads at the edge of cell coverage may be bandwidth-limited; compression reduces upload time even if total bytes are modest.
Output streams are rarely worth compressing. A compressor that buffers enough bytes to reach a good ratio delays incremental delivery, and flushing after every small event loses most of the compression benefit.
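Both points can be checked with the standard library. This is a sketch under stated assumptions: the prompt is an artificially repetitive stand-in, so it compresses far better than real prose would, and the 100-byte chunk size is only a rough proxy for one streamed event:

```python
import gzip

# Artificially repetitive stand-in for a long prompt (assumption); real prose
# compresses less dramatically than repeated text does.
sentence = "Summarize the attached incident report and list the affected services. "
prompt = (sentence * 600).encode("utf-8")

# Case 1: compress the whole upload in one pass.
whole = gzip.compress(prompt)

# Case 2: compress the same bytes as independent 100-byte chunks, so each
# chunk pays its own gzip header and starts with an empty dictionary.
chunk_size = 100
chunks = [prompt[i:i + chunk_size] for i in range(0, len(prompt), chunk_size)]
per_chunk_total = sum(len(gzip.compress(c)) for c in chunks)

print(f"raw: {len(prompt)} bytes")
print(f"gzip, whole prompt: {len(whole)} bytes")
print(f"gzip, per {chunk_size}-byte chunk: {per_chunk_total} bytes")
```

Compressing chunk by chunk gives back most of the saving, because every chunk carries its own header and cannot reuse patterns seen in earlier chunks.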
The context window in bytes
A 200K-token context window (Claude's standard size; GPT-4 Turbo tops out at 128K) is roughly 800 KB of English text. Filling it on every request means an 800 KB upload per call. Concrete examples:
| Context window | ~Bytes (English) | Tokens | Page-equivalent |
|---|---|---|---|
| 4K | 16 KB | 4,000 | ~3 pages |
| 16K | 64 KB | 16,000 | ~12 pages |
| 128K | 512 KB | 128,000 | ~100 pages / a short book |
| 200K | 800 KB | 200,000 | ~160 pages |
| 1M | 4 MB | 1,000,000 | ~800 pages / a long book |
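Long-context applications that re-upload the full context on every request can consume bandwidth surprisingly quickly. A user asking 10 questions per minute against a 200K-token document uploads 8 MB/min, about 1 Mbps for that one user. Whether prompt caching removes the re-upload depends on the API: prefix-based caching (as offered by Anthropic and OpenAI) still requires the full prompt in every request, so it cuts token cost and latency but not upload bytes, while APIs that let later requests reference stored context by a handle send the context only once.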
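The per-user figure follows from a one-line calculation. A minimal sketch, assuming 4 bytes per token of English text and a hypothetical function name:

```python
def reupload_mbps(context_tokens, requests_per_min, bytes_per_token=4):
    """Sustained upload rate in Mbps when the full context is resent on every request."""
    bytes_per_min = context_tokens * bytes_per_token * requests_per_min
    return bytes_per_min * 8 / 60 / 1_000_000

rate = reupload_mbps(context_tokens=200_000, requests_per_min=10)
print(f"{rate:.2f} Mbps per user")
```

Multiplying by the number of simultaneous long-context users gives a first estimate of the upload capacity a gateway needs.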
Tokenizer differences between models
Different models use different tokenizers. Counts vary:
- OpenAI cl100k_base (GPT-4 family, 2023-): about 100K vocabulary, 3.5-4 bytes/token for English.
- OpenAI o200k_base (GPT-4o, GPT-5 family): about 200K vocabulary, slightly more efficient (3.8-4.2 bytes/token for English) and significantly better for non-English.
- Anthropic Claude tokenizer: similar overall efficiency to cl100k_base, with different exact token boundaries.
- Llama 3 tokenizer: 128K vocabulary, similar efficiency.
- DeepSeek / Qwen tokenizers: 100-150K vocabulary, optimized for Chinese — 2x more efficient than GPT-4 for Chinese text.
Token counts for the same input text typically differ by 10-15% between providers. Cost comparisons should normalize for this: provider A at $3/M tokens and provider B at $3/M tokens are not the same price if provider B's tokenizer produces 15% more tokens for your typical content. In that case provider B effectively costs $3.45 per million of provider A's tokens.
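The normalization is a single multiplication. In this sketch the provider labels, prices, and token ratios are placeholders; the ratio should come from tokenizing a sample of your own content with each provider's tokenizer:

```python
def effective_price_per_m(list_price_per_m, token_ratio):
    """List price scaled by how many tokens this provider emits relative to a baseline."""
    return list_price_per_m * token_ratio

# Hypothetical providers: same list price, B emits 15% more tokens for the same text.
providers = {
    "A": {"price": 3.00, "token_ratio": 1.00},
    "B": {"price": 3.00, "token_ratio": 1.15},
}

for name, p in providers.items():
    price = effective_price_per_m(p["price"], p["token_ratio"])
    print(f"provider {name}: ${price:.2f} per million baseline tokens")
```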
Practical token estimation in code
Use the provider's official tokenizer for accurate counts:
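# Python with tiktoken (OpenAI); runs locally, no API call
import tiktoken

text = "Tokens are not words."  # sample input; replace with your own content
enc = tiktoken.encoding_for_model("gpt-4o")  # looks up the encoding for this model
tokens = enc.encode(text)
print(len(tokens), "tokens")

# Python with the Anthropic SDK; this is an API call and needs an API key
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
count = client.beta.messages.count_tokens(
    model="claude-opus-4",  # substitute a current model ID
    messages=[{"role": "user", "content": text}],
)
print(count.input_tokens, "tokens")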
For quick estimation without API calls, the 4-bytes-per-token rule for English is accurate to ±15%. For non-English content, use the actual tokenizer — estimation will be wildly wrong otherwise.
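A minimal offline estimator built on that rule, returning a low, middle, and high figure from the ±15% band. The function name and defaults are assumptions, and it should only be trusted for English prose:

```python
def estimate_english_tokens(text, bytes_per_token=4.0, tolerance=0.15):
    """Return (low, mid, high) token estimates from the UTF-8 byte length."""
    n_bytes = len(text.encode("utf-8"))
    mid = n_bytes / bytes_per_token
    return (round(mid * (1 - tolerance)), round(mid), round(mid * (1 + tolerance)))

sample = "a" * 4000  # stand-in for about 4000 characters of English text
low, mid, high = estimate_english_tokens(sample)
print(f"estimated tokens: {mid} (range {low}-{high})")
```

Budgeting against the high end of the range is the safer choice when the estimate gates a context-window limit.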
Frequently Asked Questions
How many bytes is one LLM token on average?
For English text, one token is roughly 4 bytes (UTF-8 encoded). The rule of thumb is 1 token ≈ 0.75 words ≈ 4 characters. For Chinese, Japanese, and Korean, each character is 3 bytes in UTF-8 and often takes 1-2 tokens, so one token covers roughly 2-3.5 bytes. For code, tokens average 3-3.5 bytes because syntax produces many short tokens. For random binary data or unusual Unicode, bytes per token can drop below 2, but those cases are rare in normal LLM traffic.
How does streaming affect actual bandwidth usage?
Streaming sends each token (or small group of tokens) as its own SSE event with JSON overhead of roughly 100-200 bytes per event. For frontier models generating 30 tokens/second, this is roughly 3-6 KB/s of network traffic, mostly metadata. For a 2000-token response streamed one token per event, that is 200-400 KB total, small compared to a typical web page. Total bandwidth is rarely the bottleneck for LLM applications. Per-event JSON overhead accounts for most of the bytes that are sent, and it can be reduced by buffering multiple tokens into one event.
Why are non-English languages more expensive per word?
Tokenizers were trained predominantly on English text, so English words compress efficiently: common words map to single tokens. Non-English languages, especially those with non-Latin scripts (Chinese, Japanese, Korean, Arabic, Thai), tokenize less efficiently. A Chinese paragraph may use 2-3x more tokens than the equivalent English meaning. Since LLM pricing is per token, the same information costs 2-3x more to send in those languages. This is a real economic gap; where possible, choose a model whose tokenizer is tuned for your target languages.
Can I compress LLM API requests?
Often, but check first: not every provider accepts gzip-encoded request bodies via the Content-Encoding header. For large prompts (10K+ tokens of text), gzip reduces request bytes by 60-80%. This saves upload bandwidth and shortens upload time; it does not affect the TLS handshake, which completes before any request body is sent. Compression is most useful for self-hosted gateways that proxy large prompts; for typical short prompts it is barely worth the overhead. Output is rarely compressed because compressing small streamed events individually yields little saving.
What is a context window in bytes?
A 200K-token context window is approximately 800 KB of English text (4 bytes per token on average). A single API call that fills the window uploads 800 KB. Without caching, every request retransmits the full context. With prompt caching the answer depends on the API: where cached context can be referenced by a key or handle, the 800 KB is uploaded once; where caching matches on the prompt prefix, each request still sends the full 800 KB and the saving is in token cost and latency rather than bandwidth.