Prompt Caching: How It Works
Prompt caching is the single biggest cost and latency optimization available to LLM users in 2026. It can reduce input-token billing by up to 90% and cut time to first token (TTFT) by an order of magnitude, but only if your prompt structure is designed for it. The mechanic is simple: the provider caches the computed attention state of repeated prompt prefixes, and on a cache hit, skips the expensive prefill computation for the cached portion. The catch is that even tiny changes to the prefix invalidate the cache. This guide explains exactly what gets cached, when, and how to design prompts to maximize hit rate.
What is actually cached
LLM inference runs in two phases: prefill and decode. Prefill is the expensive part: the model reads every input token and computes the attention key/value (KV) tensors for it. These KV tensors are what subsequent decode steps reference to generate output. For a 10,000-token prompt, prefill processes all 10,000 tokens in a parallel forward pass.
The cache stores those KV tensors keyed on the exact input bytes. On a cache hit, the server loads the KV tensors from fast storage (GPU memory or NVMe) instead of recomputing them. The output is identical; the work to produce it is skipped.
Implications:
- What is cached is precomputed model activations, not the raw input text.
- The cache is sensitive to byte-for-byte changes: any whitespace, ordering, or content change produces a different cache key, so the KV tensors must be recomputed.
- Caches are tied to specific model versions. Switching from Claude Opus 4 to Claude Opus 4.5 invalidates all caches for that prompt.
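The implications above can be illustrated with a toy lookup table. This is only a sketch: `ToyPrefixCache` is a hypothetical class, the stored value is a placeholder rather than real KV tensors, and real servers match the longest shared prefix at token granularity instead of hashing one whole string.

```python
import hashlib


class ToyPrefixCache:
    """Toy stand-in for a provider-side KV cache (illustrative only)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model: str, prefix: str) -> str:
        # The key covers the model name and the exact prefix bytes.
        raw = f"{model}\x00{prefix}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def lookup_or_fill(self, model: str, prefix: str) -> str:
        key = self._key(model, prefix)
        if key in self._store:
            return "hit"
        # A real server would store KV tensors; a placeholder marks the entry.
        self._store[key] = f"<kv for {len(prefix)} chars>"
        return "miss"


cache = ToyPrefixCache()
results = [
    cache.lookup_or_fill("model-a", "You are a helpful assistant."),
    cache.lookup_or_fill("model-a", "You are a helpful assistant."),
    cache.lookup_or_fill("model-a", "You are a helpful assistant. "),  # trailing space
    cache.lookup_or_fill("model-b", "You are a helpful assistant."),   # different model
]
print(results)  # ['miss', 'hit', 'miss', 'miss']
```

Only the byte-identical prefix on the same model hits; the trailing space and the model switch both miss.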
Anthropic's prompt caching
Anthropic exposes prompt caching as an explicit feature. You mark which parts of the prompt should be cached using a cache_control annotation in the message structure.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                "text": user_question,
            },
        ],
    }
]
The cache_control marker on the first content block establishes a "cache breakpoint" — everything up to and including that block is cached. On subsequent requests with identical content up to the breakpoint, the cached state is reused.
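To confirm that a breakpoint is actually being reused, it helps to look at the usage counters returned with each response. The sketch below assumes the usage object has been converted to a plain dict and uses the field names `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` as I understand them from Anthropic's Messages API; `cache_read_fraction` is a hypothetical helper and the sample values are made up.

```python
def cache_read_fraction(usage: dict) -> float:
    """Share of input tokens that were served from cache for one response."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Made-up usage dicts: the first call writes the cache, the second reads it.
first_call = {
    "input_tokens": 40,
    "cache_creation_input_tokens": 9960,
    "cache_read_input_tokens": 0,
}
second_call = {
    "input_tokens": 40,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 9960,
}

print(cache_read_fraction(first_call))   # 0.0
print(cache_read_fraction(second_call))  # 0.996
```

A fraction that stays at zero across calls that should share a prefix is a sign that something before the breakpoint is changing.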
Pricing
- Cache write: 125% of base input price (25% premium over uncached input).
- Cache hit: 10% of base input price (90% discount).
- Cache TTL: 5 minutes by default; 1 hour with "cache_control": {"type": "ephemeral", "ttl": "1h"} at higher write cost.
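The 90% discount on hits means that even with a 25% write surcharge, you come out ahead on the first reuse: two requests cost 1.25 + 0.10 = 1.35× the base input price instead of 2×. For a system prompt sent 1000 times per hour, with every request after the first hitting the cache, the cost reduction is roughly 90% on the cached portion.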
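The break-even point follows directly from the two multipliers listed above. A minimal sketch, assuming one cache write followed only by hits (no expiry in between); `break_even_requests` and `savings_fraction` are hypothetical helper names.

```python
from typing import Optional


def break_even_requests(write_mult: float = 1.25, hit_mult: float = 0.10,
                        max_requests: int = 10_000) -> Optional[int]:
    """Smallest request count at which caching is cheaper than not caching."""
    for n in range(1, max_requests + 1):
        cached = write_mult + (n - 1) * hit_mult  # one write, then n-1 hits
        if cached < n:                            # uncached cost is n x base price
            return n
    return None


def savings_fraction(n: int, write_mult: float = 1.25,
                     hit_mult: float = 0.10) -> float:
    """Fraction of the cacheable input cost saved over n requests."""
    cached = write_mult + (n - 1) * hit_mult
    return 1 - cached / n


n_star = break_even_requests()
print(n_star)                             # 2
print(round(savings_fraction(1000), 3))   # 0.899
```

With these multipliers the second request (the first cache hit) already makes caching cheaper, and the savings approach 90% as the request count grows.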
OpenAI's prompt caching
OpenAI implements automatic prompt caching with no explicit annotation. The server detects when a request shares a prefix with a recent request and serves that prefix from cache. The number of cached prompt tokens is reported in the response's usage field.
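from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ],
)
# Number of prompt tokens served from cache (0 on a miss)
print(response.usage.prompt_tokens_details.cached_tokens)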
Pricing
- No write surcharge (writes are billed at base input price).
- Cache hit: 50% of base input price (50% discount).
- TTL: typically 5-10 minutes, not configurable.
- No explicit cache control — caching happens automatically when a prefix matches a recent request.
OpenAI's caching is less powerful than Anthropic's (50% vs 90% discount), but easier to use — you do not need to restructure prompts to opt in. For OpenAI, the optimization is just "design your prompt with the long stable parts at the front."
The prefix-stability principle
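Both implementations cache a prefix: from the beginning of the prompt up to a breakpoint. On Anthropic the breakpoint is explicit; on OpenAI it is implicitly the end of the longest prefix that matches a recent request. Anything BEFORE the breakpoint must be byte-for-byte identical across requests. Anything AFTER the breakpoint can vary freely.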
The design principle: put your most stable, longest content at the beginning of the prompt; put your variable content at the end. Common structures:
┌─────────────────────────────────────────┐
│ System prompt (very stable, long)       │ ← cached
│ Tool definitions (stable)               │ ← cached
│ Few-shot examples (stable)              │ ← cached
│ RAG context documents (semi-stable)     │ ← cached
├──────────  cache breakpoint  ───────────┤
│ Chat history (changes per turn)         │ ← prefill cost paid
│ Latest user message (always new)        │ ← prefill cost paid
└─────────────────────────────────────────┘
For an agentic application that streams tool calls back and forth, the stable parts (system prompt, tool definitions) can stay cached across an entire conversation while the rapidly-changing parts (recent messages, tool results) pay full prefill cost.
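The stable-first layout can be enforced in code rather than by convention. The following is a minimal sketch, assuming Anthropic-style text content blocks with a single breakpoint; `build_content_blocks` is a hypothetical helper, not part of any SDK.

```python
def build_content_blocks(stable_sections, dynamic_sections):
    """Place stable text first, mark the last stable block as the cache
    breakpoint, then append everything that varies per request."""
    blocks = [{"type": "text", "text": s} for s in stable_sections]
    if blocks:
        # Everything up to and including this block is eligible for caching.
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
    blocks += [{"type": "text", "text": s} for s in dynamic_sections]
    return blocks


blocks = build_content_blocks(
    stable_sections=["<system instructions>", "<few-shot examples>"],
    dynamic_sections=["<chat history>", "<latest user message>"],
)
for block in blocks:
    print(block["text"], "cache_control" in block)
```

Keeping the split in one function makes it harder for a later change to slip dynamic text ahead of the breakpoint.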
What invalidates the cache
The cache is keyed on exact bytes before the breakpoint. Anything that changes those bytes invalidates the cache. Common mistakes:
- Embedding a timestamp. "Today is November 5, 2026" in the system prompt — invalidates every day at midnight.
- Embedding the user's name. "You are a helpful assistant for {user_name}" — invalidates per user, eliminating cross-user cache hits.
- JSON key ordering. Some JSON serializers do not preserve key order. Two structurally identical prompts may serialize to different bytes.
- Trailing whitespace. Adding or removing a trailing newline invalidates.
- Tool definition reordering. If your tool list is built from a dict iteration that does not preserve order, the byte sequence may differ between requests.
- Counters or session IDs. Including "Request #42" or session metadata anywhere before the breakpoint.
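The key-ordering and tool-reordering problems in the list above can usually be avoided by serializing deterministically. A minimal sketch using only the standard library; `canonical_tools_json` is a hypothetical helper and the tool dicts are made-up examples.

```python
import json


def canonical_tools_json(tools):
    """Serialize tool definitions to the same bytes regardless of the order
    in which the tools or their keys were assembled."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


# Same two tools, built in a different order with different key order.
tools_a = [
    {"name": "search", "description": "Search the docs"},
    {"name": "calc", "description": "Evaluate arithmetic"},
]
tools_b = [
    {"description": "Evaluate arithmetic", "name": "calc"},
    {"description": "Search the docs", "name": "search"},
]

same = canonical_tools_json(tools_a) == canonical_tools_json(tools_b)
print(same)  # True
```

Sorting by name is only one possible convention; any fixed order works as long as every request uses the same one.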
Audit fixes:
- Hash the bytes of the prefix that you intend to cache. Log the hash with every request.
- If hits should occur but the hash differs across calls you expect to match, diff the underlying strings.
- Move all dynamic content explicitly after the breakpoint.
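The audit steps above can be sketched with the standard library alone. `prefix_fingerprint` and `changed_lines` are hypothetical helper names, and the two prefixes are made-up examples of a date leaking into a system prompt.

```python
import difflib
import hashlib


def prefix_fingerprint(prefix: str) -> str:
    """Short, loggable hash of the exact bytes intended for caching."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]


def changed_lines(old: str, new: str) -> list:
    """Lines that differ between two prefixes, marked '- ' (old) or '+ ' (new)."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


monday = "You are a helpful assistant.\nToday is 2026-11-05."
tuesday = "You are a helpful assistant.\nToday is 2026-11-06."

if prefix_fingerprint(monday) != prefix_fingerprint(tuesday):
    for line in changed_lines(monday, tuesday):
        print(repr(line))  # repr() exposes trailing whitespace
```

Logging the fingerprint with every request makes it possible to see when the prefix changed before digging into why.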
The latency impact
Cache hits dramatically reduce TTFT. The prefill step that takes hundreds of milliseconds to seconds for a long prompt is replaced by a memory load that takes tens of milliseconds. Illustrative numbers (actual values vary by model, provider, and load):
| Cacheable prefix length | Uncached TTFT | Cached TTFT | Speedup |
|---|---|---|---|
| 500 tokens | ~400 ms | ~250 ms | 1.6x |
| 5,000 tokens | ~1,200 ms | ~300 ms | 4x |
| 50,000 tokens | ~6,000 ms | ~400 ms | 15x |
| 200,000 tokens (full context) | ~25,000 ms | ~600 ms | 40x |
For applications with very long stable contexts (an entire codebase as RAG context, a large document being analyzed across many queries), the latency improvement from caching is the difference between "feels slow" and "feels instant."
The cost math
Calculate savings before adopting:
uncached_cost = N × (input_tokens × input_price)
cached_cost = 1 × (input_tokens × write_price) + (N-1) × (input_tokens × hit_price)
For Anthropic with input_price = $3/M tokens, write_price = $3.75/M, hit_price = $0.30/M, 10,000 cacheable tokens, N=100 requests within TTL:
uncached = 100 × 10,000 × $3/1M = $3.00
cached   = 1 × 10,000 × $3.75/1M + 99 × 10,000 × $0.30/1M
         = $0.0375 + $0.297
         ≈ $0.335
A 9x cost reduction on the cacheable portion. Output tokens are billed normally; if output is small relative to the cached prefix, total savings approach the cached-portion savings.
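In practice not every request after the first will hit, so it can be useful to express the same calculation in terms of a hit rate. This is a sketch under one simplifying assumption that is not from the text above: every miss is billed as a fresh cache write. `expected_cost` is a hypothetical helper; the default prices are the ones used in the worked example.

```python
def expected_cost(tokens: int, n_requests: int, hit_rate: float,
                  input_price: float = 3.00, write_price: float = 3.75,
                  hit_price: float = 0.30):
    """Return (uncached, cached) cost in dollars for the cacheable prefix.

    Prices are per million tokens. Assumes every miss pays the write price.
    """
    millions = tokens * n_requests / 1_000_000
    uncached = millions * input_price
    cached = millions * (hit_rate * hit_price + (1 - hit_rate) * write_price)
    return uncached, cached


# 99 hits out of 100 requests reproduces the worked example.
uncached, cached = expected_cost(10_000, 100, hit_rate=0.99)
print(f"${uncached:.2f} vs ${cached:.4f}")

# At a 20% hit rate the write surcharge outweighs the discount.
uncached_low, cached_low = expected_cost(10_000, 100, hit_rate=0.20)
print(f"${uncached_low:.2f} vs ${cached_low:.2f}")
```

Under this assumption, caching only pays off once the hit rate clears roughly 22% at these prices, which is why measuring the real hit rate matters before relying on the headline discount.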
When NOT to use prompt caching
- Single-shot prompts. If the prompt is never repeated within the cache TTL, you pay the write surcharge for nothing. At a 25% write premium the first cache hit already recovers the surcharge, but zero hits is a net loss.
- Highly variable prompts. If even small parts of the prefix vary between requests, caches never hit.
- Prompts shorter than the cache minimum. Anthropic requires at least 1024 tokens (or 2048 for some models) before the cache breakpoint to be eligible. Shorter prompts cannot be cached.
- One-off content uploads. A long document being analyzed once does not benefit from caching unless multiple queries are made against it within the TTL.
Operational patterns
Single-tenant cache hit rate
Most caching opportunities are per-user, per-session. A chat session sends a stable system prompt + growing history. The system prompt stays cached for the duration of the session.
Multi-tenant shared cache
A SaaS product with a stable global system prompt + per-tenant customization. The global part is cached across all tenants in your organization. The per-tenant part starts after the cache breakpoint and is recomputed every request.
RAG with cached corpus
Retrieval-augmented generation typically rebuilds context per query, which defeats caching. The pattern that preserves caching: precompute and cache the entire corpus or a frequently-relevant subset; vary only the user query at the end. This works for "ask anything about this document" use cases but not for embedding-based retrieval that selects different chunks per query.
Agentic loops
An agent runs many tool calls within a single conversation. The system prompt + tool definitions stay cached for the duration of the agent loop, often 10-50 turns. At a 25% write premium, the cache pays for itself on the second turn.
Frequently Asked Questions
What is actually cached when an LLM caches a prompt?
The model's KV cache — the per-token hidden state computed during the prefill pass — is what gets stored. On a cache hit, the server loads the precomputed KV cache from fast storage instead of recomputing it from the input tokens. The tokens themselves are not what is cached; the computed activations are. This is why caching saves both compute time (no prefill) and money (the cached portion is billed at a discount).
How long does an LLM prompt cache last?
Anthropic's default is 5 minutes; a 1-hour TTL option is available at higher cost. OpenAI's cache TTL is typically 5-10 minutes and is automatic (no explicit option). The cache is keyed on the exact prefix bytes; any change invalidates it. Some providers also evict caches under memory pressure before the TTL expires. Plan around the documented TTL but treat caching as a best-effort optimization, not a guarantee.
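Whether a given traffic pattern stays inside the TTL can be estimated with a small simulation. The sketch below assumes that each use of a cached prefix refreshes its TTL and that no early eviction occurs; both are assumptions to check against your provider's documentation. `simulate_hits` is a hypothetical helper and the arrival times are made up.

```python
def simulate_hits(arrival_times, ttl_seconds: float = 300.0) -> int:
    """Count cache hits for requests arriving at the given times (seconds)."""
    hits = 0
    expires_at = None
    for t in sorted(arrival_times):
        if expires_at is not None and t < expires_at:
            hits += 1
        # On a miss the entry is written; on a hit it is assumed to be
        # refreshed. Either way the TTL restarts from this request.
        expires_at = t + ttl_seconds
    return hits


# Three requests close together, a gap longer than 5 minutes, then two more.
arrivals = [0, 60, 120, 500, 560]
hit_count = simulate_hits(arrivals)
print(hit_count)  # 3
```

Running real request timestamps through a model like this gives a rough upper bound on the hit rate before any provider-side eviction is taken into account.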
What is the price discount for a cache hit?
Anthropic charges 10% of the base input token price for cache hits (90% discount) and 25% extra for the initial cache write. OpenAI charges 50% of the base input token price for cache hits (50% discount) with no explicit write surcharge. Both providers apply the discount only to cached input tokens; output tokens and uncached input tokens are billed normally. For repeating prompt prefixes, the savings compound rapidly: at a $3/M base input price on Anthropic, each million tokens read from cache costs $0.30 instead of $3.00, a saving of $2.70.
Does prompt caching work across users?
On Anthropic and OpenAI, caches are scoped to a single API organization but shared across all requests from that organization. Different end users in the same organization share cache hits on identical prompt prefixes. Caches are NOT shared across different organizations or across providers. For multi-tenant LLM applications, this means a shared system prompt across all tenants hits cache; per-tenant customizations after the cached prefix do not.
What invalidates a prompt cache?
Any change to the bytes before the cache breakpoint invalidates the cache. This includes whitespace differences, JSON key order changes, timestamp insertion, user names embedded in system prompts, and date-stamp instructions like "today is October 15". Place all dynamic content AFTER the cacheable section. The exact cache key is the byte-for-byte prefix; even a single character difference forces a fresh prefill.