Context Window and Token Budgets

The context window is the total number of tokens an LLM can attend to in a single call. Bigger windows look like more headroom but they cost disproportionately more in compute and memory, and quality on the largest windows often degrades for tasks that require precise recall from anywhere in the input. Production LLM applications need an explicit token budget — a deliberate allocation across system prompt, retrieved context, conversation history, and output — rather than a "fill it up until it errors" approach.

What gets counted

The context window covers everything the model sees plus everything it generates:

  • System prompt — your instructions to the model.
  • Conversation history — all prior user and assistant turns.
  • Tool / function results — every tool call's output is part of the conversation.
  • Retrieved context — RAG documents.
  • Current user message.
  • Reserved output capacity — the response itself counts against the window.

Total of all of the above must fit. If you have a 128K window and want a 4K response, your input budget is 124K.

Cost shape of long context

Input length    Prefill time (rough)    KV cache memory
1K tokens       ~100 ms                 ~10 MB
10K tokens      ~1 s                    ~100 MB
100K tokens     ~10-20 s                ~1 GB
1M tokens       Minutes                 ~10 GB

The numbers vary by model and hardware but the shape is consistent. Long contexts dominate prefill latency. Without prompt caching, every call with a long context pays this cost.
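
The linear growth of the KV cache follows from its layout: one key vector and one value vector per token, per layer, per KV head. A minimal sketch of that estimate, assuming a hypothetical model configuration (the layer count, KV head count, head dimension, and 2-byte precision below are illustrative and not taken from any particular model). Absolute sizes depend heavily on architecture and precision and may differ substantially from the rough figures in the table; the point is the linear shape.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch: int = 1) -> int:
    """Estimate KV cache size: keys + values for every token, layer, and KV head."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return per_token * tokens * batch


# Hypothetical configuration, for illustration only.
CONFIG = dict(layers=32, kv_heads=8, head_dim=128)

for n in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(n, **CONFIG) / 2**20
    print(f"{n:>7} tokens -> {mib:,.0f} MiB")
```

Multiplying the input length by 10 multiplies the estimate by 10, whatever configuration is plugged in.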

The lost-in-the-middle effect

Multiple studies show that LLM recall is best for information near the beginning or end of the context and worse for information in the middle. A specific fact placed around token 50K of a 100K-token context is harder for the model to retrieve than the same fact at token 5K or 95K. This effect is partially mitigated in newer models but is not fully eliminated.

Practical implication: don't rely on the model finding critical information buried in the middle of a very long context. Put critical instructions at the beginning or end. For RAG, surface the most relevant retrieved documents first.
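
One way to apply this when assembling a prompt is sketched below: instructions go first, retrieved documents follow in descending relevance, and the question comes last. The `Doc` class, the `score` field, and the `[doc N]` labels are hypothetical names chosen for illustration; a real retriever will expose its own result type.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    score: float  # relevance score from the retriever; higher is more relevant


def build_prompt(instructions: str, docs: list[Doc], question: str) -> str:
    """Place instructions first, most relevant documents next, and the question last."""
    ranked = sorted(docs, key=lambda d: d.score, reverse=True)
    doc_block = "\n\n".join(f"[doc {i + 1}] {d.text}" for i, d in enumerate(ranked))
    return f"{instructions}\n\n{doc_block}\n\nQuestion: {question}"
```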

A budgeting template

For a 128K-window chat application with RAG:

Component               Allocation    Tokens
System prompt           ~3%           4K
Retrieved context       ~50%          64K
Conversation history    ~25%          32K
Current user message    ~5%           6K
Output reserve          ~15%          20K
Safety margin           ~2%           2K

Numbers shift by workload. Pure-chat apps need more history budget; document-analysis apps need more retrieval budget. The discipline is naming each bucket explicitly rather than letting them silently compete.
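
Naming the buckets can be as simple as a dictionary that is validated at startup and checked per request. The sketch below uses the template's token figures; it assumes "K" means 1,000 tokens (some vendors mean 1,024), and the function and key names are hypothetical.

```python
K = 1_000  # assumption: "K" means 1,000 tokens; some vendors mean 1,024

WINDOW = 128 * K
BUDGET = {
    "system_prompt": 4 * K,
    "retrieved_context": 64 * K,
    "conversation_history": 32 * K,
    "user_message": 6 * K,
    "output_reserve": 20 * K,
    "safety_margin": 2 * K,
}


def validate_budget(budget: dict[str, int], window: int) -> None:
    """Fail fast if the named buckets add up to more than the window."""
    total = sum(budget.values())
    if total > window:
        raise ValueError(f"budget {total} exceeds window {window}")


def over_budget(usage: dict[str, int], budget: dict[str, int]) -> dict[str, int]:
    """Return the overage in tokens for each component that exceeds its bucket."""
    return {name: used - budget[name]
            for name, used in usage.items()
            if used > budget[name]}


validate_budget(BUDGET, WINDOW)
```

A per-request call such as `over_budget({"retrieved_context": 70_000}, BUDGET)` then reports which bucket needs trimming instead of letting retrieval silently crowd out history or output.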

Truncation strategies when over budget

  • Drop oldest conversation turns. Simple; works for most chat.
  • Summarize old history. Replace the oldest N turns with a short summary. Preserves more information than dropping but costs an extra LLM call.
  • Limit retrieval to top-k. Fewer documents at higher relevance often beats more documents at lower relevance.
  • Truncate retrieved documents. Use chunking with size limits; rank chunks not documents.
  • Switch to a larger-context model. When the workload genuinely needs more, model selection is the answer.
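
The first strategy, dropping the oldest turns, might look like the sketch below. It keeps the newest turns that fit a history budget. The `rough_count` helper is a placeholder that counts whitespace-separated words, not real tokens; in practice a model-specific tokenizer would be passed in instead.

```python
from typing import Callable


def trim_history(turns: list[str], budget: int,
                 count_tokens: Callable[[str], int]) -> list[str]:
    """Keep the most recent turns whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                     # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()                    # restore chronological order
    return kept


def rough_count(text: str) -> int:
    # Placeholder counter: whitespace-separated words, NOT real tokens.
    return len(text.split())
```

Stopping at the first turn that does not fit, rather than skipping it and continuing, keeps the retained history contiguous.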

Tokenization on the client

To budget accurately, the client needs to know token counts before sending. Each model has a specific tokenizer; the same text produces different token counts in different models. Most APIs document their tokenizer and provide a client-side library (tiktoken for OpenAI, tokenizers libraries for Hugging Face models, etc.). Measuring exact token counts client-side avoids expensive trial-and-error against the API.
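
A minimal sketch of client-side counting with tiktoken follows. The encoding name `cl100k_base` is an assumption here; the correct encoding depends on the target model. The fallback branch is a crude estimate of about 4 characters per token, used only when tiktoken or its encoding data is unavailable, and should not be treated as an exact count.

```python
from typing import Callable


def make_token_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Return a function that counts tokens, preferring tiktoken when available."""
    try:
        import tiktoken  # third-party: pip install tiktoken
        encoding = tiktoken.get_encoding(encoding_name)
        return lambda text: len(encoding.encode(text))
    except Exception:
        # Fallback when tiktoken (or its encoding data) is unavailable:
        # a crude ~4 characters per token estimate, not an exact count.
        return lambda text: (len(text) + 3) // 4


count_tokens = make_token_counter()
print(count_tokens("How many tokens is this sentence?"))
```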

Tokens are not characters or words

English text averages about 4 characters per token, or roughly 0.75 words per token (about 1.3 tokens per word). Code, JSON, non-Latin scripts, and special characters tokenize differently; sometimes one character becomes multiple tokens. A 10K-character English document is ~2.5K tokens; the same character count of dense JSON might be ~3.5K tokens; the same character count of CJK text might be 5K+ tokens. Don't extrapolate from one workload to another.
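
For capacity planning before real data is available, those rough ratios can be turned into a per-content-type estimator. The rates below are the approximate figures from the paragraph above, not measurements; real counts should come from the model's tokenizer.

```python
# Tokens per 100 characters, taken from the rough figures above (assumed, not measured).
TOKENS_PER_100_CHARS = {"english": 25, "json": 35, "cjk": 50}


def estimate_tokens(char_count: int, content_type: str) -> int:
    """Rough token estimate, rounded up; use a real tokenizer for exact counts."""
    rate = TOKENS_PER_100_CHARS[content_type]
    return (char_count * rate + 99) // 100  # integer ceiling division


for kind in TOKENS_PER_100_CHARS:
    print(kind, estimate_tokens(10_000, kind))
```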

Context window choice and pricing

Many providers price longer-context variants higher per token. Even at the same per-token price, longer contexts cost more per call because more tokens get processed. Where a long-context variant carries a higher per-token price, using it at a 10K-token average input costs more than using a short-context model at the same length. Pick the window that fits your input-length distribution, not the largest available.
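
The comparison is simple arithmetic once prices are known. In the sketch below, all prices are hypothetical dollars per million tokens and do not reflect any vendor's actual pricing.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call, given prices per million input and output tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices in dollars per million tokens; not real vendor pricing.
short_ctx = cost_per_call(10_000, 500, input_price_per_m=1.0, output_price_per_m=4.0)
long_ctx = cost_per_call(10_000, 500, input_price_per_m=2.0, output_price_per_m=8.0)

print(f"short-context model: ${short_ctx:.4f} per call")
print(f"long-context model:  ${long_ctx:.4f} per call")
```

Running the same calculation over the observed distribution of input lengths, rather than a single average, shows how often the larger window is actually needed.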

When long context replaces RAG

Very large context windows (1M+ tokens) sometimes let applications skip retrieval entirely — just paste all candidate documents into the prompt. This works for small corpora (a single book, a few large reports) but quality degrades and cost balloons for anything larger. RAG remains the right pattern for medium-to-large corpora; long context is the right pattern for "I have a specific large document and want to ask about it."

Frequently Asked Questions

What is a context window?

The maximum total number of tokens (input + output) an LLM can process in a single inference call. Models advertise context windows of 8K, 32K, 128K, 200K, 1M, or more tokens. The window is shared between everything you send (system prompt, conversation history, retrieved documents, user message) and everything the model generates.

Why does long context cost more than proportional compute?

Attention cost grows quadratically with sequence length in vanilla transformers. Optimizations reduce the practical cost without removing it: FlashAttention cuts memory traffic, sliding-window attention limits how far each token attends, and grouped-query attention shrinks the KV cache. A 100K-token prefill can therefore take noticeably more than 10x as long as a 10K-token prefill, and the KV cache, which grows linearly with length, consumes proportionally more memory.
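
A toy model makes the scaling concrete. It treats prefill cost as a linear term (per-token work) plus a quadratic term (attention). The `crossover` parameter, the hypothetical length at which the two terms are equal, is an assumption for illustration and not a measured property of any model.

```python
def relative_prefill_cost(tokens: int, crossover: int) -> float:
    """Toy cost model: a linear term plus a quadratic attention term.

    `crossover` is the hypothetical sequence length at which the quadratic
    term becomes as large as the linear one.
    """
    return tokens + tokens * tokens / crossover


def slowdown(short: int, long: int, crossover: int) -> float:
    """How many times more expensive the long prefill is than the short one."""
    return relative_prefill_cost(long, crossover) / relative_prefill_cost(short, crossover)


print(slowdown(10_000, 100_000, crossover=50_000))
```

Under this model the ratio for a 10x longer input always lands between 10x (purely linear) and 100x (purely quadratic); where it falls depends on how much of the work is attention.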

What is the quality cliff with long contexts?

Empirically, model recall and reasoning quality degrade with very long contexts even when the model technically supports them. Information in the middle of a long context is often missed (the 'lost in the middle' effect). Models trained on shorter sequences may technically accept long ones but produce lower-quality output. Quality cliffs vary by model and task.

How should I budget tokens across components?

A reasonable starting allocation for a RAG application: system prompt 5-10%, retrieved context 50-70%, conversation history 10-20%, current user message 5%, output reserve 10-20%. Adjust based on workload — chat-heavy applications need more history budget; document-heavy applications need more retrieval budget.

What happens if I exceed the context window?

Typically the API returns an error rather than silently truncating. The client must then either truncate the input (drop old conversation, shorten retrieved documents), use a model with a larger window, or restructure the workflow (summarize history, chunk inputs across multiple calls). Detecting this proactively with a tokenizer before sending is much cheaper than receiving an error and retrying.
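
A pre-flight check along these lines can run before every call. It is a sketch: the function name and error message are hypothetical, and the token counts are assumed to come from the model's own tokenizer.

```python
def check_request(input_tokens: int, max_output_tokens: int, window: int) -> int:
    """Return remaining headroom in tokens; raise if the request cannot fit."""
    headroom = window - input_tokens - max_output_tokens
    if headroom < 0:
        raise ValueError(
            f"request is {-headroom} tokens over the {window}-token window; "
            "truncate, summarize, or split before sending"
        )
    return headroom
```

For example, `check_request(124_000, 4_000, 128_000)` returns 0 (an exact fit), while 125,000 input tokens with the same output reserve raises before any API call is made.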
