LLM API Cost Optimization

Q: How do I cap output tokens to control cost?

Set max_tokens in every request to a tight bound — 500 for short responses, 2000 for medium, 4000 for long. The provider stops generating once max_tokens is hit, even mid-sentence. Combined with stop sequences for structured output formats, this prevents the worst-case scenario of a runaway response. Output tokens are typically 3-5x more expensive than input tokens, so capping output has outsize cost impact.

LLM bills go from manageable to terrifying fast. Most teams overpay by 5-10x by using frontier models for tasks that smaller models handle fine, paying full price for prompts that could be cached, and running real-time API calls for jobs that could use the 50%-discounted Batch API. The optimization techniques are not exotic — they are well-known and provider-supported. This guide walks through them in order of impact, with the math for when each is worth the engineering effort.

The cost optimization hierarchy

Cost optimization works in this order, from highest impact to lowest:

Model routing. Use the smallest model that produces acceptable quality. 5-10x impact when applicable.
Prompt caching. Cache stable prompt prefixes. 5-10x impact on cacheable portion.
Batch API. 50% discount for non-realtime jobs. 2x impact.
Output length control. Cap max_tokens, use structured output. 2-3x impact on output cost.
Prompt compression. Trim system prompts, reduce few-shot examples. 1.5-2x impact.
Provider negotiation. Committed use discounts and enterprise tiers. 1.2-1.5x impact at scale.

Stacking the top 3 (model routing + caching + batch where applicable) commonly delivers 20-50x total cost reduction vs. naively using a frontier model in real-time for everything.

Model routing: the biggest single lever

Frontier models cost 10-20x more per token than mid-tier models, and 50-100x more than small models. Using a frontier model for every task is the most common cost mistake.

Task category	Best-fit tier	Example models (2026)
Complex reasoning, code generation	Frontier	Claude Opus, GPT-5, Gemini Ultra
Summarization, simple Q&A, classification	Mid-tier	Claude Sonnet, GPT-4o, Gemini Pro
Extraction, formatting, light classification	Small/fast	Claude Haiku, GPT-4o-mini, Gemini Flash
Pure pattern matching, no reasoning	Embedding + traditional	text-embedding-3-small, MiniLM

Implementation patterns:

Static model routing

Different endpoints in your application use different models. Marketing summarization uses Sonnet; legal review uses Opus; data formatting uses Haiku. Simple to implement; requires explicit per-feature decisions about quality vs cost.

Dynamic model routing (router LLM)

Use a small model to classify the request difficulty, then route to the appropriate model. The classification call adds latency and cost but pays back when most requests route to cheaper models. Several open-source routers exist (RouteLLM, LiteLLM).

Cascade pattern

Try the small model first. If its response indicates uncertainty or low confidence, escalate to the larger model. Most requests get answered cheaply; only hard ones pay the full cost.

Prompt caching: 5-10x on cacheable workloads

Covered in depth in prompt caching how it works. Summary cost math:

Anthropic: Cache hits at 10% of base price. Cache writes at 125%. Break-even after 2 reuses.
OpenAI: Cache hits at 50% of base price. Automatic; no write overhead.

For applications with stable system prompts (almost all production LLM apps), enabling prompt caching is the single highest-ROI engineering task. Implementation effort: a few hours. Cost reduction on the cached portion: 5-9x on Anthropic, 2x on OpenAI.

Patterns that benefit most:

Chatbots with stable system prompts (the prompt is cached across all users).
RAG applications with stable corpus (the corpus is cached; only the query varies).
Agent loops (the system prompt + tool definitions stay cached across many turns).
Per-user products where each user has a stable persona/context (per-user cache).

Batch API: 50% off for non-realtime work

Both Anthropic and OpenAI offer a Batch API with 50% discount on all token types in exchange for a 24-hour completion SLA. Submit a JSONL file of requests; receive a JSONL file of responses when ready.

Eligible workloads:

Nightly data processing. Classifying yesterday's customer feedback, enriching CRM records.
Historical backfill. Running a new prompt or model over an existing archive.
Embeddings for large corpora. Indexing thousands of documents.
Periodic reports. Weekly executive summaries from operational data.
Training data generation. Creating synthetic data for fine-tuning.

Ineligible workloads:

User-facing chat or interactive workflows.
Agent loops where step N depends on step N-1's response.
Any task with a sub-hour latency requirement.

The 50% discount stacks with prompt caching — but the caching benefit may be smaller in batch since requests in a batch are processed asynchronously and may not benefit from sequential cache hits.

Output length: the most-ignored cost lever

Output tokens are typically 3-5x more expensive than input tokens. For Claude Opus in 2026: input ~$15/M tokens, output ~$75/M tokens. A 5x ratio means every output token costs as much as five input tokens. Yet output length is often left uncontrolled.

Set max_tokens aggressively

Default max_tokens to the actual maximum useful response length, not the model's maximum capability. Most chatbot responses fit in 500-1500 tokens; structured output rarely needs more than 2000. Generous max_tokens (8000-16000) is rarely needed and just leaves the door open to runaway generation.

Use structured output schemas

Both Anthropic and OpenAI support JSON schema-constrained output. The model is forced to produce only the fields you specify. This both improves quality and dramatically shortens responses by eliminating preamble ("Here is the JSON you requested:") and verbose explanations.

Use stop sequences

For multi-turn formats or specific output structures, set stop sequences that terminate generation as soon as the useful content is complete. Example: stop=[""] for tagged output.

Concise system prompts

"Respond concisely." in the system prompt reliably reduces output length by 20-40%. Add it whenever appropriate. For chatbots, also instruct: "Do not restate the question or add filler."

Prompt compression: shorter prompts, same task

Long system prompts and few-shot examples are common but often bloated. Compression techniques:

Audit for redundancy. Many system prompts repeat the same instruction in three ways. Pick the clearest version; drop the rest.
Fewer few-shot examples. 3-shot usually performs nearly as well as 5-shot. 1-shot is sometimes enough if the example is well-chosen.
Reference documentation in retrievable form. Instead of stuffing entire policy documents into the prompt, use RAG to retrieve relevant sections per query. The remaining prompt is small.
Compress with another LLM. Run a small model to summarize a long context document into the key 500 tokens. Use the summary instead.

Don't compress at the cost of quality. Test rigorously after compression; subtle prompt changes can shift behavior.

The cost monitoring dashboard you actually need

To optimize you must measure. Track at minimum:

Total spend by day, by model, by endpoint. Identifies what to optimize.
Cache hit rate per endpoint. Tells you whether caching is working.
Average input tokens, output tokens, total tokens per request. Catches drift over time.
Cost per request distribution (P50, P95, P99). Long-tail expensive requests dominate spend.
Cost per user/tenant. Allocates costs and identifies abusive users.
Cost per feature. Tells you what features are cost-effective vs not.

Both OpenAI and Anthropic expose usage in API responses (the usage object). Log it for every call along with feature/endpoint/user tags. Aggregate in your observability stack or a dedicated tool like Helicone, Langfuse, or Portkey.

Negotiated pricing and committed use

At sufficient volume, both major providers offer:

Volume discounts. Typically 5-15% off list price for committed monthly spend.
Provisioned throughput. Dedicated capacity with predictable latency and a fixed monthly fee. Anthropic's Provisioned Throughput, OpenAI's Provisioned Throughput Units. Pays off when sustained utilization is high enough that the dedicated capacity beats per-token billing.
Enterprise tiers. Customized terms, SLAs, support, and pricing.

Provisioned throughput becomes economically interesting around $50K+/month sustained spend. Below that, the on-demand per-token model is more flexible and usually cheaper.

Putting it together: a worked example

A SaaS application processes 10 million prompt tokens and 2 million output tokens per day on Claude Opus. Baseline cost at list prices ($15/M input, $75/M output):

daily_cost = (10M × $15/M) + (2M × $75/M)
           = $150 + $150 = $300/day = $9,000/month

Optimizations:

Model routing. Audit reveals 60% of requests are summarization that Sonnet handles equally well. Sonnet costs $3/M input, $15/M output (5x cheaper).
```
routed_cost = 0.4 × $300 + 0.6 × ($150 × 1/5 + $150 × 1/5)
            = $120 + $36 = $156/day = $4,680/month  (-48%)
```

Add prompt caching. 80% of input tokens are cacheable system prompts; cache hit rate steady-state is 90%.

cached_input_cost = 0.8 × (10M × $15/M × 0.1) [hits] + 0.2 × 10M × $15/M
                  = $12 + $30 = $42/day vs original $150 input
                  (Sonnet portion similar reduction)
saving on input: roughly 70% of input cost
new_cost ≈ $4,680 × 0.5 = $2,340/month (-74% total)

Move 30% of requests to Batch API (the ones that are backend processing without realtime needs):

batch_portion_savings = 0.3 × 0.5 = 15% additional saving
new_cost ≈ $2,340 × 0.85 = $1,989/month (-78% total)

Cap output tokens at 1500 instead of unlimited. Output volume drops 20%.

further saving: roughly 10% of total
new_cost ≈ $1,989 × 0.9 = $1,790/month (-80% total)

From $9,000/month to $1,790/month — a 5x cost reduction without any quality loss for the intended use cases. The engineering effort: a few weeks of work across the four changes. Annual savings: $86,000.

Frequently Asked Questions

What is the biggest cost lever for LLM applications?

Prompt caching, typically. Anthropic offers a 90% discount on cached input tokens; OpenAI offers 50%. For workloads with repeated prompt prefixes (system prompts, RAG context, agent loops), prompt caching alone reduces total cost by 50-80%. The second-largest lever is the batch API (50% discount) for any work that does not need real-time response. Together, batch + caching can reduce cost by an order of magnitude vs naive per-request real-time API use.

When should I use the Batch API vs the regular API?

Use Batch API whenever response time can tolerate hours instead of seconds. Both Anthropic and OpenAI offer 50% discount on Batch API requests with up to 24-hour completion windows. Good fits: nightly data processing, bulk classification of historical data, generating embeddings for large corpora, periodic report generation. Bad fits: anything user-facing, agentic workflows where the next step depends on the response, real-time chat.

Does using a smaller model save more money than caching?

Sometimes — depends on workload. Switching from a frontier model to a mid-tier model is typically a 5-10x cost reduction. Caching is usually a 2-5x cost reduction on the cacheable portion. If your prompts are non-repetitive, model routing wins; if they have stable prefixes, caching wins. The optimal strategy combines both: route to the smallest model that produces acceptable quality, AND cache the system prompt within that model.

How do I cap output tokens to control cost?

Set max_tokens in every request to a tight bound — 500 for short responses, 2000 for medium, 4000 for long. The provider stops generating once max_tokens is hit, even mid-sentence. Combined with stop sequences for structured output formats, this prevents the worst-case scenario of a runaway response. Output tokens are typically 3-5x more expensive than input tokens, so capping output has outsize cost impact.

Should I switch providers to save money?

Switch only if the cost difference is meaningful and the quality difference is acceptable for your use case. Headline per-token prices vary by ~30% between Anthropic, OpenAI, and Google for comparable-tier models; this is usually less than the savings available from caching, batching, and model routing within a single provider. Multi-provider strategies add operational complexity (multiple API contracts, rate limit pools, different SDKs) that often outweighs the savings unless you have specific reasons like vendor diversification or specific feature needs.

Run a Speed Test

Related Guides

Prompt Caching

The biggest single cost lever, explained in detail.

LLM Tokens, Bytes, and Bandwidth

Token efficiency by language and content type.

LLM Rate Limits

Batch API has much higher rate limits than realtime.

Self-Hosted LLM Networking

When per-token cost is no longer the right model.