LLM Rate Limits and 429 Handling

LLM APIs are GPU-bound services with hard capacity limits. Unlike a typical web API, which can scale transparently by adding servers, an LLM API is backed by finite GPU inventory, so providers enforce strict per-account rate limits. Exceeding those limits returns HTTP 429 errors, and bad retry behavior at scale can make an outage worse instead of better. This guide explains TPM and RPM, how to size for bursty workloads, the correct backoff algorithm, and the queue patterns that keep applications smooth under load.

The two rate-limit dimensions: TPM and RPM

Every major LLM provider enforces two parallel limits:

  • TPM (Tokens Per Minute): Total input + output tokens summed across all requests in a rolling 60-second window. The binding constraint for applications with large prompts.
  • RPM (Requests Per Minute): Count of requests, regardless of token size. The binding constraint for applications with many small prompts.

You hit a 429 when EITHER limit is exceeded — both apply simultaneously. Typical 2026 default tier limits per account/model:

Tier            Anthropic Claude Opus          OpenAI GPT-4o
--------------  -----------------------------  --------------------------
Free / trial    50K TPM, 50 RPM                30K TPM, 500 RPM
Tier 1 (paid)   200K TPM, 4,000 RPM            30K TPM, 500 RPM
Tier 2-3        400K-1.2M TPM, scaling RPM     450K-800K TPM, 5K-10K RPM
Enterprise      Custom, often unlimited        Custom, often unlimited

Limits scale with usage history and account tier. Both providers auto-promote accounts to higher tiers based on consistent paid usage over weeks.

How rate limits actually behave

The implementation is usually a sliding-window or token-bucket algorithm. A token-bucket version:

  • The bucket has capacity = your TPM (or RPM).
  • Tokens regenerate at rate = TPM / 60 per second.
  • Each request deducts its input tokens from the bucket at submission; output tokens are deducted as they are generated.
  • When the bucket is empty, requests get 429.

The output-token mechanic is the subtle one. A streaming request estimated at 500 input + 1,000 output tokens consumes 500 tokens immediately and the remaining 1,000 as the response generates. A long streaming request can therefore drain the bucket gradually, so new requests start receiving 429s partway through it.
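The bucket mechanics above can be sketched in a few lines. This is a minimal client-side model, not any provider's actual implementation; the class name TokenBucket and the method try_consume are illustrative, and the optional now argument exists only so the behavior can be checked deterministically.

```python
import time


class TokenBucket:
    """Client-side model of the token-bucket limiter described above."""

    def __init__(self, per_minute, now=None):
        self.capacity = float(per_minute)      # bucket capacity = TPM (or RPM)
        self.refill_rate = per_minute / 60.0   # units regenerated per second
        self.level = self.capacity             # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def _refill(self, now):
        # Add back whatever regenerated since the last update, up to capacity.
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.refill_rate)
        self.updated = now

    def try_consume(self, tokens, now=None):
        """Deduct tokens and return True if available; False models a 429."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        if tokens > self.level:
            return False
        self.level -= tokens
        return True
```

For a streaming request, one way to mirror the output-token behavior is to call try_consume with the input token count at submission and again with each chunk's output tokens as they arrive.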

What a 429 response looks like

The standard 429 from Anthropic:

HTTP/2 429
retry-after: 12
anthropic-ratelimit-tokens-remaining: 0
anthropic-ratelimit-tokens-reset: 2026-05-25T14:32:18Z
anthropic-ratelimit-requests-remaining: 2

{"error":{"type":"rate_limit_error","message":"Rate limit exceeded: tokens per minute"}}

OpenAI is similar:

HTTP/2 429
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 12s
retry-after: 12

{"error":{"type":"rate_limit_exceeded","message":"Rate limit reached..."}}

Key signals:

  • Retry-After header (in seconds or as an HTTP date) — the server's recommendation for when to retry. Use this when present.
  • Ratelimit-Remaining headers — how much of each limit you have left. Visible on every response, not just 429s. Use them to back off pre-emptively.
  • Ratelimit-Reset headers — when the rolling window resets.
  • Error type — distinguishes TPM exhaustion from RPM exhaustion. Helpful for diagnosing which limit you are hitting.
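Because Retry-After may arrive either as a number of seconds or as an HTTP date, a client needs to handle both forms before sleeping. The helper below is a small sketch using only the Python standard library; the function name retry_after_seconds and its now parameter are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to a non-negative delay in seconds.

    Returns None when the header is missing or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))          # delta-seconds form, e.g. "12"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)    # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A None result means there is no usable server hint, in which case the caller falls back to computed backoff.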

Exponential backoff with full jitter — the correct algorithm

The standard backoff algorithm for LLM retries:

import random
import time

base_delay = 1.0   # seconds
max_delay = 60.0   # upper bound on any computed backoff
max_retries = 6

def call_with_retries(call_llm):
    # call_llm: any callable returning an object with .status and .headers
    for retry_count in range(max_retries + 1):
        response = call_llm()
        if response.status == 200:
            return response
        if response.status != 429 and response.status < 500:
            raise Exception(f"Unrecoverable: {response.status}")
        if retry_count == max_retries:
            break  # retries exhausted; do not sleep before giving up
        retry_after = response.headers.get('retry-after')
        if response.status == 429 and retry_after:
            # Server hint wins. Assumes the delta-seconds form;
            # an HTTP-date value would need parsing first.
            delay = float(retry_after)
        else:
            # Full jitter: random between 0 and the capped exponential ceiling
            # (same backoff for 429s without a hint and transient 5xx errors)
            ceiling = min(base_delay * (2 ** retry_count), max_delay)
            delay = random.uniform(0, ceiling)
        time.sleep(delay)
    raise Exception("Max retries exceeded")

Key properties:

  • Honor Retry-After when the server provides it. The server knows when capacity frees up; your guess is worse than its answer.
  • Full jitter, not partial jitter. The random number is uniform across the entire exponential window, not partway through. This spreads retrying clients evenly across the recovery window, avoiding thundering herd.
  • Exponential ceiling, not exponential delay. The exponential term sets only the upper bound of the random range, so any individual wait may be short or long. Across many retrying clients, this keeps retries evenly spread.
  • Cap the delay. Never sleep longer than max_delay (typically 60 seconds). Beyond that, give up and surface the failure to the caller.
  • Cap retries. With a 1-second base, 6 retries allow up to 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds of computed backoff (about half that on average with full jitter). If the API has not recovered by then, treat it as down and fail gracefully.

The thundering herd problem

Without jitter, every client that gets 429-d at the same moment retries together a fixed delay later. They all hit the API simultaneously and immediately 429 again. The API never recovers because every retry wave is synchronized.

This is a classic distributed systems issue. The fix is the random jitter — by spreading retries randomly across the recovery window, the API gets a chance to clear its backlog smoothly.

Partial jitter (a fixed delay plus a small random window) is better than no jitter but still creates a "peak then ebb" pattern. Full jitter (random across the whole window) produces a uniform retry distribution and is the standard recommendation in the AWS Builders' Library and the AWS Architecture Blog article "Exponential Backoff and Jitter".
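A quick simulation makes the difference concrete. The sketch below is an illustration under simplifying assumptions: 1,000 hypothetical clients all receive a 429 at the same instant, the backoff window is 8 seconds, and "partial jitter" is modeled as half the window fixed plus half random. It reports the busiest one-second bucket for each strategy.

```python
import random
from collections import Counter


def peak_retries_per_second(delays):
    """Largest number of retries landing in any one-second bucket."""
    buckets = Counter(int(d) for d in delays)
    return max(buckets.values())


def simulate(clients=1000, window=8.0, seed=0):
    rng = random.Random(seed)
    # No jitter: every client waits exactly `window` seconds.
    no_jitter = [window for _ in range(clients)]
    # Partial jitter: fixed half-window plus a random half-window.
    partial = [window / 2 + rng.uniform(0, window / 2) for _ in range(clients)]
    # Full jitter: uniform over the whole window.
    full = [rng.uniform(0, window) for _ in range(clients)]
    return {
        "no_jitter": peak_retries_per_second(no_jitter),
        "partial_jitter": peak_retries_per_second(partial),
        "full_jitter": peak_retries_per_second(full),
    }
```

Without jitter the whole population lands in a single second; partial jitter squeezes it into the second half of the window; full jitter spreads it across the entire window, which gives the lowest peak of the three.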

Pre-emptive rate limiting on your side

Reactive 429 handling (backing off only after a 429) works but leaves performance on the table. A better pattern is to track your own request rate and throttle before the provider does.

Implementation patterns:

  • Token bucket on your side. Maintain a local rate limiter with capacity equal to your provider TPM/RPM. Block requests when the bucket is empty rather than sending them and getting 429.
  • Watch Ratelimit-Remaining headers. Every successful response includes these. If remaining tokens drop below a threshold, throttle preemptively until the reset time.
  • Distributed coordinator. For multi-instance applications, use Redis or a similar shared counter so different instances do not collectively overshoot the limit.

This adds complexity but significantly smooths behavior under load. The provider sees a clean, even traffic pattern instead of bursts followed by collapse.
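As one illustration of the header-watching pattern, the function below decides how long to pause based on the most recent response. It is a sketch that assumes the Anthropic-style header names and RFC 3339 reset timestamp shown in the 429 example earlier; the function name preemptive_wait and the min_tokens threshold are hypothetical, and other providers use different header names and reset formats.

```python
from datetime import datetime, timezone


def preemptive_wait(headers, min_tokens, now=None):
    """Seconds to pause before the next request, from the last response's headers."""
    remaining = headers.get("anthropic-ratelimit-tokens-remaining")
    reset = headers.get("anthropic-ratelimit-tokens-reset")
    if remaining is None or reset is None:
        return 0.0    # no signal available: do not throttle
    if int(remaining) >= min_tokens:
        return 0.0    # enough token budget left for another request
    # Reset is a timestamp such as 2026-05-25T14:32:18Z
    reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())
```

A reasonable choice for min_tokens is the size of a typical request, so the client pauses only when the next call would likely be rejected.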

Queue patterns for bursty workloads

Many LLM applications have bursty traffic — a marketing campaign triggers 100x normal load, an automated job kicks off 10,000 requests in 10 seconds, a webhook delivers 5,000 events at once. The queue is the right tool.

Simple queue + worker pool

Requests go to a queue. A fixed pool of N worker processes pull from the queue and submit to the LLM API at a sustainable rate. The queue absorbs bursts; workers consume them at the steady-state rate.

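requests_per_second = min(TPM / average_total_tokens, RPM) / 60
# (per-second steady-state throughput)
worker_count = ceil(requests_per_second * average_request_latency_seconds)
# (Little's law: concurrency needed to sustain that throughput)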

Trade-off: requests wait in the queue when there is a burst. Tune queue size to the maximum acceptable user wait time. Beyond that limit, return an error immediately rather than queueing further.
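The sizing arithmetic can be written out as a short worked example. This is a sketch: the function names are illustrative, and the average latency and maximum wait are inputs you would have to measure or choose for your own application.

```python
import math


def sustainable_rps(tpm, rpm, avg_total_tokens):
    """Steady-state requests per second allowed by whichever limit binds first."""
    return min(tpm / avg_total_tokens, rpm) / 60.0


def worker_count(rps, avg_latency_seconds):
    """Concurrent workers needed to sustain rps (Little's law: L = lambda * W)."""
    return max(1, math.ceil(rps * avg_latency_seconds))


def max_queue_depth(rps, max_wait_seconds):
    """Queue length that drains within max_wait_seconds at the sustainable rate."""
    return math.floor(rps * max_wait_seconds)
```

For instance, with 30K TPM, 500 RPM, and requests averaging 500 total tokens, TPM is the binding limit at 60 requests per minute, or 1 request per second. At an assumed 4-second average latency that calls for 4 workers, and a 30-second wait budget allows a queue of 30 requests.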

Priority queue

For mixed workloads (interactive user-facing requests + batch background jobs), use a priority queue. Interactive requests get priority; batch jobs fill remaining capacity. This trades batch latency for predictable interactive latency.
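A minimal in-process version of this idea can be built on the standard library heap. The sketch below is illustrative only: the class name and priority constants are hypothetical, and a production system would more likely use a broker or a shared store that supports priorities.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1   # lower number = higher priority


class PriorityRequestQueue:
    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so requests stay FIFO within a priority.
        self._order = itertools.count()

    def put(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def get(self):
        """Pop the highest-priority, oldest request; raises IndexError when empty."""
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)
```

Workers always call get, so interactive requests are served first and batch requests are served only when no interactive request is waiting.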

Provider-side batch API

Both OpenAI and Anthropic offer asynchronous Batch APIs at 50% discount with up to 24-hour completion windows and substantially higher rate limits. For workloads that do not need real-time response, batch is the cost-and-throughput winner. Submit thousands of requests; receive results when they complete.

What to retry, what to fail fast

Status                        Action
----------------------------  --------------------------------------------------------------------
429 Too Many Requests         Retry with backoff. Honor Retry-After.
500, 502, 503                 Retry with backoff. Transient server error.
504 Gateway Timeout           Retry once with shorter prompt. Sometimes a model congestion signal.
529 Overloaded (Anthropic)    Retry with backoff. Provider-side capacity issue.
400 Bad Request               Do NOT retry. Fix the request.
401 Unauthorized              Do NOT retry. Fix the API key.
403 Forbidden                 Do NOT retry. Policy violation or quota exhausted.
413 Payload Too Large         Do NOT retry. Shrink the prompt.

Critical: never retry 4xx errors other than 429. They indicate a problem with your request, not transient server load. Retrying them wastes API calls and money on a guaranteed-failing request.
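The table translates directly into a small lookup. This sketch uses illustrative constant and function names; treating every unlisted status as fail-fast is an assumption chosen here for safety, not something the table specifies.

```python
RETRY_WITH_BACKOFF = "retry_with_backoff"
RETRY_ONCE = "retry_once"
FAIL_FAST = "fail_fast"


def retry_action(status):
    """Map an HTTP status code to the action listed in the table above."""
    if status in (429, 500, 502, 503, 529):
        return RETRY_WITH_BACKOFF
    if status == 504:
        return RETRY_ONCE    # one retry, ideally with a shorter prompt
    return FAIL_FAST         # 400, 401, 403, 413, and anything unlisted
```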

Observability for rate-limit health

Metrics to track:

  • Request rate (requests/sec, both sent and successful).
  • 429 rate as a percentage of total requests. Healthy: under 1%. Concerning: over 5%.
  • Retry count distribution. Most requests should succeed on the first try; the length of the tail shows how close you are running to your limits.
  • Ratelimit-Remaining trend per minute. Trending toward zero signals capacity exhaustion.
  • Queue depth and oldest-item-age if using a queue. Growing queue means insufficient sustained throughput.

Alert on:

  • 429 rate above 5% sustained for 5+ minutes (you are undersized for current demand).
  • Queue depth growing without recovery (capacity is structurally insufficient, not a burst).
  • Average retries per request above 1.5 (chronic 429-ing).
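Two of these alert conditions can be evaluated from per-minute counters. The function below is a sketch under assumed inputs: each entry in minutes is a hypothetical dict with requests, status_429, and retries counts for one minute, ordered oldest first. The thresholds default to the values listed above.

```python
def rate_limit_alerts(minutes, rate_threshold=0.05, sustained_minutes=5,
                      retry_threshold=1.5):
    """Return the names of the alert conditions met by recent per-minute counters."""
    alerts = []
    recent = minutes[-sustained_minutes:]
    # 429 rate must exceed the threshold in every one of the last N minutes.
    if len(recent) == sustained_minutes and all(
        m["requests"] > 0 and m["status_429"] / m["requests"] > rate_threshold
        for m in recent
    ):
        alerts.append("429_rate_sustained")
    # Average retries per request over the same recent window.
    total_requests = sum(m["requests"] for m in recent)
    total_retries = sum(m["retries"] for m in recent)
    if total_requests and total_retries / total_requests > retry_threshold:
        alerts.append("high_average_retries")
    return alerts
```

Queue-depth growth is left out because detecting "growing without recovery" depends on how the queue is instrumented.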

Frequently Asked Questions

What is the difference between TPM and RPM rate limits?

TPM (Tokens Per Minute) limits the total tokens — input plus output — across all requests in a rolling 60-second window. RPM (Requests Per Minute) limits the count of requests regardless of token size. Both apply simultaneously; you hit a 429 when either limit is exceeded. TPM is the binding constraint for applications with large prompts; RPM is the binding constraint for applications with many small prompts. Provider-side, TPM is usually the harder limit to raise because it directly maps to GPU capacity.

What is the right backoff strategy for LLM 429 errors?

Exponential backoff with full jitter. On the first 429, wait random(0, base_delay) seconds. On the next 429, double the upper bound: random(0, base_delay * 2). And so on, capped at a maximum (typically 60 seconds). Random jitter prevents thundering herd — without jitter, all retrying clients hit the API simultaneously after each backoff window. If the response includes a Retry-After header, use that value directly instead of computed backoff; the server knows when capacity will free up.

Do streaming responses count against TPM in real time?

The output tokens count toward TPM as they are generated, not at request submission. This means a long-running streaming request can push the token meter past the limit mid-stream, and subsequent new requests will be 429-d even though the current one continues. For high-throughput applications, account for in-flight output tokens when deciding whether to submit new requests.

How do I size a request queue for an LLM application?

Calculate sustainable steady-state throughput as min(TPM / average_total_tokens, RPM) requests per minute. Size the queue to absorb bursts up to your tolerable wait time: if users can tolerate waiting 30 seconds during a burst, queue depth = 30 seconds × sustainable requests per second (the per-minute figure divided by 60). Beyond that depth, drop requests or rate-limit pre-emptively in your application. Critical: monitor queue depth as a metric and alert before it grows unbounded.

Can I burst above my LLM rate limit?

Some providers allow short bursts above the steady-state limit (often 1.5-2x for 1-2 seconds) before strict enforcement kicks in. This is implementation-dependent and not contractually guaranteed. Design for the documented limit; treat any burst headroom as a bonus, not a planning input. The provider can tighten enforcement at any time.
