Guozhen AI: Global AI field notes and model intelligence

AI operations

LLM rate limits guide: design around TPM, RPM, queues, and retries

A practical guide to LLM API rate limits across OpenAI, Anthropic, Azure OpenAI, Bedrock, and Gemini: TPM, RPM, retry-after, backoff, queues, batching, fallbacks, and throughput planning.

Updated 2026-06-11 | 9 min read | Intermediate

Best for

  • Developers scaling LLM APIs beyond prototypes
  • SaaS teams planning traffic spikes and customer quotas
  • Platform engineers building retry, queue, and rate-limit middleware
  • Product teams that need graceful degradation instead of 429 errors

Not for

  • Hardcoded current limit tables that become stale
  • Retrying every request until the provider is overloaded
  • Ignoring output tokens and long-running tool calls

Comparison

Choose by workflow, not brand

| Option | Best for | Strengths | Tradeoffs | Use when |
| --- | --- | --- | --- | --- |
| Simple backoff | Low-volume apps and occasional 429 responses | Easy to implement and recommended by multiple providers. | Not enough for sustained overload or customer-specific quota planning. | Traffic is modest and failures are rare. |
| Queue and token budget | Production workloads with predictable bursts, batch jobs, or customer tiers | Controls throughput before providers reject traffic. | Adds latency and product decisions about priority and cancellation. | You need fairness, priorities, and predictable capacity. |
| Fallback routing | High-availability products that need alternate models or providers | Keeps some user workflows alive during quota, outage, or latency events. | Requires behavior tests, schema compatibility, and observability. | A 429 or provider incident becomes a customer-facing outage. |
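
The fallback routing option is the least obvious of the three to implement, so here is a minimal sketch of it. Everything named here is hypothetical: `ProviderUnavailable`, `valid_reply`, `route_with_fallback`, and the assumption that every route returns a JSON object with an `answer` key. The point it illustrates is that a fallback reply is only accepted after it passes the same schema check as the primary.

```python
import json


class ProviderUnavailable(Exception):
    """Hypothetical error a client wrapper raises on a 429, quota, or outage."""


def valid_reply(text, required_keys=("answer",)):
    # Assumed contract: every route must return a JSON object with these keys.
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)


def route_with_fallback(prompt, routes):
    """Try each (name, call) pair in order and return the first valid reply."""
    errors = []
    for name, call in routes:
        try:
            reply = call(prompt)
        except ProviderUnavailable as err:
            errors.append((name, str(err)))
            continue
        if valid_reply(reply):
            return name, reply
        # A reply that breaks the schema is treated like a failed route.
        errors.append((name, "schema check failed"))
    raise ProviderUnavailable(f"all routes failed: {errors}")
```

Returning the route name alongside the reply makes it easy to log how often traffic actually lands on a fallback, which is the observability cost the table mentions.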

Model limits are multi-dimensional

LLM limits can include requests per minute, input tokens per minute, output tokens per minute, requests per day, tokens per day, region quota, deployment quota, or account tier. One dashboard number rarely tells the whole story.

  • Track input tokens, output tokens, requests, streaming duration, and retries.
  • Read retry-after headers where providers expose them.
  • Do not hardcode limits that can change by tier, region, model, or workspace.
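
As a rough illustration of reading a retry-after header, the sketch below handles the two forms HTTP allows for `Retry-After`: a number of seconds or an HTTP date. The function name `retry_after_seconds`, the 1-second default, and the 60-second cap are assumptions made for the example, not values any provider prescribes.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, now=None, default=1.0, cap=60.0):
    """Return a wait time in seconds derived from a Retry-After header."""
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return default
    raw = raw.strip()
    try:
        wait = float(raw)  # delta-seconds form, e.g. "20"
    except ValueError:
        try:
            target = parsedate_to_datetime(raw)  # HTTP-date form
        except (TypeError, ValueError):
            return default
        if target.tzinfo is None:
            target = target.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        wait = (target - now).total_seconds()
    # Never wait a negative time, and never longer than the assumed cap.
    return max(0.0, min(wait, cap))
```

Some providers also expose their own rate-limit headers with remaining quota or reset times; those names differ by provider, so check the current documentation before relying on them.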

Use queues for product control

Retries alone do not decide which work matters. Queues let product teams prioritize paying customers, interactive requests, background jobs, and internal batch work differently.

  • Separate interactive traffic from batch traffic.
  • Cancel or degrade low-priority work before exhausting quota.
  • Surface waiting states instead of silent retry loops.
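
One way to picture the queueing points above is a small priority dispatcher. This is a sketch under stated assumptions: the lane names (`INTERACTIVE`, `BACKGROUND`, `BATCH`), the `PriorityDispatcher` class, and the per-window token budget are all hypothetical, and token estimates are supplied by the caller.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority lanes; a lower number is served first.
INTERACTIVE, BACKGROUND, BATCH = 0, 1, 2


@dataclass(order=True)
class Job:
    priority: int
    seq: int  # tie-breaker that keeps FIFO order inside a lane
    name: str = field(compare=False)
    est_tokens: int = field(compare=False)


class PriorityDispatcher:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name, priority, est_tokens):
        job = Job(priority, next(self._counter), name, est_tokens)
        heapq.heappush(self._heap, job)

    def drain(self, token_budget):
        """Dispatch jobs in strict priority order until the budget runs out.

        Returns (dispatched, waiting). Waiting jobs stay queued so the
        product can show a waiting state, degrade them, or cancel them.
        """
        dispatched = []
        while self._heap and self._heap[0].est_tokens <= token_budget:
            job = heapq.heappop(self._heap)
            token_budget -= job.est_tokens
            dispatched.append(job.name)
        waiting = [job.name for job in sorted(self._heap)]
        return dispatched, waiting
```

Strict priority is a deliberate simplification here: a large interactive job at the head of the queue blocks smaller batch jobs behind it. A production queue would likely add aging or per-customer fairness.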

Reduce tokens before adding providers

Many rate-limit problems are prompt design problems. Smaller context, caching, batching, summarization, and cheaper routing can reduce pressure before multi-provider architecture is needed.

  • Trim repeated system and retrieval context.
  • Use prompt caching where the provider supports it.
  • Batch offline tasks instead of competing with interactive requests.
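
To make the trimming point concrete, here is a minimal sketch that drops duplicate retrieval chunks and stops adding context once a token budget is reached. `estimate_tokens` is a hypothetical helper using a rough four-characters-per-token guess; a real system should count tokens with the tokenizer for the model in use.

```python
def estimate_tokens(text):
    # Rough assumption: about 4 characters per token. Swap in the
    # provider's tokenizer for real budgeting.
    return max(1, len(text) // 4)


def build_context(chunks, max_tokens):
    """Drop duplicate chunks, then keep chunks in order until the budget is hit."""
    seen = set()
    kept, used = [], 0
    for chunk in chunks:
        # Normalize whitespace and case so near-identical chunks collapse.
        key = " ".join(chunk.split()).lower()
        if key in seen:
            continue
        seen.add(key)
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used
```

The function assumes chunks arrive ranked by relevance, so cutting from the tail removes the least useful context first.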

Decision Rules

A practical checklist

  01. Use exponential backoff for occasional 429s.
  02. Use queues and token budgets for sustained production traffic.
  03. Use fallback routing only after schema and quality tests pass.
  04. Plan capacity using successful task throughput, not only raw token limits.
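
The last rule is easier to apply with a worked calculation. The sketch below is an assumption-laden model, not a provider formula: it supposes each task makes a fixed number of calls, that retries add a fractional overhead to every call, and that limits are expressed as RPM plus separate input and output TPM. All names and numbers are hypothetical.

```python
def plan_capacity(rpm, input_tpm, output_tpm, calls_per_task,
                  input_tokens_per_call, output_tokens_per_call,
                  retry_overhead, success_rate):
    """Estimate successful tasks per minute and name the limiting quota."""
    # Retries consume quota too, so inflate the call count per task.
    effective_calls = calls_per_task * (1 + retry_overhead)
    ceilings = {
        "rpm": rpm / effective_calls,
        "input_tpm": input_tpm / (effective_calls * input_tokens_per_call),
        "output_tpm": output_tpm / (effective_calls * output_tokens_per_call),
    }
    # The tightest ceiling decides how many tasks can be attempted.
    bottleneck = min(ceilings, key=ceilings.get)
    attempted = ceilings[bottleneck]
    return {
        "bottleneck": bottleneck,
        "attempted_tasks_per_min": attempted,
        "successful_tasks_per_min": attempted * success_rate,
    }


plan = plan_capacity(
    rpm=500, input_tpm=200_000, output_tpm=40_000,
    calls_per_task=4, input_tokens_per_call=2_000,
    output_tokens_per_call=500, retry_overhead=0.25, success_rate=0.9,
)
```

With these made-up inputs, the request limit would allow 100 tasks per minute and the input limit 20, but the output limit allows only 16 attempted tasks, of which about 14.4 succeed. Raw token limits alone would have overstated usable capacity.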

FAQ

Common questions

What is TPM in LLM APIs?

TPM usually means tokens per minute. Some providers separate input tokens and output tokens, so capacity planning should track both.
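
Tracking both dimensions can be sketched with two token buckets, one per direction. The `TokenBucket` and `TpmBudget` classes are hypothetical, the limits passed in are placeholders to be loaded from configuration, and the code is not thread-safe. It reserves the request's maximum output tokens up front, which is a conservative assumption.

```python
import time


class TokenBucket:
    """Refills continuously, up to `per_minute` tokens."""

    def __init__(self, per_minute, clock=time.monotonic):
        self.capacity = float(per_minute)
        self.rate = per_minute / 60.0  # tokens restored per second
        self.level = self.capacity
        self.clock = clock
        self.updated = clock()

    def available(self):
        now = self.clock()
        elapsed = now - self.updated
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.updated = now
        return self.level

    def take(self, n):
        self.level -= n


class TpmBudget:
    def __init__(self, input_tpm, output_tpm, clock=time.monotonic):
        self.input = TokenBucket(input_tpm, clock)
        self.output = TokenBucket(output_tpm, clock)

    def try_reserve(self, input_tokens, max_output_tokens):
        """Reserve from both buckets, or from neither if either is short."""
        if (self.input.available() >= input_tokens
                and self.output.available() >= max_output_tokens):
            self.input.take(input_tokens)
            self.output.take(max_output_tokens)
            return True
        return False
```

A caller that gets `False` should queue the request or wait rather than send it and collect a 429.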

How should I handle 429 errors from LLM APIs?

Use exponential backoff with jitter for occasional errors, but add queues, token budgets, and fallback routing for sustained production traffic.
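
A minimal sketch of exponential backoff with jitter follows. It uses the "full jitter" variant, where each delay is a random value between zero and an exponentially growing ceiling. The `RateLimited` exception, the `call_with_backoff` helper, and the default values are hypothetical; map them onto whatever your client library raises for a 429.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception a client wrapper raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the provider sent one


def call_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimited with capped, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller queue or fall back
            ceiling = min(cap, base * 2 ** attempt)
            delay = rng() * ceiling  # full jitter spreads out retries
            if err.retry_after is not None:
                # Never retry sooner than the provider asked.
                delay = max(delay, err.retry_after)
            sleep(delay)
```

The attempt limit matters as much as the delay: once it is exhausted, the request should move to a queue or a fallback route instead of looping.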

Are rate limits the same across OpenAI, Anthropic, Azure, Bedrock, and Gemini?

No. Limits vary by provider, model, region, tier, deployment, and account. Read current docs and use provider APIs where available instead of hardcoding assumptions.

Build your own evaluation note

The strongest decision is always local to your workflow. Save the vendor links, define a representative task, record the exact prompt or command, and compare the final evidence instead of the marketing claim.
