LLM rate limits guide: design around TPM, RPM, queues, and retries

A practical guide to LLM API rate limits across OpenAI, Anthropic, Azure OpenAI, Bedrock, and Gemini: TPM, RPM, retry-after, backoff, queues, batching, fallbacks, and throughput planning.

Updated 2026-06-11 | 9 min read | Intermediate

Best for

Developers scaling LLM APIs beyond prototypes
SaaS teams planning traffic spikes and customer quotas
Platform engineers building retry, queue, and rate-limit middleware
Product teams that need graceful degradation instead of 429 errors

Not for

Teams looking for hardcoded tables of current limits, which go stale quickly
Clients that retry every request until the provider is overloaded
Capacity plans that ignore output tokens and long-running tool calls

Comparison

Choose by workflow, not brand

Option	Best for	Strengths	Tradeoffs	Use when
Simple backoff	Low-volume apps and occasional 429 responses	Easy to implement and recommended by multiple providers.	Not enough for sustained overload or customer-specific quota planning.	Traffic is modest and failures are rare.
Queue and token budget	Production workloads with predictable bursts, batch jobs, or customer tiers	Controls throughput before providers reject traffic.	Adds latency and product decisions about priority and cancellation.	You need fairness, priorities, and predictable capacity.
Fallback routing	High-availability products that need alternate models or providers	Keeps some user workflows alive during quota, outage, or latency events.	Requires behavior tests, schema compatibility, and observability.	A 429 or provider incident would otherwise become a customer-facing outage.

Model limits are multi-dimensional

LLM limits can include requests per minute, input tokens per minute, output tokens per minute, requests per day, tokens per day, region quota, deployment quota, or account tier. One dashboard number rarely tells the whole story.

Track input tokens, output tokens, requests, streaming duration, and retries.
Read retry-after headers where providers expose them.
Do not hardcode limits that can change by tier, region, model, or workspace.
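
The tracking advice above can be sketched as a rolling window that counts several dimensions at once. This is a minimal illustration, not a provider SDK feature: `UsageWindow`, its field names, and the 60-second window are assumptions, and streaming duration is left out for brevity.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class UsageEvent:
    timestamp: float
    input_tokens: int
    output_tokens: int
    was_retry: bool


class UsageWindow:
    """Rolling usage over the last `window_seconds` seconds (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.events = deque()

    def record(self, input_tokens, output_tokens, was_retry=False, now=None):
        # `now` is injectable so the window can be tested without sleeping.
        now = time.monotonic() if now is None else now
        self.events.append(UsageEvent(now, input_tokens, output_tokens, was_retry))

    def snapshot(self, now=None):
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self.events and self.events[0].timestamp <= cutoff:
            self.events.popleft()
        return {
            "requests": len(self.events),
            "input_tokens": sum(e.input_tokens for e in self.events),
            "output_tokens": sum(e.output_tokens for e in self.events),
            "retries": sum(1 for e in self.events if e.was_retry),
        }
```

Comparing each field of the snapshot against the limit your account actually reports, rather than a constant in code, shows which dimension is closest to its ceiling.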

Use queues for product control

Retries alone do not decide which work matters. Queues let product teams prioritize paying customers, interactive requests, background jobs, and internal batch work differently.

Separate interactive traffic from batch traffic.
Cancel or degrade low-priority work before exhausting quota.
Surface waiting states instead of silent retry loops.
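
One way to picture the queueing advice above is a priority queue that spends a per-window token budget on interactive work first and reports what is still waiting. The sketch below is illustrative only: `BudgetedQueue`, the two priority classes, and the token estimates are assumptions, and a real system would also need cancellation and per-customer fairness.

```python
import heapq
import itertools
from dataclasses import dataclass, field

INTERACTIVE, BATCH = 0, 1  # lower number means higher priority


@dataclass(order=True)
class Job:
    priority: int
    seq: int  # tie-breaker that keeps submission order within a priority
    name: str = field(compare=False)
    est_tokens: int = field(compare=False)


class BudgetedQueue:
    """Dispatch jobs by priority until the window's token budget is spent."""

    def __init__(self, tokens_per_window):
        self.tokens_per_window = tokens_per_window
        self._heap = []
        self._seq = itertools.count()

    def submit(self, name, est_tokens, priority):
        heapq.heappush(self._heap, Job(priority, next(self._seq), name, est_tokens))

    def drain_window(self):
        """Return (dispatched, waiting) job names for one budget window."""
        remaining = self.tokens_per_window
        dispatched, waiting = [], []
        while self._heap:
            job = heapq.heappop(self._heap)
            if job.est_tokens <= remaining:
                remaining -= job.est_tokens
                dispatched.append(job.name)
            else:
                waiting.append(job)
        # Jobs that did not fit stay queued for the next window.
        for job in waiting:
            heapq.heappush(self._heap, job)
        return dispatched, [job.name for job in waiting]


q = BudgetedQueue(tokens_per_window=1000)
q.submit("nightly-summary", 600, BATCH)
q.submit("chat-1", 500, INTERACTIVE)
q.submit("chat-2", 400, INTERACTIVE)
dispatched, waiting = q.drain_window()
# dispatched == ["chat-1", "chat-2"], waiting == ["nightly-summary"]
```

The `waiting` list is what a product can surface as a visible waiting state. Note that a job larger than the whole window budget would wait forever in this sketch, so a real queue should reject or split it.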

Reduce tokens before adding providers

Many rate-limit problems are prompt design problems. Smaller context, caching, batching, summarization, and routing simple requests to cheaper models can reduce token pressure before a multi-provider architecture is needed.

Trim repeated system and retrieval context.
Use prompt caching where the provider supports it.
Batch offline tasks instead of competing with interactive requests.
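
Trimming repeated context can be as simple as dropping duplicate retrieval chunks and stopping at a token budget. The sketch below assumes chunks arrive ranked best-first and uses a crude four-characters-per-token estimate; both `estimate_tokens` and `trim_context` are hypothetical helpers, and a real implementation should count tokens with the provider's tokenizer.

```python
def estimate_tokens(text):
    # Rough assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def trim_context(chunks, max_tokens):
    """Keep unique chunks, in ranked order, until the token budget is spent."""
    kept, seen, used = [], set(), 0
    for chunk in chunks:
        # Normalize whitespace and case so near-identical repeats are dropped.
        key = " ".join(chunk.split()).lower()
        if key in seen:
            continue
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break  # chunks are ranked, so stop at the first one that does not fit
        seen.add(key)
        kept.append(chunk)
        used += cost
    return kept, used
```

Every token removed here counts against neither the input-token limit nor the bill, which is why this step usually comes before adding a second provider.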

Decision Rules

A practical checklist

Use exponential backoff for occasional 429s.

Use queues and token budgets for sustained production traffic.

Use fallback routing only after schema and quality tests pass.

Plan capacity using successful task throughput, not only raw token limits.
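
The last rule can be made concrete with a small calculation: divide each limit by what one task consumes in that dimension, take the smallest ceiling, and discount by the share of tasks that finish successfully. The function and every number below are hypothetical planning inputs, not real provider limits.

```python
def plan_capacity(limits, per_task, success_rate):
    """Return (successful tasks per minute, name of the binding limit)."""
    # Ceiling on attempted tasks per minute in each limited dimension.
    ceilings = {dim: limits[dim] / per_task[dim] for dim in limits}
    bottleneck = min(ceilings, key=ceilings.get)
    return ceilings[bottleneck] * success_rate, bottleneck


# Hypothetical per-minute limits for one deployment.
limits = {"requests": 500, "input_tokens": 200_000, "output_tokens": 40_000}
# Hypothetical consumption of one complete task, including tool calls and retries.
per_task = {"requests": 3, "input_tokens": 6_000, "output_tokens": 900}

tasks, bottleneck = plan_capacity(limits, per_task, success_rate=0.9)
# Roughly 30 successful tasks per minute, bound by input tokens.
```

In this example the request limit would allow far more tasks than the input-token limit does, so raising requests per minute alone would not add capacity.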

FAQ

Common questions

What is TPM in LLM APIs?

TPM usually means tokens per minute. Some providers separate input tokens and output tokens, so capacity planning should track both.

How should I handle 429 errors from LLM APIs?

Use exponential backoff with jitter for occasional errors, but add queues, token budgets, and fallback routing for sustained production traffic.
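
A minimal sketch of exponential backoff with jitter follows. It assumes your client wrapper raises a hypothetical `RateLimited` error on HTTP 429 and passes along any retry-after hint in seconds; real SDKs expose this differently, so treat the names as placeholders.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a client wrapper raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the provider sent a hint


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn, max_attempts=5, sleep=time.sleep):
    """Call fn, retrying on RateLimited with jittered exponential delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller queue, degrade, or fall back
            delay = backoff_delay(attempt)
            if err.retry_after is not None:
                # Never retry sooner than the provider asked.
                delay = max(delay, err.retry_after)
            sleep(delay)
```

The attempt cap matters as much as the delay: once it is reached, the request should go to a queue or fallback path rather than loop.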

Are rate limits the same across OpenAI, Anthropic, Azure, Bedrock, and Gemini?

No. Limits vary by provider, model, region, tier, deployment, and account. Read current docs and use provider APIs where available instead of hardcoding assumptions.

Source Links

Primary references used for this guide

OpenAI rate limit cookbook: OpenAI cookbook example for handling rate limits with backoff.
OpenAI rate-limit best practices: OpenAI help article on preventing rate-limit errors.
Claude API rate limits: Anthropic documentation for Claude API rate limits.
Azure OpenAI quota management: Microsoft documentation for managing Azure OpenAI quota.
Amazon Bedrock quotas: AWS documentation for Amazon Bedrock quotas.

Build your own evaluation note

The strongest decision is always local to your workflow. Save the vendor links, define a representative task, record the exact prompt or command, and compare the final evidence instead of the marketing claim.
