AI economics

Prompt caching guide: reduce LLM latency and token cost without changing the product

Learn when prompt caching helps, how OpenAI, Anthropic, and Gemini caching differ, and how to design prompts, RAG context, and agent workflows for cache hits.

Updated 2026-06-118 min readIntermediate

Read AI API cost guide Open API cost calculator

AI Buyer Readiness Scorecard

Turn this guide into procurement, security, ROI, rollout, and governance questions.

Use the scorecard before opening vendor pricing pages. It keeps commercial AI research tied to the workflow, data risk, operating cost, and evidence buyers need before a shortlist becomes a purchase.

Procurement trigger

Define the business event behind the search: budget review, renewal, security review, failed pilot, new workflow, or vendor consolidation.

Data and security review

Check whether prompts, files, logs, embeddings, customer records, regulated data, or source code will touch the AI system.

ROI and operating cost

Estimate seat cost, API usage, implementation time, review effort, support load, fallback work, and expected workflow savings.

Integration and rollout path

Map the tools, identity systems, data sources, approval steps, change management, and users needed for a real deployment.

Governance evidence

Collect policies, evals, audit logs, human review rules, incident response, vendor terms, and owner names before procurement asks.

Best for

Teams sending repeated system prompts, tool schemas, or policy text
RAG apps that reuse large instruction and document prefixes
Coding agents that keep stable repository context across many calls
Product teams reducing latency without lowering answer quality

Not for

One-off prompts with little repeated context
Workflows where user content changes the whole prompt prefix every time
Cost planning without checking current provider pricing and cache rules

Comparison

Choose by workflow, not brand

Option	Best for	Strengths	Tradeoffs	Use when
Automatic caching	Apps where the provider detects repeated prompt prefixes without application-managed cache objects	Minimal code change and easy to adopt.	Less explicit control over cache lifetime and hit behavior.	You have stable prompt prefixes and want lower latency or cached-token pricing with low engineering overhead.
Explicit caching	Long documents, reusable corpora, policies, and workloads where the app can declare cached content	More predictable reuse when supported by the provider.	Requires lifecycle management, invalidation, and provider-specific implementation.	You repeatedly ask questions against the same large context.
Application-level caching	Stable final answers, deterministic transformations, summaries, and offline processing	Works across providers and can eliminate model calls entirely.	Harder when answers depend on fresh data, user permissions, or changing model behavior.	The same input should produce the same output or can be safely reused.

Design for stable prefixes

Most cache wins come from keeping the repeated part of the prompt stable. Put system instructions, tool schemas, examples, and reusable context before highly variable user content when provider rules reward prefix reuse.

Avoid adding timestamps, random IDs, or per-user noise inside the reusable prefix.
Version long policies and tool schemas deliberately.
Separate stable context from fresh user messages.

Where caching pays off

Caching matters most when repeated input is large: long codebase context, enterprise policy docs, tool definitions, multi-agent instructions, or repeated evaluation prompts.

Measure cached input tokens separately from normal input tokens.
Track latency before and after caching on real traffic.
Combine caching with routing and batching for non-urgent workloads.

Common cache misses

Cache misses often come from tiny prompt changes that look harmless: dynamic dates, reordered tools, unstable JSON, different model IDs, or inserting user-specific metadata before the stable prefix.

Keep canonical prompt templates in source control.
Log cache hit rates where the provider exposes them.
Review prompt changes the same way you review code changes.

Decision Rules

A practical checklist

Use prompt caching when repeated context is large and stable.

Use application caching when the whole answer can be safely reused.

Do not redesign prompts only for caching if answer quality drops.

Measure cache hit rate, latency, cost, and quality together.

Related Guides

Continue the decision path

Read AI API cost guide

Connect caching to token cost and monthly product economics.

Open

Open API cost calculator

Estimate how cached input changes cost at scale.

Open

AI API cost calculator guide

Estimate monthly cost with cached and uncached input.

Open

AI batch API guide

Use async batch processing when latency is not required.

Open

Context window guide

Understand why repeated long context becomes expensive.

Open

Chinese Archive

Aligned deeper reading

AI prompt archive

Chinese prompt and workflow design materials.

Open

AI product manager archive

Chinese AI product and cost planning notes.

Open

Topic Hubs

Explore the wider search cluster

Topic hub

RAG and models

Plan RAG systems, local LLM deployment, model APIs, cloud AI platforms, vector databases, evaluation, observability, rate limits, and cost optimization.

Open

FAQ

Common questions

What is prompt caching?

Prompt caching lets a provider reuse computation for repeated prompt content, usually reducing latency and input-token cost when the repeated section is large enough and matches provider rules.

Does prompt caching always reduce cost?

No. It helps when prompts reuse large stable content and hit the cache. It helps less when every request has different prefixes or small inputs.

Should I use prompt caching or RAG?

Use RAG to retrieve the right evidence from a large corpus. Use prompt caching when the same large context or instructions are sent repeatedly.

Source Links

Primary references used for this guide

Reference

OpenAI prompt caching

Official OpenAI prompt caching guide.

Open

Reference

Anthropic prompt caching

Official Anthropic prompt caching documentation.

Open

Reference

Gemini context caching

Official Gemini API context caching documentation.

Open

Build your own evaluation note

The strongest decision is always local to your workflow. Save the vendor links, define a representative task, record the exact prompt or command, and compare the final evidence instead of the marketing claim.

Return to the AI learning map