Guozhen AI: Global AI field notes and model intelligence

AI economics

Prompt caching guide: reduce LLM latency and token cost without changing the product

Learn when prompt caching helps, how OpenAI, Anthropic, and Gemini caching differ, and how to design prompts, RAG context, and agent workflows for cache hits.

Updated 2026-06-11 · 8 min read · Intermediate

Best for

  • Teams sending repeated system prompts, tool schemas, or policy text
  • RAG apps that reuse large instruction and document prefixes
  • Coding agents that keep stable repository context across many calls
  • Product teams reducing latency without lowering answer quality

Not for

  • One-off prompts with little repeated context
  • Workflows where user content changes the whole prompt prefix every time
  • Cost planning without checking current provider pricing and cache rules

Comparison

Choose by workflow, not brand

| Option | Best for | Strengths | Tradeoffs | Use when |
|---|---|---|---|---|
| Automatic caching | Apps where the provider detects repeated prompt prefixes without application-managed cache objects | Minimal code change and easy to adopt. | Less explicit control over cache lifetime and hit behavior. | You have stable prompt prefixes and want lower latency or cached-token pricing with low engineering overhead. |
| Explicit caching | Long documents, reusable corpora, policies, and workloads where the app can declare cached content | More predictable reuse when supported by the provider. | Requires lifecycle management, invalidation, and provider-specific implementation. | You repeatedly ask questions against the same large context. |
| Application-level caching | Stable final answers, deterministic transformations, summaries, and offline processing | Works across providers and can eliminate model calls entirely. | Harder when answers depend on fresh data, user permissions, or changing model behavior. | The same input should produce the same output or can be safely reused. |
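
As an illustration of the application-level caching option, here is a minimal sketch that stores final answers keyed by a hash of the model ID and the prompt. The names `cache_key`, `cached_answer`, and `fake_model` are hypothetical, the in-memory dictionary stands in for whatever store you actually use, and the sketch assumes the answer does not depend on fresh data or user permissions.

```python
import hashlib
import json
from typing import Callable, Dict, List

# In-memory store for illustration; a real deployment might use a shared store.
_answer_cache: Dict[str, str] = {}


def cache_key(model_id: str, prompt: str) -> str:
    """Derive a stable key from the inputs that can change the answer."""
    payload = json.dumps({"model": model_id, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_answer(
    model_id: str, prompt: str, call_model: Callable[[str, str], str]
) -> str:
    """Return a stored answer if present; otherwise call the model once and store it."""
    key = cache_key(model_id, prompt)
    if key not in _answer_cache:
        _answer_cache[key] = call_model(model_id, prompt)
    return _answer_cache[key]


# Usage with a stand-in model function (hypothetical, no network call).
calls: List[str] = []


def fake_model(model_id: str, prompt: str) -> str:
    calls.append(prompt)
    return f"summary of: {prompt}"


first = cached_answer("model-a", "Summarize policy v3", fake_model)
second = cached_answer("model-a", "Summarize policy v3", fake_model)
```

Including the model ID in the key means a model upgrade does not silently serve answers produced by the old model.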

Design for stable prefixes

Most cache wins come from keeping the repeated part of the prompt stable. Prefix-based caches match from the start of the prompt, so the first difference ends the reusable portion. When provider rules reward prefix reuse, put system instructions, tool schemas, examples, and reusable context before highly variable user content.

  • Avoid adding timestamps, random IDs, or per-user noise inside the reusable prefix.
  • Version long policies and tool schemas deliberately.
  • Separate stable context from fresh user messages.
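
The ordering advice above can be sketched as follows. This is an illustration under stated assumptions: the constants and the functions `build_prompt`, `build_prompt_unstable`, and `shared_prefix_length` are hypothetical, and the comparison counts characters, whereas providers match on tokens under their own rules.

```python
from datetime import datetime, timezone

# Hypothetical stable content; in practice this would be versioned in source control.
SYSTEM_INSTRUCTIONS = "You are a support assistant. Follow policy v3."
TOOL_SCHEMAS = '[{"name": "lookup_order", "parameters": {"order_id": "string"}}]'


def build_prompt(user_message: str, now: datetime) -> str:
    """Stable content first, per-request content last."""
    stable_prefix = f"{SYSTEM_INSTRUCTIONS}\n\nTools:\n{TOOL_SCHEMAS}\n\n"
    variable_suffix = f"Current time: {now.isoformat()}\nUser: {user_message}"
    return stable_prefix + variable_suffix


def build_prompt_unstable(user_message: str, now: datetime) -> str:
    """Anti-pattern: a timestamp at the top changes the prefix on every request."""
    return (
        f"Current time: {now.isoformat()}\n{SYSTEM_INSTRUCTIONS}\n\n"
        f"Tools:\n{TOOL_SCHEMAS}\n\nUser: {user_message}"
    )


def shared_prefix_length(a: str, b: str) -> int:
    """Count leading characters that two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


t1 = datetime(2026, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 1, 1, 9, 0, 5, tzinfo=timezone.utc)

good_shared = shared_prefix_length(
    build_prompt("Where is my order?", t1), build_prompt("Cancel my order", t2)
)
bad_shared = shared_prefix_length(
    build_prompt_unstable("Where is my order?", t1),
    build_prompt_unstable("Cancel my order", t2),
)
```

With the stable layout, the shared prefix covers the full instructions and tool schemas; with the timestamp first, the two prompts diverge inside the first line.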

Where caching pays off

Caching matters most when repeated input is large: long codebase context, enterprise policy docs, tool definitions, multi-agent instructions, or repeated evaluation prompts.

  • Measure cached input tokens separately from normal input tokens.
  • Track latency before and after caching on real traffic.
  • Combine caching with routing and batching for non-urgent workloads.
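
To make the measurement advice concrete, here is a small sketch that tracks cached input tokens separately and estimates input cost. The `Usage` fields and both prices are hypothetical placeholders: providers report cached tokens under different field names, and the prices must come from the provider's current price list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Usage:
    input_tokens: int         # total input tokens for one request
    cached_input_tokens: int  # portion of input_tokens served from the cache


def cache_hit_rate(usages: List[Usage]) -> float:
    """Share of all input tokens that were served from the cache."""
    total = sum(u.input_tokens for u in usages)
    cached = sum(u.cached_input_tokens for u in usages)
    return cached / total if total else 0.0


def input_cost(
    usages: List[Usage], price_per_mtok: float, cached_price_per_mtok: float
) -> float:
    """Input cost with cached and uncached tokens priced separately."""
    cached = sum(u.cached_input_tokens for u in usages)
    uncached = sum(u.input_tokens - u.cached_input_tokens for u in usages)
    return (uncached * price_per_mtok + cached * cached_price_per_mtok) / 1_000_000


# Three requests sharing an 8,000-token prefix; the first one misses the cache.
usages = [Usage(10_000, 0), Usage(10_000, 8_000), Usage(10_000, 8_000)]

hit_rate = cache_hit_rate(usages)
# Hypothetical prices per million tokens: 2.00 uncached, 0.50 cached.
with_cache = input_cost(usages, price_per_mtok=2.00, cached_price_per_mtok=0.50)
without_cache = input_cost(usages, price_per_mtok=2.00, cached_price_per_mtok=2.00)
```

Comparing `with_cache` against `without_cache` on logged production usage, rather than on a synthetic example like this one, shows whether the cache is paying off on real traffic.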

Common cache misses

Cache misses often come from tiny prompt changes that look harmless: dynamic dates, reordered tools, JSON whose key order or whitespace varies between requests, a different model ID, or user-specific metadata inserted before the stable prefix.

  • Keep canonical prompt templates in source control.
  • Log cache hit rates where the provider exposes them.
  • Review prompt changes the same way you review code changes.
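
One way to guard against these misses is to serialize tool schemas canonically and log a fingerprint of the stable prefix, so an unexpected change shows up in logs or review. The sketch below is an assumption-laden illustration: `canonical_tools` and `prefix_fingerprint` are hypothetical helpers, and sorting tools by name assumes tool order does not matter to your application.

```python
import hashlib
import json
from typing import Any, Dict, List


def canonical_tools(tools: List[Dict[str, Any]]) -> str:
    """Serialize tool schemas with fixed tool order, key order, and whitespace."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def prefix_fingerprint(
    model_id: str, system_prompt: str, tools: List[Dict[str, Any]]
) -> str:
    """Short hash of everything that makes up the reusable prefix."""
    material = "\n".join([model_id, system_prompt, canonical_tools(tools)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


# The same two tools, listed in a different order and with different key order.
tools_a = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "cancel_order", "parameters": {"order_id": "string", "reason": "string"}},
]
tools_b = [
    {"parameters": {"reason": "string", "order_id": "string"}, "name": "cancel_order"},
    {"parameters": {"order_id": "string"}, "name": "lookup_order"},
]

fp_a = prefix_fingerprint("model-a", "Follow policy v3.", tools_a)
fp_b = prefix_fingerprint("model-a", "Follow policy v3.", tools_b)
fp_other_model = prefix_fingerprint("model-b", "Follow policy v3.", tools_a)
```

Logging the fingerprint next to the provider's cache metrics makes it easier to tie a drop in hit rate to a specific prompt or schema change.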

Decision Rules

A practical checklist

01. Use prompt caching when repeated context is large and stable.
02. Use application caching when the whole answer can be safely reused.
03. Do not redesign prompts only for caching if answer quality drops.
04. Measure cache hit rate, latency, cost, and quality together.

FAQ

Common questions

What is prompt caching?

Prompt caching lets a provider reuse computation for repeated prompt content, usually reducing latency and input-token cost when the repeated section is large enough and matches provider rules.

Does prompt caching always reduce cost?

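No. It helps when prompts reuse large, stable content and actually hit the cache. It helps little when every request starts with a different prefix or when inputs are small, and some providers bill cache writes or cache storage separately, so check current pricing before assuming savings.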
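
A rough break-even check can make this answer concrete. The sketch below assumes a pricing model in which the first request writes the prefix to the cache at one price and later requests read it at a lower price. The function name `caching_savings` and all three prices are hypothetical, so substitute your provider's current figures.

```python
def caching_savings(
    prefix_tokens: int,
    reuses: int,
    base_price_per_mtok: float,
    cache_write_price_per_mtok: float,
    cache_read_price_per_mtok: float,
) -> float:
    """Estimated savings on the shared prefix for one write followed by `reuses` reads.

    A negative result means caching cost more than sending the prefix uncached.
    """
    requests = 1 + reuses
    without_cache = requests * prefix_tokens * base_price_per_mtok
    with_cache = prefix_tokens * (
        cache_write_price_per_mtok + reuses * cache_read_price_per_mtok
    )
    return (without_cache - with_cache) / 1_000_000


# Hypothetical prices per million tokens: 1.00 base, 1.25 cache write, 0.10 cache read.
never_reused = caching_savings(10_000, 0, 1.00, 1.25, 0.10)
reused_once = caching_savings(10_000, 1, 1.00, 1.25, 0.10)
```

Under these assumed prices, a prefix that is written but never reused costs slightly more than no caching, while a single reuse already recovers the write premium.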

Should I use prompt caching or RAG?

Use RAG to retrieve the right evidence from a large corpus. Use prompt caching when the same large context or instructions are sent repeatedly.

Build your own evaluation note

The strongest decision is always local to your workflow. Save the vendor links, define a representative task, record the exact prompt or command, and compare the final evidence instead of the marketing claim.
