Prompt Caching
Prompt caching is a feature where an LLM provider stores and reuses the repeated prefix of a prompt (system prompt, conversation history, a long document) across multiple requests. Instead of reprocessing the same tokens every time, the model restores their processed state from a cache, cutting cost and latency dramatically. Anthropic introduced it for Claude in 2024, OpenAI and Google followed, and by 2026 it was a standard LLM API feature.
Why It Matters
RAG pipelines and agents inject long system prompts, conversation history, and retrieved documents into every request. Ten repeats means ten full-priced computations. Anthropic's documentation reports up to 90% cost reduction and 85% latency reduction on the cached portion. Production AI apps have fundamentally restructured their economics around prompt caching.
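The economics are easy to check with back-of-the-envelope arithmetic. The sketch below uses assumed prices (not any provider's official rate card): a base input rate, a 1.25x surcharge for cache writes, and a 0.10x rate for cache reads on the cached portion.

```python
# Assumed rates for illustration only (not official pricing):
# input $3 per 1M tokens, cache write 1.25x input, cache read 0.10x input.
INPUT = 3.00 / 1_000_000
CACHE_WRITE = 1.25 * INPUT
CACHE_READ = 0.10 * INPUT

def cost_without_cache(prefix_tokens: int, suffix_tokens: int, n_requests: int) -> float:
    """Every request reprocesses the full prompt at the input rate."""
    return n_requests * (prefix_tokens + suffix_tokens) * INPUT

def cost_with_cache(prefix_tokens: int, suffix_tokens: int, n_requests: int) -> float:
    """First request writes the prefix to cache; later requests read it."""
    first = prefix_tokens * CACHE_WRITE + suffix_tokens * INPUT
    rest = (n_requests - 1) * (prefix_tokens * CACHE_READ + suffix_tokens * INPUT)
    return first + rest

# 50k-token cached document, 200-token question, 10 requests:
baseline = cost_without_cache(50_000, 200, 10)
cached = cost_with_cache(50_000, 200, 10)
print(f"savings: {1 - cached / baseline:.0%}")  # → savings: 78%
```

Even with the write surcharge, ten requests against a 50k-token document cost roughly a fifth of the uncached baseline under these assumptions, which is why long-prefix workloads restructure around caching.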
How It Works
- Mark cacheable sections: The developer explicitly marks which parts of the prompt are safe to cache (Anthropic uses `cache_control` blocks; OpenAI caches automatically).
- First request (cache write): The model processes the full prompt and stores the marked section in the cache. This request actually costs slightly more due to cache-write overhead.
- Subsequent requests (cache read): When a request with the same prefix arrives, the model loads the internal state from cache. Those tokens bill at roughly 10% of the input price.
- Cache TTL: Caches typically live ~5 minutes (Anthropic) or longer and are evicted automatically if no request refreshes them.
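The steps above can be sketched as a request payload in Anthropic's `cache_control` style. No network call is made here; the point is only where the cache marker goes. The model id and prompt text are placeholder assumptions.

```python
# Sketch of an Anthropic-style Messages API request body that marks a
# long, stable system prompt as cacheable. We only build the JSON
# payload; sending it would require the real SDK and an API key.
import json

LONG_SYSTEM_PROMPT = "You are a support agent. Follow policy." * 200  # stands in for 1,024+ tokens

payload = {
    "model": "claude-sonnet-4-20250514",  # example model id, assumed
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block becomes the
            # cached prefix; on a hit it bills at roughly 10% of input.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        # Only this part changes between requests, so only it is
        # reprocessed at full price.
        {"role": "user", "content": "How do I reset my password?"}
    ],
}

print(json.dumps(payload)[:60])
```

The first request with this payload pays the cache-write surcharge; any request within the TTL that repeats the same system block token-for-token reads it from cache.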
When to Use It
Chatbot system prompts: Caching thousands of tokens of role, constraints, and examples instead of reprocessing them every turn.
Long-document QA: Stuffing a book, PDF, or manual into context and asking many questions. The document caches; only the question changes.
Agent tool definitions: Cache thousands of tokens of tool schemas so each tool call has lower latency.
Code assistants: Loading an entire project codebase into context for many follow-up questions.
RAG pipelines: Cache the frequently retrieved fixed documents to save cost on repeat queries.
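For the agent case, Anthropic's convention is to place the cache marker on the last tool definition, which marks everything up to that point as the cacheable prefix. The tools below are hypothetical examples, not a real schema.

```python
# Sketch: caching agent tool schemas. Putting cache_control on the
# final tool marks the whole tools array as the cached prefix, so only
# the conversation below it is reprocessed on each tool-use turn.
tools = [
    {
        "name": "search_orders",  # hypothetical tool
        "description": "Look up a customer's orders by email.",
        "input_schema": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
    {
        "name": "issue_refund",  # hypothetical tool
        "description": "Refund an order by its id.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
        "cache_control": {"type": "ephemeral"},  # prefix boundary
    },
]

assert "cache_control" in tools[-1] and "cache_control" not in tools[0]
```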
Caveats
Exact match: The cached prefix must match token-for-token. Injecting variable data like dates or user IDs into the system prompt breaks the cache. Move variable parts after the cached region.
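One way to honor the exact-match rule is to build the prompt so that everything variable rides in the user message, leaving the system block byte-identical across users. A minimal sketch, with hypothetical helper and field names:

```python
# Sketch: keep variable data (user id, date) out of the cached prefix.
# Two requests for different users share a byte-identical system block,
# so the second can hit the cache.
STABLE_SYSTEM = "You are a billing assistant. Apply the refund policy strictly."

def build_prompt(user_id: str, today: str, question: str):
    """Hypothetical helper: stable cacheable system block + variable user turn."""
    system = [{
        "type": "text",
        "text": STABLE_SYSTEM,  # no user id, no date -> identical every time
        "cache_control": {"type": "ephemeral"},
    }]
    messages = [{
        "role": "user",
        # Variable data lives after the cached region.
        "content": f"[user:{user_id} date:{today}] {question}",
    }]
    return system, messages

sys_a, _ = build_prompt("u-123", "2026-01-05", "Where is my refund?")
sys_b, _ = build_prompt("u-456", "2026-01-06", "Cancel my plan.")
assert sys_a == sys_b  # identical prefix -> cache hit is possible
```

Had the date or user id been interpolated into `STABLE_SYSTEM`, every request would produce a different prefix and every request would pay the full cache-write price.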
Minimum cache size: Anthropic requires at least 1,024 tokens (Sonnet/Opus) to cache. Short prompts gain nothing.
TTL management: A request must arrive within the TTL window for a cache hit. Low-traffic services need to "keep the cache warm" via periodic heartbeat requests.
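Keep-warm logic can be as simple as a timer check: if no real traffic has refreshed the cache within the TTL, fire a cheap heartbeat request that reuses the cached prefix. A sketch assuming a ~5-minute TTL and a safety margin I chose arbitrarily:

```python
# Sketch of keep-warm scheduling for a low-traffic service, assuming a
# ~5-minute cache TTL. A heartbeat (e.g. a tiny request reusing the
# cached prefix) is due whenever the cache would otherwise expire.
TTL_SECONDS = 5 * 60
SAFETY_MARGIN = 30  # assumed: fire early to absorb network jitter

def heartbeat_due(last_request_ts: float, now: float) -> bool:
    """True if no request has refreshed the cache recently enough."""
    return now - last_request_ts >= TTL_SECONDS - SAFETY_MARGIN

assert not heartbeat_due(1000.0, 1100.0)  # 100s elapsed: still warm
assert heartbeat_due(1000.0, 1280.0)      # 280s >= 270s: ping now
```

Whether heartbeats pay off depends on traffic: each ping costs a cache read, which is only worth it if a real request is likely to arrive before the next expiry.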
Write overhead: The first request costs slightly more. Without reuse, you lose money.
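The break-even point falls out of the same assumed multipliers used earlier (1.25x write, 0.10x read, per cached-prefix token; not official pricing):

```python
# Break-even sketch: with an assumed 1.25x write surcharge and 0.10x
# read rate on the cached prefix, how many requests until caching is
# cheaper than paying 1.0x per request?
WRITE_MULT, READ_MULT = 1.25, 0.10

def breakeven_requests() -> int:
    """Smallest n where cached cost drops below uncached cost."""
    n = 1
    while WRITE_MULT + (n - 1) * READ_MULT >= n * 1.0:
        n += 1
    return n

print(breakeven_requests())  # → 2
```

Under these assumptions caching already wins on the second request; the loss case is a prefix that is written once and never read again before its TTL expires.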
2026 Evolution
Longer caches: Some providers now offer TTLs of hours to days, helpful for enterprise agents and always-on chatbots.
Per-user caching: Personalized system prompts cached per user.
Hybrid RAG: Caching frequently retrieved chunks to skip vector search on repeat queries.
GEO Implications
For an AI search engine to reuse blog content across many queries, the content must be in a "cache-friendly, stable form." Frequent URL changes or dynamic personalization inside the page break the cache. Blogs that serve structured Markdown, stable URLs, and static generation are more likely to be reused as cost-efficient sources by AI search infrastructure.