Provider Cache Control
Use provider-side prompt caching to reduce the cost of long, reused prompts in chat and coding tools.
Provider Cache Control
Most modern LLM providers offer prompt caching: when a request reuses a long prefix from a previous request (for example, a multi-thousand-token system prompt or a growing conversation history), the provider stores that prefix and serves it back at a steep discount on subsequent calls. Only the cached portion is discounted — new input tokens and all output tokens are still billed at the normal rate.
This is the behavior you see surfaced as cached_tokens in your usage payloads, and it is what makes chat apps, assistants, and coding tools (Cursor, Cline, Claude Code, etc.) economically viable on long contexts.
Looking for $0 on repeated calls instead of a discount on the cached portion? That is Gateway Caching, which serves byte-identical requests entirely from doteb without hitting the provider. It is a better fit for deterministic API workloads than for chat. See the Caching Overview for a side-by-side comparison.
Automatic caching
For most users, prompt caching just works — you do not need to change your request payloads.
Providers including OpenAI, Anthropic (when prompts cross the provider's minimum size), Google, DeepSeek, xAI, and Alibaba inspect incoming requests for shared prefixes and cache them automatically. doteb forwards the provider's cache metadata back to you in the response, and bills the cached portion at the model's cached_input rate.
For Anthropic and AWS Bedrock Claude, prompt caching is strictly opt-in via cache_control / cachePoint markers on the request body. To get automatic cache benefits without rewriting your requests, doteb injects those markers for you on long system and user messages by default. If you send long prompts sporadically — with gaps wider than the 5-minute TTL — you may want to disable this entirely, since you would otherwise pay the cache-write premium (1.25× input for 5m, 2× for 1h) without ever benefiting from a cache read.
To disable, open Project Settings → Caching → Provider Cache Writes and turn off "Allow provider cache writes". When disabled, the gateway strips all cache_control markers from outgoing requests for the project — both the ones it adds automatically and any markers your client sends. This covers callers that always emit markers regardless of the user's request cadence (e.g. Claude Code, Cursor, Cline). The change takes up to 5 minutes to take effect due to the project-settings cache.
To take advantage of automatic caching:
- Put stable content (system prompt, instructions, tool definitions, long documents) at the start of your messages
- Keep the variable portion (the latest user turn) at the end
- Reuse the same prefix across requests — even minor changes invalidate the cache
You can confirm the cache is working by inspecting usage.prompt_tokens_details.cached_tokens on the response. See Cost Breakdown for the full list of usage fields.
{
"usage": {
"prompt_tokens": 8200,
"completion_tokens": 150,
"prompt_tokens_details": {
"cached_tokens": 8000
},
"cost_details": {
"input_cost": 0.0006,
"cached_input_cost": 0.0008
}
}
}In this example, 8,000 of the 8,200 prompt tokens were served from the provider's cache and billed at the cached rate.
Pricing and routing
Cached input tokens are billed at the model's published cached_input price (typically 10–25% of the regular input price, depending on the provider and model). Output tokens and any non-cached input tokens are billed at the normal rate.
When the Smart Routing algorithm selects a provider for a large prompt (≥ 5,000 estimated tokens), it gives extra weight to providers that advertise cache support, since caching can substantially reduce the cost of repeated large prompts.
Explicit caching with cache_control
Some providers — most notably Anthropic — also support explicit cache control, where you mark specific content blocks as cacheable using a cache_control field. This gives you precise control over what gets cached and lets you opt into longer cache lifetimes than the default.
Explicit caching is provider-specific. Supported providers and TTLs at the time of writing:
| Provider | Models | Supported TTLs |
|---|---|---|
| Anthropic (Claude) | All Claude models | 5m (default), 1h |
| AWS Bedrock (Claude) | All Claude models | 5m (default), 1h |
| Alibaba (Qwen) | Qwen models with cache support | Provider-defined |
To mark content as cacheable, send the message content as an array of blocks and add a cache_control field to the block you want to cache:
{
"model": "claude-haiku-4-5",
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a helpful assistant. <long instructions...>",
"cache_control": { "type": "ephemeral", "ttl": "1h" }
}
]
},
{
"role": "user",
"content": "What is the capital of France?"
}
]
}Use ttl: "5m" (the default if omitted) for short-lived caches that match a single user's session, and ttl: "1h" when the same prefix will be reused over a longer window (for example, a coding agent that keeps the same project context warm across many requests).
Mixing explicit markers with automatic injection
Anthropic requires cache breakpoints with longer TTLs to appear before shorter ones (blocks are processed in the order tools, system, messages). The markers doteb injects automatically use the default 5-minute TTL, so they could never legally precede an explicit ttl: "1h" marker in your messages. To keep both features compatible:
- When your request contains an explicit
ttl: "1h"marker in the messages, doteb skips its automatic marker injection for that request entirely and forwards only your markers — the same behavior you would get calling the provider directly. - A
ttl: "1h"marker only on the system prompt does not disable automatic injection, since 5-minute breakpoints after it still satisfy the ordering rule. - Explicit markers that use the default 5-minute TTL coexist with automatic injection (capped at 4 breakpoints total per Anthropic's limit).
Cache writes are billed at a premium (typically 1.25x for 5m and 2x for 1h on Anthropic) the first time a cached block is created. After that, cache reads cost roughly 10% of the regular input price. The break-even point is usually one or two reuses — explicit caching is worth it whenever a marked block will be sent more than once within its TTL.
Anthropic returns a per-TTL breakdown of cache writes when you mix 5m and 1h blocks:
{
"usage": {
"cache_creation": {
"ephemeral_5m_input_tokens": 0,
"ephemeral_1h_input_tokens": 8000
},
"cache_read_input_tokens": 0
}
}For providers that publish a separate explicit-cache read rate (for example, Alibaba Qwen charges 10% for explicit cache reads vs. 20% for automatic cache reads), doteb detects the cache_control markers on your request and applies the explicit rate automatically.
Related
- Gateway Caching — serve byte-identical requests entirely from doteb at $0 cost
- Caching Overview — side-by-side comparison of provider caching vs. gateway caching
- Cost Breakdown — full reference for the usage and cost fields on every response
- Smart Routing — how cache support influences provider selection for large prompts
How is this guide?
Last updated on