Caching
Overview of the two types of caching available in doteb.
Caching
doteb supports two distinct kinds of caching, and they solve different problems. Pick the one that matches your workload — they can also be used together.
Provider / Model Caching
The provider performs the caching. When your request reuses a long prefix from a previous call (a system prompt, conversation history, tool definitions, a long document), the model serves that prefix from its prompt cache and bills it at a reduced rate. New input tokens and all output tokens are still billed at the normal rate — only the cached portion is discounted.
This is the type of caching that powers efficient chat-based and assistant-based interactions, including chat apps and coding tools (Cursor, Cline, Claude Code, etc.) where the same context is reused turn after turn.
You see it in your usage as prompt_tokens_details.cached_tokens. For most providers it works automatically; some (notably Anthropic) also let you mark blocks explicitly with cache_control and choose a longer TTL.
→ Read the Provider Cache Control docs
Gateway Caching
doteb performs the caching. When a request is byte-identical to a previous one (same model, same messages, same parameters), the response is served from the gateway's cache without any provider call. Repeated identical calls cost $0.
This is most useful for deterministic API workloads — classification, batch jobs, FAQ lookups, retries — rather than free-form chat, because chat prompts almost always differ on the latest turn.
→ Read the Gateway Caching docs
Which one do I want?
| If you… | Use |
|---|---|
| Build a chat app, assistant, or coding tool | Provider Cache Control |
| Send long system prompts or growing conversation history | Provider Cache Control |
| Want longer cache lifetimes than the provider default | Provider Cache Control (explicit cache_control) |
| Send the exact same request many times (batches, retries, FAQs) | Gateway Caching |
| Want $0 on repeated calls instead of a discount | Gateway Caching |
The two are not mutually exclusive. A coding tool can rely on provider caching for its long system prompt and enable gateway caching so that deterministic tool calls (e.g., file lookups) cost nothing on retry.
How is this guide?
Last updated on