Service Tiers
Trade latency against cost on supported OpenAI and Google models with Flex and Priority processing tiers.
Some OpenAI and Google models support selectable processing tiers that trade
latency and availability against price. You pick one per request with the
OpenAI-compatible service_tier parameter, and doteb forwards it only
when the selected provider/model mapping supports that tier.
| Tier | service_tier | Cost vs. standard | Latency / availability |
|---|---|---|---|
| Standard | default / auto / omit | baseline | Normal on-demand latency |
| Flex | flex | −50% | Best-effort; may be preempted under load |
| Priority | priority | varies by model | Prioritized above standard and flex traffic |
Using the service_tier parameter
curl -X POST "https://api.doteb.com/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google-vertex/gemini-2.5-pro",
    "service_tier": "priority",
    "messages": [
      { "role": "user", "content": "Summarize this incident report." }
    ]
  }'

Accepted values are flex, priority, and default/auto (standard). If you
request flex or priority for a provider/model mapping that does not support
that tier, the gateway returns a 400 unsupported_service_tier error and logs
the request as a client error.
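As a rough client-side illustration of the rules above, the sketch below builds the request body and recognizes the rejection. The helper names are hypothetical, and the shape of the error body (an `error` object carrying `code` or `type`) is an assumption; only the accepted values, the 400 status, and the `unsupported_service_tier` identifier come from this guide.

```python
from typing import Any, Dict, List, Optional

# Values this guide lists as accepted for service_tier.
ACCEPTED_TIERS = {"flex", "priority", "default", "auto"}


def build_chat_payload(
    model: str,
    messages: List[Dict[str, Any]],
    service_tier: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a chat-completions body, validating service_tier before sending."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if service_tier is None:
        return payload  # omitting the field selects the standard tier
    if service_tier not in ACCEPTED_TIERS:
        raise ValueError(f"unknown service_tier: {service_tier!r}")
    payload["service_tier"] = service_tier
    return payload


def is_unsupported_tier_error(status: int, body: Dict[str, Any]) -> bool:
    """Detect the gateway's rejection. The error body layout is assumed."""
    error = body.get("error")
    if status != 400 or not isinstance(error, dict):
        return False
    return "unsupported_service_tier" in (error.get("code"), error.get("type"))
```

A caller could use `is_unsupported_tier_error` to decide whether to retry the same request without `service_tier`, which falls back to standard processing.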
Supported providers
Service tiers are explicit per provider/model mapping. Check the model page for the exact tiers exposed by each provider card.
- OpenAI (openai): sent as the OpenAI service_tier request field for supported OpenAI models. Flex is billed at 0.5x standard token prices and Priority uses the model-specific multiplier shown on the model page.
- Google Vertex AI (google-vertex): sent as the X-Vertex-AI-LLM-Shared-Request-Type request header. Flex and Priority are served only on the global endpoint, which is the gateway default. Google Flex PayGo applies a 0.5x multiplier; Google Priority PayGo applies a 1.8x multiplier.
- Google AI Studio / Gemini API (google-ai-studio): sent as a service_tier field in the request body for configured models that opt in.
Tiers are supported on a subset of models, and the Flex and Priority subsets differ by provider. For example, Google Flex PayGo lists Gemini 3 image / Nano Banana models, but Google Priority PayGo does not; those configured image mappings are Flex-only.
Pricing uses multipliers
Service tiers do not define separate model prices in doteb. They multiply the provider mapping's standard token prices:
- Standard / default / auto: 1x
- Flex: 0.5x
- Priority: model/provider-specific, shown on the model page
The multiplier scales per-token costs, including input, output, cached, and image tokens. Flat per-request and web-search fees are not tier-scaled.
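The arithmetic can be sketched as follows. This is a minimal illustration, not doteb's billing code: the `Usage` and `Prices` types, the per-million-token price unit, and the example prices are all assumptions made for the sketch. It shows the one rule stated above, that the multiplier applies to token costs while flat fees pass through unscaled.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int       # uncached input tokens (assumed to exclude cached ones)
    output_tokens: int
    cached_tokens: int = 0


@dataclass
class Prices:
    """Standard-tier prices in USD per 1M tokens (hypothetical unit and values)."""
    input: float
    output: float
    cached: float = 0.0
    per_request_fee: float = 0.0  # flat fee, never tier-scaled


def request_cost(usage: Usage, prices: Prices, multiplier: float) -> float:
    """Scale per-token costs by the tier multiplier, then add flat fees as-is."""
    token_cost = (
        usage.input_tokens * prices.input
        + usage.output_tokens * prices.output
        + usage.cached_tokens * prices.cached
    ) / 1_000_000
    return token_cost * multiplier + prices.per_request_fee


usage = Usage(input_tokens=1_000_000, output_tokens=500_000)
prices = Prices(input=2.0, output=8.0, per_request_fee=0.01)
standard = request_cost(usage, prices, 1.0)  # 6.00 in tokens + 0.01 flat fee
flex = request_cost(usage, prices, 0.5)      # 3.00 in tokens + 0.01 flat fee
```

With these made-up prices, Flex halves the token portion of the bill but leaves the 0.01 flat fee untouched.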
Billing follows the served tier
When a provider reports the tier that was actually served, doteb bills that returned tier instead of blindly billing the requested value:
- A priority request that runs as priority is billed at that mapping's Priority multiplier (for example, 1.8x on Google Priority PayGo).
- A flex request that runs as flex is billed at 0.5x.
- A request that is served as standard is billed at the standard 1x rate.
The served tier is read back from the provider response — Vertex reports it in
usageMetadata.trafficType (ON_DEMAND_PRIORITY / ON_DEMAND_FLEX /
ON_DEMAND), Google AI Studio reports it in the x-gemini-service-tier
response header, and OpenAI can return service_tier in response payloads or
stream events.
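One way to picture this read-back is a small normalizer over the three response shapes. The sketch below is an assumption-laden illustration: the function name is hypothetical, the Vertex `trafficType` values come from this guide, and the literal values carried by the `x-gemini-service-tier` header and the OpenAI `service_tier` field are assumed to match the request-side names.

```python
from typing import Any, Mapping, Optional

# Vertex trafficType values as listed in this guide.
VERTEX_TRAFFIC_TYPES = {
    "ON_DEMAND_PRIORITY": "priority",
    "ON_DEMAND_FLEX": "flex",
    "ON_DEMAND": "standard",
}


def served_tier(
    provider: str,
    body: Mapping[str, Any],
    headers: Mapping[str, str],
) -> Optional[str]:
    """Return 'priority', 'flex', or 'standard', or None if nothing was reported."""
    if provider == "google-vertex":
        traffic = (body.get("usageMetadata") or {}).get("trafficType")
        return VERTEX_TRAFFIC_TYPES.get(traffic)
    if provider == "google-ai-studio":
        # Header names are case-insensitive, so compare on lowercased keys.
        lowered = {key.lower(): value for key, value in headers.items()}
        raw = lowered.get("x-gemini-service-tier")
    elif provider == "openai":
        raw = body.get("service_tier")
    else:
        return None
    if raw is None:
        return None
    value = str(raw).lower()
    if value in ("flex", "priority"):
        return value
    if value in ("default", "standard"):
        return "standard"  # assumed spelling of the standard tier
    return None
```

Returning `None` for missing or unrecognized values keeps the "provider did not report a tier" case explicit, so the caller decides what to bill in that situation.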
doteb rejects unsupported tier requests before provider routing. For
example, gemini-3-pro-image-preview currently exposes Flex for Google AI
Studio and Vertex, but not Priority.
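The pre-routing check can be pictured as a lookup against a per-mapping support table. The sketch below is illustrative only: the table contains just the example mapping named above, and the function name and error body layout are hypothetical rather than doteb's actual implementation.

```python
from typing import Any, Dict, Optional, Set, Tuple

# Tiers exposed per (provider, model) mapping. Only the example from this
# guide is listed; a real table would come from the model configuration.
SUPPORTED_TIERS: Dict[Tuple[str, str], Set[str]] = {
    ("google-ai-studio", "gemini-3-pro-image-preview"): {"flex"},
    ("google-vertex", "gemini-3-pro-image-preview"): {"flex"},
}

# Values that select standard processing and therefore always pass.
STANDARD_VALUES = {None, "default", "auto"}


def check_tier(
    provider: str, model: str, service_tier: Optional[str]
) -> Optional[Dict[str, Any]]:
    """Return None when the tier is allowed, else a 400-style error description."""
    if service_tier in STANDARD_VALUES:
        return None
    if service_tier in SUPPORTED_TIERS.get((provider, model), set()):
        return None
    return {
        "status": 400,
        "error": {  # body layout is an assumption for this sketch
            "code": "unsupported_service_tier",
            "message": f"{provider}/{model} does not support tier {service_tier!r}",
        },
    }
```

Under this table, a flex request for `gemini-3-pro-image-preview` passes on both Google providers, while a priority request for the same model is rejected before any provider call is made.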
You can see per-tier pricing for each model on its model page. Supported provider cards include a Service Tier selector in the card header and show the active multiplier next to each tier.