Service Tiers

Trade latency against cost on supported OpenAI and Google models with Flex and Priority processing tiers.

Some OpenAI and Google models support selectable processing tiers that trade latency and availability against price. You pick one per request with the OpenAI-compatible service_tier parameter, and doteb forwards it only when the selected provider/model mapping supports that tier.

Tier      | service_tier          | Cost vs. standard | Latency / availability
Standard  | default / auto / omit | baseline          | Normal on-demand latency
Flex      | flex                  | −50%              | Best-effort; may be preempted under load
Priority  | priority              | varies by model   | Prioritized above standard and flex traffic

Using the service_tier parameter

curl -X POST "https://api.doteb.com/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google-vertex/gemini-2.5-pro",
    "service_tier": "priority",
    "messages": [
      { "role": "user", "content": "Summarize this incident report." }
    ]
  }'

Accepted values are flex, priority, and default/auto (standard). If you request flex or priority for a provider/model mapping that does not support that tier, the gateway returns a 400 unsupported_service_tier error and logs the request as a client error.
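
The request body from the curl example can also be built in code. The following is a minimal Python sketch, not an official client: `build_chat_payload` and `is_unsupported_tier_error` are hypothetical helper names, and the error check assumes only that the string unsupported_service_tier appears somewhere in the response body, since this guide does not specify the exact error JSON shape.

```python
import json

# Values this guide lists as accepted for service_tier.
ACCEPTED_TIERS = {"flex", "priority", "default", "auto"}


def build_chat_payload(model, prompt, service_tier=None):
    """Build the JSON body for POST /v1/chat/completions.

    Leaving service_tier as None omits the field, which selects
    standard processing.
    """
    if service_tier is not None and service_tier not in ACCEPTED_TIERS:
        raise ValueError(f"unknown service_tier: {service_tier!r}")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if service_tier is not None:
        payload["service_tier"] = service_tier
    return payload


def is_unsupported_tier_error(status_code, body_text):
    """Detect the gateway's 400 unsupported_service_tier response.

    Assumption: the error code string appears in the body text;
    the real error schema may differ.
    """
    return status_code == 400 and "unsupported_service_tier" in body_text


payload = build_chat_payload(
    "google-vertex/gemini-2.5-pro",
    "Summarize this incident report.",
    service_tier="priority",
)
print(json.dumps(payload, indent=2))
```

Validating the value client-side only catches misspelled tiers. Whether a given provider/model mapping supports the tier is still decided by the gateway, so callers should be prepared to handle the 400 response and retry without service_tier if standard processing is acceptable.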

Supported providers

Service tiers are explicit per provider/model mapping. Check the model page for the exact tiers exposed by each provider card.

  • OpenAI (openai) — sent as the OpenAI service_tier request field for supported OpenAI models. Flex is billed at 0.5x standard token prices and Priority uses the model-specific multiplier shown on the model page.

  • Google Vertex AI (google-vertex) — sent as the X-Vertex-AI-LLM-Shared-Request-Type request header. Flex and Priority are served only on the global endpoint, which is the gateway default. Google Flex PayGo applies a 0.5x multiplier; Google Priority PayGo applies a 1.8x multiplier.

  • Google AI Studio / Gemini API (google-ai-studio) — sent as a service_tier field in the request body for configured models that opt in.
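
The per-provider forwarding described above can be pictured with a small sketch. This is an illustration of the idea, not doteb's implementation: `forward_tier` is a hypothetical function, the header value is assumed to be the tier name itself, and the sketch assumes nothing is forwarded for standard requests.

```python
def forward_tier(provider, service_tier):
    """Return (extra_headers, extra_body_fields) for the upstream request."""
    # Assumption: standard requests forward nothing upstream.
    if service_tier in (None, "default", "auto"):
        return {}, {}
    if provider == "google-vertex":
        # Header name is from this guide; the value format is assumed.
        return {"X-Vertex-AI-LLM-Shared-Request-Type": service_tier}, {}
    if provider in ("openai", "google-ai-studio"):
        # Both providers take the tier as a request body field.
        return {}, {"service_tier": service_tier}
    raise ValueError(f"provider does not expose service tiers: {provider!r}")


headers, body_fields = forward_tier("google-vertex", "flex")
print(headers, body_fields)
```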

Tiers are supported on a subset of models, and the Flex and Priority subsets differ by provider. For example, Google Flex PayGo lists Gemini 3 image / Nano Banana models, but Google Priority PayGo does not; those configured image mappings are Flex-only.

Pricing uses multipliers

Service tiers do not define separate model prices in doteb. They multiply the provider mapping's standard token prices:

  • Standard / default / auto: 1x
  • Flex: 0.5x
  • Priority: model/provider-specific, shown on the model page

The multiplier scales per-token costs, including input, output, cached, and image tokens. Flat per-request and web-search fees are not tier-scaled.
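
As a worked example of the rule above, the sketch below scales only the per-token portion of a request's cost and adds flat fees unchanged. The function names and dollar amounts are hypothetical, and the Priority multiplier must be supplied by the caller because it is model-specific.

```python
from decimal import Decimal


def tier_multiplier(tier, priority_multiplier=None):
    """Map a tier name to its price multiplier."""
    if tier in (None, "default", "auto"):
        return Decimal("1")
    if tier == "flex":
        return Decimal("0.5")
    if tier == "priority":
        if priority_multiplier is None:
            raise ValueError("Priority multiplier is model-specific; pass it in")
        return Decimal(str(priority_multiplier))
    raise ValueError(f"unknown tier: {tier!r}")


def request_cost(token_cost, flat_fees, tier, priority_multiplier=None):
    """Scale per-token cost by the tier multiplier; flat fees are not scaled."""
    scaled = Decimal(str(token_cost)) * tier_multiplier(tier, priority_multiplier)
    return scaled + Decimal(str(flat_fees))


# Hypothetical request: $0.02 of token cost at standard prices
# plus a $0.01 flat web-search fee.
standard = request_cost("0.02", "0.01", "default")
flex = request_cost("0.02", "0.01", "flex")
priority = request_cost("0.02", "0.01", "priority", priority_multiplier="1.8")
print(standard, flex, priority)
```

With these inputs, Flex halves only the token portion: the total falls from 0.03 to 0.02, not to 0.015, because the flat fee is charged in full.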

Billing follows the served tier

When a provider reports the tier that was actually served, doteb bills that returned tier instead of blindly billing the requested value:

  • A priority request that runs as priority is billed at that mapping's Priority multiplier (for example, 1.8x for Google Priority PayGo).
  • A flex request that runs as flex is billed at 0.5x.
  • A request that is served as standard is billed at the standard 1x rate.

The served tier is read back from the provider response — Vertex reports it in usageMetadata.trafficType (ON_DEMAND_PRIORITY / ON_DEMAND_FLEX / ON_DEMAND), Google AI Studio reports it in the x-gemini-service-tier response header, and OpenAI can return service_tier in response payloads or stream events.
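
The read-back step can be sketched as a small normalizer. This is a hedged illustration: `served_tier` is a hypothetical function, the Vertex trafficType values come from this guide, and the value format of the x-gemini-service-tier header is an assumption (taken here to be the tier name). It returns None when the provider reports nothing, leaving the fallback choice to the caller.

```python
# trafficType values listed in this guide, mapped to tier names.
VERTEX_TRAFFIC_TYPES = {
    "ON_DEMAND_PRIORITY": "priority",
    "ON_DEMAND_FLEX": "flex",
    "ON_DEMAND": "standard",
}


def _normalize(value):
    """Lowercase a reported tier and treat 'default' as standard."""
    if value is None:
        return None
    value = value.strip().lower()
    return "standard" if value == "default" else value


def served_tier(provider, body, headers):
    """Return the tier the provider reports serving, or None if unreported."""
    if provider == "google-vertex":
        traffic = body.get("usageMetadata", {}).get("trafficType")
        return VERTEX_TRAFFIC_TYPES.get(traffic)
    if provider == "google-ai-studio":
        # HTTP header names are case-insensitive, so compare in lowercase.
        for name, value in headers.items():
            if name.lower() == "x-gemini-service-tier":
                return _normalize(value)  # value format is assumed
        return None
    if provider == "openai":
        return _normalize(body.get("service_tier"))
    return None


vertex_body = {"usageMetadata": {"trafficType": "ON_DEMAND_FLEX"}}
print(served_tier("google-vertex", vertex_body, {}))
```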

doteb rejects unsupported tier requests before provider routing. For example, gemini-3-pro-image-preview currently exposes Flex for Google AI Studio and Vertex, but not Priority.

You can see per-tier pricing for each model on its model page. Supported provider cards include a Service Tier selector in the card header and show the active multiplier next to each tier.
