When your localization engine translates text, part of the prompt it sends to the LLM is identical on every request, and part of it changes from one request to the next. Prompt caching lets the engine reuse the stable part instead of paying to process it again every time. Those reused tokens show up in your usage as cache tokens, and they cost a fraction of normal input tokens.
How a translation prompt is built
Every request the engine sends to a model is assembled from layers. Some layers are stable across all requests for the same engine and locale; the rest are dynamic and change per request.
| Layer | Stable or dynamic | Cached |
|---|---|---|
| System prompt - engine identity, localization rules, grammar | Stable across every engine | Yes |
| Your instructions and brand voice, per locale | Stable until you edit the engine | Yes |
| Glossary terms retrieved for this specific request | Dynamic - varies per request | No |
| The text to translate | Dynamic | No |
The stable layers form a contiguous prefix at the front of the prompt. The engine marks the end of that prefix as a cache breakpoint: everything before it can be cached and reused, and everything after it - the per-request glossary, examples, and your input text - is sent fresh on each request.
Why the glossary isn't cached
The glossary is retrieved per request based on the exact text you're translating, so it changes from one request to the next. Keeping it after the cache breakpoint means the rest of the prompt stays reusable no matter which glossary terms a given request pulls in.
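The layering can be illustrated with a minimal sketch. It is not the engine's actual implementation: `build_prompt`, `PromptParts`, and `prefix_key` are hypothetical names, the prompt strings are made up, and hashing the prefix is only a stand-in for however a provider recognizes a repeated prefix. It shows that two requests with different glossary terms and input text still share an identical stable prefix.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptParts:
    stable_prefix: str   # cacheable: identical for every request to this engine and locale
    dynamic_suffix: str  # sent fresh: changes from one request to the next


def build_prompt(system_prompt: str, locale_instructions: str,
                 glossary_terms: list[str], text: str) -> PromptParts:
    # Stable layers go first so they form one contiguous prefix.
    stable_prefix = "\n\n".join([system_prompt, locale_instructions])
    # The cache breakpoint sits here; everything below is per-request.
    glossary = "\n".join(f"- {term}" for term in glossary_terms)
    dynamic_suffix = f"Glossary:\n{glossary}\n\nText:\n{text}"
    return PromptParts(stable_prefix, dynamic_suffix)


def prefix_key(parts: PromptParts) -> str:
    # Hypothetical stand-in for how a provider might identify a repeated prefix.
    return hashlib.sha256(parts.stable_prefix.encode("utf-8")).hexdigest()


a = build_prompt("You are a localization engine.", "Locale de-DE: formal tone.",
                 ["checkout = Kasse"], "Proceed to checkout")
b = build_prompt("You are a localization engine.", "Locale de-DE: formal tone.",
                 ["invoice = Rechnung"], "Download your invoice")

print(prefix_key(a) == prefix_key(b))        # True: the prefix is reusable
print(a.dynamic_suffix == b.dynamic_suffix)  # False: glossary and text differ
```

If the glossary were placed before the breakpoint, the two prefixes would differ and the second request could not reuse the first one's cache entry.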
Why cached input is cheaper
The first request for a given engine and locale writes the stable prefix to the provider's cache. Each later request that reuses that prefix reads it from the cache instead of reprocessing it from scratch. Providers bill cache reads at a fraction of the normal input-token rate, so the bulk of your prompt - the part that never changes - stops being re-billed at full price on every request.
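To make the saving concrete, here is a rough worked example. Every number in it is an assumption chosen for illustration: the per-token price, the cache-read fraction, and the token counts are hypothetical and are not any provider's published rates. It also assumes the first request pays the normal input rate for the prefix and that every later request arrives while the prefix is still cached.

```python
def input_cost(requests: int, prefix_tokens: int, dynamic_tokens: int,
               price_per_token: float, cache_read_fraction: float) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) for input tokens only."""
    if requests < 1:
        raise ValueError("requests must be at least 1")
    per_request_full = (prefix_tokens + dynamic_tokens) * price_per_token
    uncached = requests * per_request_full
    # The first request pays full price for the prefix (the cache write).
    # Later requests pay the reduced rate for the prefix and full price for the rest.
    per_request_warm = (prefix_tokens * cache_read_fraction + dynamic_tokens) * price_per_token
    cached = per_request_full + (requests - 1) * per_request_warm
    return uncached, cached


# Hypothetical figures: 100 requests, a 950-token prefix, 250 fresh tokens per request,
# $3 per million input tokens, cache reads billed at 10% of the input rate.
uncached, cached = input_cost(100, 950, 250, 3e-6, 0.1)
print(f"without caching: ${uncached:.4f}")  # without caching: $0.3600
print(f"with caching:    ${cached:.4f}")    # with caching:    $0.1061
```

The larger the stable prefix is relative to the per-request text, the closer the total gets to the cache-read rate.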
The cache is short-lived and managed by the model provider, not by your engine. That means the benefit is largest when you translate a lot under the same engine and locale in a short window: requests arrive while the prefix is still warm and get read straight from the cache.
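The effect of a short-lived cache can be sketched with a small simulation. The 300-second lifetime is a hypothetical value, and the rule that every request restarts the expiry window is an assumption; actual lifetimes and refresh behavior are set by the model provider.

```python
def classify_requests(timestamps: list[float], ttl_seconds: float) -> list[str]:
    """Label each request 'write' (prefix was cold) or 'read' (prefix was warm)."""
    labels = []
    expires_at = None  # nothing cached yet
    for t in timestamps:
        if expires_at is not None and t < expires_at:
            labels.append("read")
        else:
            labels.append("write")
        # Assumption: both writes and reads restart the expiry window.
        expires_at = t + ttl_seconds
    return labels


# Timestamps are seconds since the first request.
burst = classify_requests([0, 10, 20, 30], ttl_seconds=300)
sparse = classify_requests([0, 600, 1200, 1800], ttl_seconds=300)
print(burst)   # ['write', 'read', 'read', 'read']
print(sparse)  # ['write', 'write', 'write', 'write']
```

The same four requests are almost entirely cache reads when they arrive close together, and all cache writes when they are spread further apart than the cache lifetime.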
Caching is automatic
You don't configure anything. Whether a request uses caching depends on the model handling it - Anthropic and Google models use explicit cache breakpoints, OpenAI models cache long prefixes on their own, and some providers don't cache at all. The engine applies the right behavior per model.
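Conceptually, the per-model behavior amounts to a lookup like the one below. This is an illustrative sketch only: `CacheStrategy`, `STRATEGY_BY_PROVIDER`, and `strategy_for` are hypothetical names, and the mapping simply restates the three cases described above rather than the engine's real routing logic.

```python
from enum import Enum


class CacheStrategy(Enum):
    EXPLICIT_BREAKPOINT = "explicit_breakpoint"  # the engine marks where the stable prefix ends
    AUTOMATIC_PREFIX = "automatic_prefix"        # the provider caches long prefixes on its own
    NONE = "none"                                # the provider does not cache


# Hypothetical provider keys, mirroring the cases named in the text.
STRATEGY_BY_PROVIDER = {
    "anthropic": CacheStrategy.EXPLICIT_BREAKPOINT,
    "google": CacheStrategy.EXPLICIT_BREAKPOINT,
    "openai": CacheStrategy.AUTOMATIC_PREFIX,
}


def strategy_for(provider: str) -> CacheStrategy:
    # Providers not listed are treated as non-caching.
    return STRATEGY_BY_PROVIDER.get(provider.lower(), CacheStrategy.NONE)


print(strategy_for("Anthropic").value)       # explicit_breakpoint
print(strategy_for("openai").value)          # automatic_prefix
print(strategy_for("some-provider").value)   # none
```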
The payoff
- Lower cost - the stable prefix is paid for once at full price, then at the reduced cache-read rate on every repeat request.
- Lower latency - cached tokens don't need to be reprocessed, so warm requests come back faster.
- No setup - caching is on by default; there's nothing to enable in your engine config.
The gains compound with steady traffic against the same engine and locale - exactly the shape of a production localization pipeline, where the same configuration handles request after request.
Reading cache tokens in your usage
Each translation response reports a usage breakdown that separates cache tokens from fresh input:
```json
{
  "usage": {
    "inputTokens": 1200,
    "outputTokens": 800,
    "cacheReadTokens": 950,
    "cacheWriteTokens": 0
  }
}
```

| Field | Meaning |
|---|---|
| inputTokens | Prompt tokens processed fresh on this request |
| outputTokens | Tokens the model generated |
| cacheReadTokens | Prompt tokens served from the provider's cache. 0 when nothing was cached. |
| cacheWriteTokens | Prompt tokens written to the cache on this request - a cache miss / first call. |
A first request for an engine and locale typically shows a positive cacheWriteTokens (the prefix is being written) and cacheReadTokens of 0. Follow-up requests while the cache is still warm flip this: cacheReadTokens climbs and cacheWriteTokens drops to 0. Track aggregate token usage across your engines in Reports.
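If you collect these usage objects yourself, a short script can show how much of your prompt volume is being served from the cache. The sketch below is hedged: `cache_read_share` is a hypothetical helper, the two sample responses are invented to match the cold and warm patterns described above, and it assumes that inputTokens, cacheReadTokens, and cacheWriteTokens count disjoint sets of prompt tokens.

```python
import json


def cache_read_share(usages: list[dict]) -> float:
    """Fraction of all prompt tokens that were read from the cache."""
    read = sum(u["cacheReadTokens"] for u in usages)
    # Assumption: the three prompt-side fields do not overlap.
    total = sum(u["inputTokens"] + u["cacheReadTokens"] + u["cacheWriteTokens"]
                for u in usages)
    return read / total if total else 0.0


# Invented sample responses: a cold first request, then a warm follow-up.
first = json.loads('{"usage": {"inputTokens": 250, "outputTokens": 800, '
                   '"cacheReadTokens": 0, "cacheWriteTokens": 950}}')
follow_up = json.loads('{"usage": {"inputTokens": 250, "outputTokens": 780, '
                       '"cacheReadTokens": 950, "cacheWriteTokens": 0}}')

share = cache_read_share([first["usage"], follow_up["usage"]])
print(f"{share:.1%}")  # 39.6%
```

A share that stays near zero under steady traffic would suggest requests are arriving after the cached prefix has expired, or that the model in use does not cache.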
