Retrieval Augmented Localization (RAL) enriches LLM translation requests with glossary terms, brand voice rules, and locale-specific instructions at inference time. The approach mirrors Retrieval Augmented Generation (RAG) — retrieve relevant context, inject it alongside the input, get better output. In a controlled evaluation across six LLM providers and five European languages, RAL reduced terminology errors by 24-45%.
Key findings:
- RAL reduced terminology errors by 24-45% for five of the six LLM providers tested (Google, the sixth, saw a smaller 17% reduction)
- Holistic quality scores (GEMBA-DA) could not detect these differences: the deltas were only 0.002-0.02, while MQM counted thousands fewer errors
- The glossary alone drove the entire quality gain — brand voice and locale-specific instructions added no measurable terminology benefit
- Weaker models benefited most: Mistral (-45%) and Deepseek (-42%) vs. Anthropic (-24%) and Google (-17%)
- Portuguese showed the largest per-locale improvement; French the smallest — the further domain terminology diverges from training data, the more RAL helps
How production localization works#
A CI/CD pipeline doesn't retranslate an entire product every release. It diffs against the previous version and retranslates what changed - a paragraph, a UI string, a modified tooltip. A JSON locale file contains individual keys, each holding a phrase or sentence. A CMS page is composed of blocks, each translated independently.
The unit of production localization is small: a paragraph, a string, a diff. Rarely more than 200 words. Often fewer than 50. Each translation request arrives at the LLM in isolation - without the surrounding page, without the document's full context, without any signal that this text is EU legal prose versus marketing copy.
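The diff step above is simple to sketch. The following is a minimal illustration, assuming locale files already deserialized into plain dictionaries; the function and key names are ours, not any real pipeline's API:

```python
# Hypothetical sketch of a diff-based localization step: compare the previous
# release's locale file with the current one and collect only the keys whose
# source text is new or changed, so only those get retranslated.

def keys_to_retranslate(previous: dict, current: dict) -> list[str]:
    """Return keys that are new or whose source string changed."""
    return sorted(
        key for key, text in current.items()
        if previous.get(key) != text  # new key or modified source string
    )

previous = {
    "checkout.title": "Review your order",
    "checkout.cta": "Place order",
}
current = {
    "checkout.title": "Review your order",
    "checkout.cta": "Confirm and place order",   # modified
    "checkout.note": "Provider fees may apply",  # new
}

print(keys_to_retranslate(previous, current))
# ['checkout.cta', 'checkout.note']
```

Each returned key becomes an isolated translation request - exactly the small, contextless unit described above.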
When the model encounters "provider" in an isolated English paragraph, it has to decide: is this Portuguese "fornecedor" (the common word) or "prestador" (the official EU legal term)? Without domain context, it picks the common one. Multiply this across every domain-specific term in every locale, and terminology drift becomes the default.
We set out to measure exactly how large this gap is - and whether injecting glossary context at inference time closes it.
The first attempt showed nothing#
Our initial experiment used 37 glossary terms per locale pair and scored translations at article level - each article (200-700 words) evaluated as a single unit. The results: GEMBA-DA — the WMT23 winning holistic quality prompt — reported 0.952 for raw and 0.952 for configured. MQM error annotation produced scores of 0.985-0.999 for every translation. No signal. No difference. By every metric, raw and glossary-augmented output were identical.
We almost published a null result. Then we looked at why.
Two problems. First, 37 glossary terms was too few - many test paragraphs contained zero glossary hits, so the configured engine had no advantage. Second, article-level scoring mathematically compresses quality differences into noise. MQM scores are computed as 1 - penalty / wordCount. A single major terminology error in a 500-word article: 1 - 5/500 = 0.99. The same error in a 50-word paragraph: 1 - 5/50 = 0.90. The error is identical. The score is not. At article level, every real quality difference vanishes above 0.98.
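The compression effect is easy to verify with the formula from the text (a major error carries a penalty weight of 5):

```python
# MQM score as defined above: 1 - penalty / wordCount.
# The same single major terminology error, scored at two granularities.

def mqm_score(penalty: float, word_count: int) -> float:
    return 1 - penalty / word_count

MAJOR = 5  # severity weight for one major error

print(mqm_score(MAJOR, 500))  # 500-word article  -> 0.99
print(mqm_score(MAJOR, 50))   # 50-word paragraph -> 0.9
```

Identical error, a tenfold difference in how much it moves the score - which is why article-level evaluation saturates above 0.98.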
This is not just a measurement problem for our study. It applies to every translation benchmark that evaluates at page or article level. The errors are there. The metric cannot see them.
We changed the lens#
For the second iteration, we made four changes.
First, we expanded the glossary from 37 to ~100-150 terms per locale pair — all 67 official definitions from Article 3 of the EU AI Act, plus terms from related articles. Second, we scored at paragraph level (50-200 words), matching the actual unit of production translation. Third, we added human reference translations to the MQM scoring prompt so judges could compare terminology directly. Fourth, we reduced judges from six to four. Deepseek and QWEN flagged only 1-3 errors per paragraph versus 5-15 for stricter judges — too lenient to add signal.
The signal appeared immediately.
Study design#
Dataset. The EU AI Act (Regulation 2024/1689), translated from English into German, French, Spanish, Portuguese, and Italian. 15 articles, scored paragraph-by-paragraph against official EUR-Lex human translations as reference.
Providers. Six LLMs, each in two configurations - raw (model only) and RAL-augmented (glossary + brand voice + instructions):
| Provider | Model |
|---|---|
| Anthropic | claude-opus-4.6 |
| OpenAI | gpt-5.4 |
| Google | gemini-3.1-pro-preview |
| Mistral | mistral-large-2512 |
| Deepseek | deepseek-v3.2 |
| QWEN | qwen3.5-397b-a17b |
RAL configuration. Each augmented engine contained ~100-150 glossary terms per locale pair (official EU legal terminology), a brand voice profile (formal EU regulatory register), and 13 locale-specific instructions. Engines were configured on Lingo.dev as stateful localization engines — persistent context applied to every request.
Scoring. Paragraph-level MQM with four independent LLM judges (Claude Sonnet 4.6, GPT-4.1, Gemini 2.5 Flash, Mistral Large), averaged to smooth bias. Absolute error counts varied roughly 3x across judges — Anthropic flagged 3-5 errors per paragraph, Mistral flagged 5-15. But the relative improvement between raw and RAL was consistent within each judge. Error categories: accuracy, fluency, style, terminology. Severity weights: minor=1, major=5, critical=25.
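Under those severity weights, per-paragraph scoring and judge averaging look roughly like this. The judge error counts below are invented for illustration; only the weights and the averaging scheme come from the study design:

```python
# Sketch of paragraph-level MQM scoring with four judges, as described above.
# Severity weights from the study: minor=1, major=5, critical=25.

SEVERITY = {"minor": 1, "major": 5, "critical": 25}

def score_paragraph(errors: dict[str, int], word_count: int) -> float:
    penalty = sum(SEVERITY[sev] * n for sev, n in errors.items())
    return 1 - penalty / word_count

# Four judges annotate the same 100-word paragraph (counts are illustrative).
judgments = [
    {"minor": 2, "major": 1, "critical": 0},  # stricter judge
    {"minor": 1, "major": 0, "critical": 0},  # lenient judge
    {"minor": 3, "major": 1, "critical": 0},
    {"minor": 1, "major": 1, "critical": 0},
]

scores = [score_paragraph(j, word_count=100) for j in judgments]
avg = sum(scores) / len(scores)
print(round(avg, 4))  # -> 0.945
```

Averaging across judges smooths per-judge strictness, which is why absolute counts can vary 3x while the raw-vs-RAL deltas stay consistent.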
Sample size. ~530 paired paragraph observations per provider across five locales. Over 50,000 individual quality judgments total.
Terminology errors drop 24-45%#
| Provider | Raw errors | RAL errors | Reduction |
|---|---|---|---|
| Mistral | 3,336 | 1,847 | -45% |
| Deepseek | 3,672 | 2,127 | -42% |
| OpenAI | 2,276 | 1,508 | -34% |
| QWEN | 3,206 | 2,398 | -25% |
| Anthropic | 1,559 | 1,179 | -24% |
| Google | 1,901 | 1,586 | -17% |
Terminology error counts from MQM across 15 articles, 5 locales, and 4 judges. Only paired paragraphs compared.
Weaker models benefited most. Mistral and Deepseek - with the highest raw error counts - saw 42-45% reductions. Anthropic and Google - which already knew most official EU terminology from training data - saw smaller gains. The pattern: RAL compensates for what the model doesn't know. Models that know less benefit more.
Meanwhile, GEMBA-DA - the holistic score - reported a delta of only 0.002-0.02 between raw and RAL across all providers. The same translations in which MQM counted 24-45% fewer terminology errors received nearly identical holistic scores. This is the measurement gap: holistic evaluation at any granularity cannot detect terminology-level quality differences.
Total errors (all MQM categories) showed a smaller but consistent reduction for five of six providers:
| Provider | Raw total | RAL total | Change |
|---|---|---|---|
| Deepseek | 10,423 | 9,014 | -13.5% |
| Mistral | 8,846 | 7,812 | -11.7% |
| OpenAI | 7,563 | 7,155 | -5.4% |
| Google | 7,793 | 7,545 | -3.2% |
| Anthropic | 6,232 | 6,039 | -3.1% |
| QWEN | 9,468 | 10,999 | +16.2% |
QWEN is an outlier - RAL-augmented translations had 16.2% more total errors than raw. This regression is under investigation.
Where RAL matters most#
Portuguese showed the largest terminology improvements across all providers. Portuguese legal terminology diverges significantly from everyday Portuguese, and EU legal terms in Portuguese are underrepresented in LLM training data. French showed the smallest - French legal terms are well-represented in training corpora.
Case study: OpenAI Portuguese
OpenAI's raw output translated the EU AI Act into Portuguese using "alto risco" 71 times (the colloquial "high risk"), "fornecedores" 39 times, and "fornecedor" 36 times. The official EUR-Lex translations use "risco elevado" and "prestadores." With RAL, OpenAI Portuguese terminology errors dropped from 648 to 266 — a 59% reduction.
The pattern generalizes: locales whose domain terminology is further from the LLM's training distribution benefit more from RAL.
The mechanism: what actually helped#
We ran a separate experiment removing brand voice and locale-specific instructions from one engine (OpenAI), keeping only the glossary. Terminology improvement was identical to the full RAL configuration, with fewer style errors. The glossary - the simplest RAL component - drove the entire quality gain.
Brand voice ("preserve exact phrasing," "maintain authoritative institutional tone") added rigidity to translations without measurable benefit. The locale-specific instructions (elision rules, quotation conventions) were difficult for MQM judges to evaluate consistently. Style was the noisiest MQM category — one judge (Anthropic) accounted for 63% of style divergences between raw and configured output.
The effective mechanism is straightforward. At inference time, the engine decomposes input text into n-gram phrases and embeds them. It then runs cosine similarity search against the glossary's vector index to find matching terms. Matched terms are injected into the LLM's context window alongside the source text. The model doesn't guess "fornecedor" or "prestador" — it sees the correct mapping in context and uses it. Structurally identical to RAG: embed, retrieve, inject, generate.
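The embed-retrieve-inject loop can be sketched in a few lines. This is a toy illustration, not Lingo.dev's implementation: a character-trigram bag stands in for a learned embedding model so the example runs without dependencies, the three glossary entries are invented, and the 0.35 similarity threshold is arbitrary.

```python
# Toy sketch of glossary retrieval: embed source phrases, cosine-match them
# against glossary terms, and keep the hits for injection into the prompt.

import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag of character trigrams (not a real model)."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

GLOSSARY = {  # invented source -> pt-PT entries for illustration
    "provider": "prestador",
    "high-risk AI system": "sistema de IA de risco elevado",
    "deployer": "responsável pela implantação",
}

def retrieve_terms(source: str, threshold: float = 0.35) -> dict[str, str]:
    """Decompose the source into word n-gram phrases; keep glossary matches."""
    words = source.split()
    phrases = {" ".join(words[i:i + n]) for n in (1, 2, 3, 4)
               for i in range(len(words) - n + 1)}
    return {
        term: target for term, target in GLOSSARY.items()
        if max(cosine(embed(term), embed(p)) for p in phrases) >= threshold
    }

source = "The provider of a high-risk AI system shall keep the logs."
matched = retrieve_terms(source)
# Matched pairs are injected into the LLM's context alongside the source text.
prompt_context = "\n".join(f"{s} => {t}" for s, t in matched.items())
print(sorted(matched))
```

"provider" and "high-risk AI system" match and get injected; "deployer" does not appear in the source and is left out - the model sees only the mappings it needs.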
Provider ranking by raw quality#
Without RAL - raw model output only:
| Rank | Provider | MQM avg |
|---|---|---|
| 1 | Anthropic | 0.955 |
| 2 | OpenAI | 0.942 |
| 3 | Google | 0.938 |
| 4 | Mistral | 0.915 |
| 5 | QWEN | 0.894 |
| 6 | Deepseek | 0.883 |
The 0.072 gap between Anthropic and Deepseek represents roughly 3-4 additional errors per 100-word paragraph. RAL narrowed this gap: Mistral with RAL (0.924 avg) approached Google's raw quality (0.938). A model at a fraction of the per-token cost, augmented with a 150-term glossary, matched the terminology accuracy of a more expensive model without one.
Limitations#
- Domain specificity. The EU AI Act is formal legal text with official terminology. RAL's impact on marketing copy, product UI, or literary translation is not tested here - though the mechanism (glossary injection) is domain-agnostic.
- Statistical significance. Sample sizes (~530 paired observations per provider) support detection of the observed differences, but formal p-values have not been computed.
- Component isolation. The glossary-only experiment was conducted for one provider (OpenAI). We cannot generalize that brand voice and instructions are unhelpful across all providers.
- Style error confound. Early RAL-augmented translations included guillemet-wrapping instructions that inflated style error counts by ~1,000 errors. These instructions were removed, but not all translations have been re-scored.
- QWEN regression. The cause of QWEN's 16.2% error increase under RAL is not yet identified.
- Conflict of interest. This study was conducted by Lingo.dev using Lingo.dev's localization engines. Scoring prompts are based on the open GEMBA and GEMBA-MQM frameworks from WMT23. Raw scenario data (6,400 scored paragraphs with individual error annotations) is available for independent verification.
What this means in production#
The quality gap between raw LLM output and production-ready localization is not primarily a model problem. It is a context problem - and it compounds.
In a diff-based localization workflow, each translation request is isolated. The LLM translates a changed paragraph without the surrounding page, without memory of how it translated the same term yesterday, without knowing that "provider" in this codebase means "prestador," not "fornecedor." Without RAL, every isolated request is a fresh opportunity for terminology drift. After ten releases, three different wrong translations of "provider" coexist across the product.
RAL breaks this pattern. The glossary is persistent - it applies to every request, regardless of what changed. The 150-term glossary that reduced errors by 24-45% in our study is not a one-time improvement. It is a consistency layer across every translation request over the lifetime of the product.
Two findings for teams shipping LLM translations: first, holistic quality scores cannot detect terminology-level problems. GEMBA-DA — the WMT23 winning method — scored raw and RAL-augmented translations within 0.002-0.02 of each other. MQM counted 24-45% fewer terminology errors. If you evaluate at page level with a single score, you are not seeing the full picture.
Second, the fix is simpler than the problem suggests. Not a better model - a better context pipeline. A 150-term glossary, injected at inference time, reduced terminology errors across every provider we tested. The model that translates best (Anthropic, MQM 0.955) still improved. The model that translates worst (Deepseek, MQM 0.883) improved most.
RAL is to localization what RAG is to generation: the engineering layer between the model and production.

