
Retrieval Augmented Localization Cuts LLM Terminology Errors 17-45%

Veronica Prilutskaya, CPO & Co-Founder · Published about 1 month ago · 10 min read

Production localization translates isolated paragraphs and strings. A CI/CD pipeline diffs against the previous version and retranslates what changed — a UI string, a tooltip, a modified paragraph. Each request arrives at the LLM in isolation — without the surrounding page, without the document's full context, without any signal that this text is EU legal prose versus marketing copy. Without domain context injected at inference time, every isolated request is a fresh opportunity for terminology drift.

Retrieval Augmented Localization (RAL) closes this gap by enriching each translation request with glossary terms, brand voice rules, and locale-specific instructions at inference time — the same retrieve-inject pattern behind Retrieval Augmented Generation (RAG). In a controlled evaluation across five LLM providers and five European languages, RAL reduced terminology errors by 16.6-44.6%.

Key findings:

  • RAL reduced terminology errors by 16.6-44.6% across all five LLM providers tested
  • Holistic quality scores (GEMBA-DA) could not detect these differences: deltas of just 0.0007-0.0178, while MQM counted thousands fewer errors
  • Models with lower baseline terminology scores gained the most: Mistral (-44.6%) and Deepseek (-42.1%) vs. Anthropic (-24.4%) and Google (-16.6%)
  • Portuguese showed the largest per-locale improvement; French the smallest — the further domain terminology diverges from training data, the more RAL helps

The isolation problem#

The unit of production localization is small: a paragraph, a string, a diff. Rarely more than 200 words. Often fewer than 50. A JSON locale file contains individual keys, each holding a phrase or sentence. A CMS page is composed of blocks, each translated independently.
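The diff-and-retranslate loop described above can be sketched in a few lines. The locale files and keys here are hypothetical; the point is that each changed value reaches the LLM as an isolated request:

```python
# Hypothetical locale files: each key holds one isolated string.
old = {
    "cta.signup": "Sign up",
    "tooltip.provider": "Your provider handles billing.",
    "legal.scope": "High-risk AI systems are regulated.",
}
new = {
    "cta.signup": "Sign up",
    "tooltip.provider": "Your provider handles billing and invoices.",
    "legal.scope": "High-risk AI systems are regulated.",
}

# A CI/CD pipeline diffs the two versions and retranslates only what changed.
changed_keys = {k for k in new if new[k] != old.get(k)}

# Each changed value is sent to the model on its own, without the surrounding page.
requests = [{"key": k, "source": new[k]} for k in sorted(changed_keys)]
```

Only `tooltip.provider` is retranslated here; the model never sees the legal string that would have told it which register "provider" belongs to.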

When the model encounters "provider" in an isolated English paragraph, it has to decide: is this Portuguese "fornecedor" (the common word) or "prestador" (the official EU legal term)? Without domain context, it picks the common one. Multiply this across every domain-specific term in every locale, and terminology drift becomes the default.

We set out to measure exactly how large this gap is — and whether injecting glossary context at inference time closes it.

The first attempt showed nothing#

Our initial experiment used 37 glossary terms per locale pair and scored translations at article level: each article (200-700 words) was evaluated as a single unit. The results: GEMBA-DA — the WMT23 winning holistic quality prompt — reported 0.952 for raw and 0.952 for configured. MQM error annotation produced scores of 0.985-0.999 for every translation. No signal. No difference. By every metric, raw and glossary-augmented output were identical.

We almost published a null result. Then we looked at why.

Two problems. First, 37 glossary terms were too few: many test paragraphs contained zero glossary hits, so the configured engine had no advantage. Second, article-level scoring mathematically compresses quality differences into noise. MQM scores are computed as 1 - penalty / word count. A single major terminology error in a 500-word article: 1 - 5/500 = 0.99. The same error in a 50-word paragraph: 1 - 5/50 = 0.90. The error is identical. The score is not. At article level, every real quality difference vanishes above 0.98.
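The compression effect is easy to reproduce. A minimal sketch of the score formula from the text, applied to the same single major error at both granularities:

```python
def mqm_score(weighted_penalty: float, word_count: int) -> float:
    """MQM score as defined in the study: max(0, 1 - penalty / word count)."""
    return max(0.0, 1.0 - weighted_penalty / word_count)

# One major terminology error (penalty 5), scored at two granularities.
article_score = mqm_score(5, 500)    # article level
paragraph_score = mqm_score(5, 50)   # paragraph level
```

Same error, a 0.09 score difference purely from the size of the scoring unit.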

This is not just a measurement problem for our study. It applies to every translation benchmark that evaluates at page or article level. The errors are there. The metric cannot see them.

We changed the lens#

For the second iteration, we made four changes.

First, we expanded the glossary from 37 to 72 terms per locale pair — extracted from a training set of articles, separate from the test set used for evaluation. Second, we scored at paragraph level (50-200 words), matching the actual unit of production translation. Third, we added human reference translations to the MQM scoring prompt so judges could compare terminology directly. Fourth, we reduced judges from six to four. Deepseek and QWEN flagged only 1-3 errors per paragraph versus 5-15 for stricter judges — too lenient to add signal.

The signal appeared immediately.

Study design#

Dataset. We wanted the most terminology-dense text type available to stress-test glossary injection under demanding conditions. The EU AI Act (Regulation 2024/1689) fit: formal regulatory text where every paragraph carries terms with specific, officially defined translations. EUR-Lex publishes official human translations in all five target languages, enabling paragraph-by-paragraph scoring against ground truth. 15 articles, English into German, French, Spanish, Portuguese, and Italian.

Engines. Each provider was tested in two localization-engine configurations: a raw engine (the LLM on its own — no glossary, no retrieval, translating from training knowledge alone) and a RAL-augmented engine (the same model, with a domain glossary, brand voice profile, and locale-specific instructions applied at inference time). Ten engines in total; every RAL-augmented engine shared the same glossary, brand voice, and instruction configuration.

| Provider  | Model                  | Raw engine | RAL engine                            |
|-----------|------------------------|------------|---------------------------------------|
| Anthropic | claude-opus-4.6        | model only | glossary + brand voice + instructions |
| OpenAI    | gpt-5.4                | model only | glossary + brand voice + instructions |
| Google    | gemini-3.1-pro-preview | model only | glossary + brand voice + instructions |
| Mistral   | mistral-large-2512     | model only | glossary + brand voice + instructions |
| Deepseek  | deepseek-v3.2          | model only | glossary + brand voice + instructions |

QWEN was initially included but dropped from the final set — translations were slow and unreliable, the same issue that disqualified it as a judge.

RAL configuration. Each augmented engine contained 72 glossary terms per locale pair (70 custom translations plus 2 non-translatables), a brand voice profile (formal EU regulatory register), and 13 locale-specific instructions. Glossary terms were extracted from a training set of articles separate from the test set used for evaluation. Example entries: EN "provider" → PT "prestador" (not "fornecedor"); EN "high-risk AI system" → PT "sistema de IA de risco elevado" (not "sistema de IA de alto risco"). At inference time, only terms matching the current paragraph are retrieved and passed to the model — glossary size does not bloat the context window. Engines were configured on Lingo.dev as stateful localization engines — persistent context applied to every request.
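The shape of a glossary entry and the "only matching terms are retrieved" behavior can be sketched as follows. The entries are modeled on the examples in the text; the naive substring match here stands in for the engine's embedding-based retrieval, described later:

```python
# Hypothetical glossary entries, modeled on the examples in the text.
GLOSSARY_EN_PT = [
    {"source": "high-risk AI system", "target": "sistema de IA de risco elevado",
     "avoid": "sistema de IA de alto risco"},
    {"source": "provider", "target": "prestador", "avoid": "fornecedor"},
]

def matching_terms(paragraph: str, glossary: list[dict]) -> list[dict]:
    """Return only the entries whose source term appears in the paragraph,
    so glossary size never bloats the context window."""
    text = paragraph.lower()
    return [e for e in glossary if e["source"].lower() in text]

paragraph = "The provider of a high-risk AI system shall keep logs."
hits = matching_terms(paragraph, GLOSSARY_EN_PT)
```

A paragraph with no glossary hits adds nothing to the prompt; a 72-term or 7,200-term glossary costs the same per request.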

Scoring. Each translated paragraph was scored by four LLM judges, averaged to smooth individual judge bias. Each judge scores all providers' outputs, not just its own:

| Judge     | Model              |
|-----------|--------------------|
| Anthropic | claude-sonnet-4.6  |
| OpenAI    | gpt-4.1            |
| Google    | gemini-2.5-flash   |
| Mistral   | mistral-large-2512 |

GEMBA-MQM. MQM (Multidimensional Quality Metrics) is a standard framework for translation quality evaluation — normally performed by trained human annotators. GEMBA-MQM, the WMT23 winning evaluation method, replaces human annotators with an LLM while following the same MQM protocol: the judge reads the translation and flags every error, assigning each a category and a severity.

Error categories: accuracy, fluency, style, terminology. Severity weights follow the official MQM standard: minor = 1, major = 5, critical = 25.

MQM score per paragraph: max(0, 1 - weighted penalty / word count). A 50-word paragraph with one major terminology error scores 1 - 5/50 = 0.90. A perfect paragraph scores 1.0. Error counts in the results tables are summed across all four judges and all paragraphs for a given provider and locale.
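Putting the severity weights and the score formula together, scoring one judge's annotations for a paragraph might look like this (the error list is illustrative, not taken from the study):

```python
# Official MQM severity weights cited in the text.
WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

def score_paragraph(errors: list[dict], word_count: int) -> float:
    """max(0, 1 - weighted penalty / word count), per the formula above."""
    penalty = sum(WEIGHTS[e["severity"]] for e in errors)
    return max(0.0, 1.0 - penalty / word_count)

# Illustrative judge annotations for a 50-word paragraph.
errors = [
    {"category": "terminology", "severity": "major"},  # e.g. "fornecedor" for "prestador"
    {"category": "fluency", "severity": "minor"},
]
score = score_paragraph(errors, 50)  # penalty 5 + 1 = 6
```

The `max(0, ...)` floor matters only for short paragraphs with critical errors, where the weighted penalty can exceed the word count.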

One change from the standard GEMBA-MQM prompt: we added the human reference translation. GEMBA-MQM is reference-free by design — the judge evaluates quality without seeing the "correct" answer. We added references because EUR-Lex publishes official translations of the EU AI Act in all five target languages, giving judges ground truth to compare terminology against.

GEMBA-DA. A holistic 0-1 quality score using the GEMBA-DA prompt (also WMT23 winning). Unlike MQM, it produces a single score with no error annotations. We include it as a sanity check — as the results show, it cannot detect terminology-level differences.

Deepseek was excluded from the judge panel due to overly lenient scoring (1-3 errors per paragraph vs 5-15 for stricter judges). Averaging across four judges smooths individual bias, and the relative raw-vs-RAL improvement is consistent within every judge.

Sample size. 535 paired paragraph observations per provider (107 paragraphs × 5 locales). Over 42,000 individual quality judgments total (535 paragraphs × 5 providers × 2 configurations × 8 scores each).

Terminology errors drop 16.6-44.6%#

| Provider  | Raw errors | RAL errors | Reduction |
|-----------|------------|------------|-----------|
| Mistral   | 3,336      | 1,847      | -44.6%    |
| Deepseek  | 3,672      | 2,127      | -42.1%    |
| OpenAI    | 2,276      | 1,508      | -33.7%    |
| Anthropic | 1,559      | 1,179      | -24.4%    |
| Google    | 1,901      | 1,586      | -16.6%    |

Terminology error counts from MQM across 15 articles, 5 locales, and 4 judges.

Improvement tracked inversely with baseline score. Mistral and Deepseek — with the highest raw error counts — saw 42.1-44.6% reductions. Anthropic and Google — which already reflected more EU legal terminology in training — saw smaller gains. The pattern: RAL compensates for what the model doesn't already know.

Meanwhile, GEMBA-DA (the holistic score) reported a delta of only 0.0007-0.0178 between raw and RAL across all providers. The same translations for which MQM counted 16.6-44.6% fewer terminology errors received nearly identical holistic scores. This is the measurement gap: holistic evaluation at any granularity cannot detect terminology-level quality differences.

Total errors (all MQM categories) showed a smaller but consistent reduction across all five providers:

| Provider  | Raw total | RAL total | Change |
|-----------|-----------|-----------|--------|
| Deepseek  | 10,423    | 9,014     | -13.5% |
| Mistral   | 8,846     | 7,812     | -11.7% |
| OpenAI    | 7,563     | 7,155     | -5.4%  |
| Google    | 7,793     | 7,545     | -3.2%  |
| Anthropic | 6,232     | 6,039     | -3.1%  |

The gap between terminology reduction (16.6-44.6%) and total reduction (3.1-13.5%) is largely explained by style. LLM judges tend to flag text as "awkward" when it diverges from their training-data preferences, even when the divergence moves toward the official reference — a known limitation called self-preference bias. Terminology and accuracy are anchored against the reference; style has no anchor beyond the judge's own sense of what sounds natural.

Statistical significance#

Terminology error reduction was tested per provider using a paired Wilcoxon signed-rank test (one-sided, Holm-Bonferroni corrected across five providers). Per-paragraph terminology error counts were summed across four judges, then paired by paragraph (same source, same judges, raw vs RAL).
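The correction step can be sketched without a statistics library. The per-paragraph test itself would come from something like `scipy.stats.wilcoxon(raw_counts, ral_counts, alternative="greater")`; the raw p-values below are hypothetical stand-ins for its output:

```python
def holm_bonferroni(pvals: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Holm-Bonferroni: sort p-values ascending, compare the i-th smallest
    against alpha / (m - i), and stop rejecting at the first failure."""
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    rejected, still_rejecting = {}, True
    for i, (name, p) in enumerate(ordered):
        if p > alpha / (m - i):
            still_rejecting = False
        rejected[name] = still_rejecting
    return rejected

# Hypothetical uncorrected p-values for the five providers.
raw_p = {"Mistral": 1e-8, "Deepseek": 3e-8, "OpenAI": 2e-6,
         "Anthropic": 4e-5, "Google": 9e-4}
decisions = holm_bonferroni(raw_p)
```

Holm's step-down procedure is uniformly more powerful than plain Bonferroni while controlling the same family-wise error rate, which is why it is the standard choice for a small set of provider-level comparisons.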

| Provider  | Paired paragraphs | Mean reduction/paragraph | 95% CI       | Cohen's d | p (adjusted) |
|-----------|-------------------|--------------------------|--------------|-----------|--------------|
| Mistral   | 532               | 2.80                     | [2.42, 3.21] | 0.60      | < 0.001      |
| Deepseek  | 526               | 2.94                     | [2.45, 3.44] | 0.50      | < 0.001      |
| OpenAI    | 535               | 1.44                     | [1.12, 1.77] | 0.37      | < 0.001      |
| Anthropic | 533               | 0.71                     | [0.50, 0.93] | 0.28      | < 0.001      |
| Google    | 533               | 0.59                     | [0.34, 0.85] | 0.20      | < 0.001      |

All five providers show statistically significant terminology error reductions (p < 0.001 after Holm-Bonferroni correction for multiple comparisons), with 95% confidence intervals excluding zero. Effect sizes range from medium-large (Mistral, d = 0.60) to small (Google, d = 0.20) — consistent with the pattern that models with lower baseline terminology coverage benefit more from RAL.

Where RAL matters most#

Portuguese showed the largest terminology improvements across all providers. Portuguese legal terminology diverges significantly from everyday Portuguese, and EU legal terms in Portuguese are underrepresented in LLM training data. French showed the smallest: French legal terms are well-represented in training corpora.

Case study: OpenAI Portuguese

OpenAI's raw output translated the EU AI Act into Portuguese using "alto risco" 71 times (the colloquial "high risk"), "fornecedores" 39 times, and "fornecedor" 36 times. The official EUR-Lex translations use "risco elevado" and "prestadores." With RAL, OpenAI Portuguese terminology errors dropped from 648 to 266 — a 59% reduction.

The pattern generalizes: locales whose domain terminology is further from the LLM's training distribution benefit more from RAL.

The mechanism#

The effective mechanism is straightforward. At inference time, the engine decomposes input text into n-gram phrases and embeds them. It then runs cosine similarity search against the glossary's vector index to find matching terms. Matched terms are injected into the LLM's context window alongside the source text. The model doesn't guess "fornecedor" or "prestador" — it sees the correct mapping in context and uses it. Structurally identical to RAG: embed, retrieve, inject, generate.
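The embed-retrieve-inject loop can be sketched end to end. The character-trigram "embedding" below is a toy stand-in for a real embedding model, the similarity threshold is hypothetical, and the prompt format is illustrative; only the structure of the loop mirrors the description above:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts. A production engine would use
    a learned embedding model; the retrieval logic is the same."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Glossary vector index, built once at configuration time.
GLOSSARY = {"provider": "prestador",
            "high-risk AI system": "sistema de IA de risco elevado"}
INDEX = {term: embed(term) for term in GLOSSARY}

def retrieve(paragraph: str, threshold: float = 0.5) -> dict[str, str]:
    """Decompose the input into n-gram phrases, embed them, and keep glossary
    terms whose vectors are close to some phrase."""
    words = paragraph.split()
    phrases = [" ".join(words[i:i + n]) for n in (1, 2, 3, 4)
               for i in range(len(words) - n + 1)]
    return {term: GLOSSARY[term] for term, vec in INDEX.items()
            if max(cosine(embed(p), vec) for p in phrases) >= threshold}

paragraph = "The provider shall register each high-risk AI system."
terms = retrieve(paragraph)
prompt = ("Translate to pt-PT using these terms:\n"
          + "\n".join(f"{s} -> {t}" for s, t in terms.items())
          + f"\n\n{paragraph}")
```

Because matching happens per request, the model only ever sees the handful of terms relevant to the current paragraph, not the whole glossary.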

Provider ranking by raw quality#

Without RAL (raw model output only):

| Rank | Provider  | MQM avg |
|------|-----------|---------|
| 1    | Anthropic | 0.955   |
| 2    | OpenAI    | 0.942   |
| 3    | Google    | 0.938   |
| 4    | Mistral   | 0.915   |
| 5    | Deepseek  | 0.883   |

The 0.072 gap between Anthropic and Deepseek represents roughly 3-4 additional errors per 100-word paragraph. RAL narrowed this gap: Mistral with RAL (0.940 avg) approached Google's raw quality (0.938). A model at a fraction of the per-token cost, augmented with a 72-term glossary, matched the overall quality of a more expensive model running without one.

What this means in production#

The quality gap between raw LLM output and production-ready localization is a context problem — and it compounds. After ten releases without RAL, three different wrong translations of "provider" coexist across the product.

RAL breaks this pattern. The glossary is persistent — it applies to every request, regardless of what changed. The 72-term glossary that reduced errors by 16.6-44.6% in our study is not a one-time improvement. It is a consistency layer across every translation request over the lifetime of the product.

Two findings for teams shipping LLM translations: first, holistic quality scores cannot detect terminology-level problems. GEMBA-DA — the WMT23 winning method — scored raw and RAL-augmented translations within 0.0007-0.0178 of each other. MQM counted 16.6-44.6% fewer terminology errors. If you evaluate at page level with a single score, you are not seeing the full picture.

Second, the fix is simpler than the problem suggests. A domain glossary injected at inference time reduced terminology errors across every provider we tested. The model that translates best (Anthropic, MQM 0.955) still improved. The model with the highest baseline error rate (Deepseek, MQM 0.883) improved most.

RAL is to localization what RAG is to generation: the engineering layer between the model and production.

Next steps#

  • Introducing Lingo.dev v1.0: the localization engineering platform built around RAL
  • Localization engines: configure models, glossaries, and brand voice per locale
