
Retrieval Augmented Localization Cuts LLM Terminology Errors 17-45%

Veronica Prilutskaya, CPO & Co-Founder · Published about 1 month ago · 10 min read

Production localization translates isolated paragraphs and strings. A CI/CD pipeline diffs against the previous version and retranslates what changed — a UI string, a tooltip, a modified paragraph. Each request arrives at the LLM in isolation — without the surrounding page, without the document's full context, without any signal that this text is EU legal prose versus marketing copy. Without domain context injected at inference time, every isolated request is a fresh opportunity for terminology drift.

Retrieval Augmented Localization (RAL) closes this gap by enriching each translation request with glossary terms, brand voice rules, and locale-specific instructions at inference time — the same retrieve-inject pattern behind Retrieval Augmented Generation (RAG). In a controlled evaluation across five LLM providers and five European languages, RAL reduced terminology errors by 16.6-44.6%.

Key findings:

  • RAL reduced terminology errors by 16.6-44.6% across all five LLM providers tested
  • Holistic quality scores (GEMBA-DA) could not detect these differences: deltas of just 0.0007-0.0178, while MQM counted thousands fewer errors
  • Models with lower baseline terminology scores gained the most: Mistral (-44.6%) and Deepseek (-42.1%) vs. Anthropic (-24.4%) and Google (-16.6%)
  • Portuguese showed the largest per-locale improvement; French the smallest — the further domain terminology diverges from training data, the more RAL helps

The isolation problem#

The unit of production localization is small: a paragraph, a string, a diff. Rarely more than 200 words. Often fewer than 50. A JSON locale file contains individual keys, each holding a phrase or sentence. A CMS page is composed of blocks, each translated independently.
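The diff-and-retranslate loop described above can be sketched in a few lines. The locale files and keys here are hypothetical; the point is that each changed value reaches the LLM as an isolated request:

```python
# Hypothetical locale files: each key holds one isolated string.
old = {
    "cta.signup": "Sign up",
    "tooltip.provider": "Your provider handles billing.",
    "legal.scope": "High-risk AI systems are regulated.",
}
new = {
    "cta.signup": "Sign up",
    "tooltip.provider": "Your provider handles billing and invoices.",
    "legal.scope": "High-risk AI systems are regulated.",
}

# A CI/CD pipeline diffs the two versions and retranslates only what changed.
changed_keys = {k for k in new if new[k] != old.get(k)}

# Each changed value is sent to the model on its own, without the surrounding page.
requests = [{"key": k, "source": new[k]} for k in sorted(changed_keys)]
```

Only `tooltip.provider` is retranslated here; the model never sees the legal string that would have told it which register "provider" belongs to.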

When the model encounters "provider" in an isolated English paragraph, it has to decide: is this Portuguese "fornecedor" (the common word) or "prestador" (the official EU legal term)? Without domain context, it picks the common one. Multiply this across every domain-specific term in every locale, and terminology drift becomes the default.

We set out to measure exactly how large this gap is — and whether injecting glossary context at inference time closes it.

The first attempt showed nothing#

Our initial experiment used 37 glossary terms per locale pair and scored translations at article level: each article (200-700 words) was evaluated as a single unit. The results: GEMBA-DA — the WMT23 winning holistic quality prompt — reported 0.952 for raw and 0.952 for configured. MQM error annotation produced scores of 0.985-0.999 for every translation. No signal. No difference. By every metric, raw and glossary-augmented output were identical.

We almost published a null result. Then we looked at why.

Two problems. First, 37 glossary terms were too few: many test paragraphs contained zero glossary hits, so the configured engine had no advantage. Second, article-level scoring mathematically compresses quality differences into noise. MQM scores are computed as 1 - penalty / word count. A single major terminology error in a 500-word article: 1 - 5/500 = 0.99. The same error in a 50-word paragraph: 1 - 5/50 = 0.90. The error is identical. The score is not. At article level, every real quality difference vanishes above 0.98.
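The compression effect is easy to reproduce. A minimal sketch of the score formula from the text, applied to the same single major error at both granularities:

```python
def mqm_score(weighted_penalty: float, word_count: int) -> float:
    """MQM score as defined in the study: max(0, 1 - penalty / word count)."""
    return max(0.0, 1.0 - weighted_penalty / word_count)

# One major terminology error (penalty 5), scored at two granularities.
article_score = mqm_score(5, 500)    # article level
paragraph_score = mqm_score(5, 50)   # paragraph level
```

Same error, a 0.09 score difference purely from the size of the scoring unit.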

This is not just a measurement problem for our study. It applies to every translation benchmark that evaluates at page or article level. The errors are there. The metric cannot see them.

We changed the lens#

For the second iteration, we made four changes.

First, we expanded the glossary from 37 to 72 terms per locale pair — extracted from a training set of articles, separate from the test set used for evaluation. Second, we scored at paragraph level (50-200 words), matching the actual unit of production translation. Third, we added human reference translations to the MQM scoring prompt so judges could compare terminology directly. Fourth, we reduced judges from six to four. Deepseek and QWEN flagged only 1-3 errors per paragraph versus 5-15 for stricter judges — too lenient to add signal.

The signal appeared immediately.

Study design#

Dataset. We wanted the most terminology-dense text type available to stress-test glossary injection under demanding conditions. The EU AI Act (Regulation 2024/1689) fit: formal regulatory text where every paragraph carries terms with specific, officially defined translations. EUR-Lex publishes official human translations in all five target languages, enabling paragraph-by-paragraph scoring against ground truth. 15 articles, English into German, French, Spanish, Portuguese, and Italian.

Engines. Each provider was tested in two localization-engine configurations: a raw engine (the LLM on its own — no glossary, no retrieval, translating from training knowledge alone) and a RAL-augmented engine (the same model, with a domain glossary, brand voice profile, and locale-specific instructions applied at inference time). Ten engines in total; every RAL-augmented engine shared the same glossary, brand voice, and instruction configuration.

| Provider  | Model                  | Raw engine | RAL engine                            |
|-----------|------------------------|------------|---------------------------------------|
| Anthropic | claude-opus-4.6        | model only | glossary + brand voice + instructions |
| OpenAI    | gpt-5.4                | model only | glossary + brand voice + instructions |
| Google    | gemini-3.1-pro-preview | model only | glossary + brand voice + instructions |
| Mistral   | mistral-large-2512     | model only | glossary + brand voice + instructions |
| Deepseek  | deepseek-v3.2          | model only | glossary + brand voice + instructions |

QWEN was initially included but dropped from the final set — translations were slow and unreliable, the same issue that disqualified it as a judge.

RAL configuration. Each augmented engine contained 72 glossary terms per locale pair (70 custom translations plus 2 non-translatables), a brand voice profile (formal EU regulatory register), and 13 locale-specific instructions. Glossary terms were extracted from a training set of articles separate from the test set used for evaluation. Example entries: EN "provider" → PT "prestador" (not "fornecedor"); EN "high-risk AI system" → PT "sistema de IA de risco elevado" (not "sistema de IA de alto risco"). At inference time, only terms matching the current paragraph are retrieved and passed to the model — glossary size does not bloat the context window. Engines were configured on Lingo.dev as stateful localization engines — persistent context applied to every request.
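The shape of a glossary entry and the "only matching terms are retrieved" behavior can be sketched as follows. The entries are modeled on the examples in the text; the naive substring match here stands in for the engine's embedding-based retrieval, described later:

```python
# Hypothetical glossary entries, modeled on the examples in the text.
GLOSSARY_EN_PT = [
    {"source": "high-risk AI system", "target": "sistema de IA de risco elevado",
     "avoid": "sistema de IA de alto risco"},
    {"source": "provider", "target": "prestador", "avoid": "fornecedor"},
]

def matching_terms(paragraph: str, glossary: list[dict]) -> list[dict]:
    """Return only the entries whose source term appears in the paragraph,
    so glossary size never bloats the context window."""
    text = paragraph.lower()
    return [e for e in glossary if e["source"].lower() in text]

paragraph = "The provider of a high-risk AI system shall keep logs."
hits = matching_terms(paragraph, GLOSSARY_EN_PT)
```

A paragraph with no glossary hits adds nothing to the prompt; a 72-term or 7,200-term glossary costs the same per request.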

Scoring. Each translated paragraph was scored by four LLM judges, averaged to smooth individual judge bias. Each judge scores all providers' outputs, not just its own:

| Judge     | Model              |
|-----------|--------------------|
| Anthropic | claude-sonnet-4.6  |
| OpenAI    | gpt-4.1            |
| Google    | gemini-2.5-flash   |
| Mistral   | mistral-large-2512 |

GEMBA-MQM. MQM (Multidimensional Quality Metrics) is a standard framework for translation quality evaluation — normally performed by trained human annotators. GEMBA-MQM, the WMT23 winning evaluation method, replaces human annotators with an LLM while following the same MQM protocol: the judge reads the translation and flags every error, assigning each a category and a severity.

Error categories: accuracy, fluency, style, terminology. Severity weights follow the official MQM standard: minor = 1, major = 5, critical = 25.

MQM score per paragraph: max(0, 1 - weighted penalty / word count). A 50-word paragraph with one major terminology error scores 1 - 5/50 = 0.90. A perfect paragraph scores 1.0. Error counts in the results tables are summed across all four judges and all paragraphs for a given provider and locale.
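Putting the severity weights and the score formula together, scoring one judge's annotations for a paragraph might look like this (the error list is illustrative, not taken from the study):

```python
# Official MQM severity weights cited in the text.
WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

def score_paragraph(errors: list[dict], word_count: int) -> float:
    """max(0, 1 - weighted penalty / word count), per the formula above."""
    penalty = sum(WEIGHTS[e["severity"]] for e in errors)
    return max(0.0, 1.0 - penalty / word_count)

# Illustrative judge annotations for a 50-word paragraph.
errors = [
    {"category": "terminology", "severity": "major"},  # e.g. "fornecedor" for "prestador"
    {"category": "fluency", "severity": "minor"},
]
score = score_paragraph(errors, 50)  # penalty 5 + 1 = 6
```

The `max(0, ...)` floor matters only for short paragraphs with critical errors, where the weighted penalty can exceed the word count.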

One change from the standard GEMBA-MQM prompt: we added the human reference translation. GEMBA-MQM is reference-free by design — the judge evaluates quality without seeing the "correct" answer. We added references because EUR-Lex publishes official translations of the EU AI Act in all five target languages, giving judges ground truth to compare terminology against.

GEMBA-DA. A holistic 0-1 quality score using the GEMBA-DA prompt (also WMT23 winning). Unlike MQM, it produces a single score with no error annotations. We include it as a sanity check — as the results show, it cannot detect terminology-level differences.

Deepseek was excluded from the judge panel due to overly lenient scoring (1-3 errors per paragraph vs 5-15 for stricter judges). Averaging across four judges smooths individual bias, and the relative raw-vs-RAL improvement is consistent within every judge.

Sample size. 535 paired paragraph observations per provider (107 paragraphs × 5 locales). Over 42,000 individual quality judgments total (535 paragraphs × 5 providers × 2 configurations × 8 scores each).

Terminology errors drop 16.6-44.6%#

| Provider  | Raw errors | RAL errors | Reduction |
|-----------|------------|------------|-----------|
| Mistral   | 3,336      | 1,847      | -44.6%    |
| Deepseek  | 3,672      | 2,127      | -42.1%    |
| OpenAI    | 2,276      | 1,508      | -33.7%    |
| Anthropic | 1,559      | 1,179      | -24.4%    |
| Google    | 1,901      | 1,586      | -16.6%    |

Terminology error counts from MQM across 15 articles, 5 locales, and 4 judges.

Improvement tracked inversely with baseline score. Mistral and Deepseek — with the highest raw error counts — saw 42.1-44.6% reductions. Anthropic and Google — which already reflected more EU legal terminology in training — saw smaller gains. The pattern: RAL compensates for what the model doesn't already know.

Meanwhile, GEMBA-DA (the holistic score) reported a delta of only 0.0007-0.0178 between raw and RAL across all providers. The same translations for which MQM counted 16.6-44.6% fewer terminology errors received nearly identical holistic scores. This is the measurement gap: holistic evaluation at any granularity cannot detect terminology-level quality differences.

Total errors (all MQM categories) showed a smaller but consistent reduction across all five providers:

| Provider  | Raw total | RAL total | Change |
|-----------|-----------|-----------|--------|
| Deepseek  | 10,423    | 9,014     | -13.5% |
| Mistral   | 8,846     | 7,812     | -11.7% |
| OpenAI    | 7,563     | 7,155     | -5.4%  |
| Google    | 7,793     | 7,545     | -3.2%  |
| Anthropic | 6,232     | 6,039     | -3.1%  |

The gap between terminology reduction (16.6-44.6%) and total reduction (3.1-13.5%) is largely explained by style. LLM judges tend to flag text as "awkward" when it diverges from their training-data preferences, even when the divergence moves toward the official reference — a known limitation called self-preference bias. Terminology and accuracy are anchored against the reference; style has no anchor beyond the judge's own sense of what sounds natural.

Statistical significance#

Terminology error reduction was tested per provider using a paired Wilcoxon signed-rank test (one-sided, Holm-Bonferroni corrected across five providers). Per-paragraph terminology error counts were summed across four judges, then paired by paragraph (same source, same judges, raw vs RAL).
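The correction step can be sketched without a statistics library. The per-paragraph test itself would come from something like `scipy.stats.wilcoxon(raw_counts, ral_counts, alternative="greater")`; the raw p-values below are hypothetical stand-ins for its output:

```python
def holm_bonferroni(pvals: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Holm-Bonferroni: sort p-values ascending, compare the i-th smallest
    against alpha / (m - i), and stop rejecting at the first failure."""
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    rejected, still_rejecting = {}, True
    for i, (name, p) in enumerate(ordered):
        if p > alpha / (m - i):
            still_rejecting = False
        rejected[name] = still_rejecting
    return rejected

# Hypothetical uncorrected p-values for the five providers.
raw_p = {"Mistral": 1e-8, "Deepseek": 3e-8, "OpenAI": 2e-6,
         "Anthropic": 4e-5, "Google": 9e-4}
decisions = holm_bonferroni(raw_p)
```

Holm's step-down procedure is uniformly more powerful than plain Bonferroni while controlling the same family-wise error rate, which is why it is the standard choice for a small set of provider-level comparisons.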

| Provider  | Paired paragraphs | Mean reduction/paragraph | 95% CI       | Cohen's d | p (adjusted) |
|-----------|-------------------|--------------------------|--------------|-----------|--------------|
| Mistral   | 532               | 2.80                     | [2.42, 3.21] | 0.60      | < 0.001      |
| Deepseek  | 526               | 2.94                     | [2.45, 3.44] | 0.50      | < 0.001      |
| OpenAI    | 535               | 1.44                     | [1.12, 1.77] | 0.37      | < 0.001      |
| Anthropic | 533               | 0.71                     | [0.50, 0.93] | 0.28      | < 0.001      |
| Google    | 533               | 0.59                     | [0.34, 0.85] | 0.20      | < 0.001      |

All five providers show statistically significant terminology error reductions (p < 0.001 after Holm-Bonferroni correction for multiple comparisons), with 95% confidence intervals excluding zero. Effect sizes range from medium-large (Mistral, d = 0.60) to small (Google, d = 0.20) — consistent with the pattern that models with lower baseline terminology coverage benefit more from RAL.

Where RAL matters most#

Portuguese showed the largest terminology improvements across all providers. Portuguese legal terminology diverges significantly from everyday Portuguese, and EU legal terms in Portuguese are underrepresented in LLM training data. French showed the smallest: French legal terms are well-represented in training corpora.

Case study: OpenAI Portuguese

OpenAI's raw output translated the EU AI Act into Portuguese using "alto risco" 71 times (the colloquial "high risk"), "fornecedores" 39 times, and "fornecedor" 36 times. The official EUR-Lex translations use "risco elevado" and "prestadores." With RAL, OpenAI Portuguese terminology errors dropped from 648 to 266 — a 59% reduction.

The pattern generalizes: locales whose domain terminology is further from the LLM's training distribution benefit more from RAL.

The mechanism#

The effective mechanism is straightforward. At inference time, the engine decomposes input text into n-gram phrases and embeds them. It then runs cosine similarity search against the glossary's vector index to find matching terms. Matched terms are injected into the LLM's context window alongside the source text. The model doesn't guess "fornecedor" or "prestador" — it sees the correct mapping in context and uses it. Structurally identical to RAG: embed, retrieve, inject, generate.
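The embed-retrieve-inject loop can be sketched end to end. The character-trigram "embedding" below is a toy stand-in for a real embedding model, the similarity threshold is hypothetical, and the prompt format is illustrative; only the structure of the loop mirrors the description above:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts. A production engine would use
    a learned embedding model; the retrieval logic is the same."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Glossary vector index, built once at configuration time.
GLOSSARY = {"provider": "prestador",
            "high-risk AI system": "sistema de IA de risco elevado"}
INDEX = {term: embed(term) for term in GLOSSARY}

def retrieve(paragraph: str, threshold: float = 0.5) -> dict[str, str]:
    """Decompose the input into n-gram phrases, embed them, and keep glossary
    terms whose vectors are close to some phrase."""
    words = paragraph.split()
    phrases = [" ".join(words[i:i + n]) for n in (1, 2, 3, 4)
               for i in range(len(words) - n + 1)]
    return {term: GLOSSARY[term] for term, vec in INDEX.items()
            if max(cosine(embed(p), vec) for p in phrases) >= threshold}

paragraph = "The provider shall register each high-risk AI system."
terms = retrieve(paragraph)
prompt = ("Translate to pt-PT using these terms:\n"
          + "\n".join(f"{s} -> {t}" for s, t in terms.items())
          + f"\n\n{paragraph}")
```

Because matching happens per request, the model only ever sees the handful of terms relevant to the current paragraph, not the whole glossary.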

Provider ranking by raw quality#

Without RAL (raw model output only):

| Rank | Provider  | MQM avg |
|------|-----------|---------|
| 1    | Anthropic | 0.955   |
| 2    | OpenAI    | 0.942   |
| 3    | Google    | 0.938   |
| 4    | Mistral   | 0.915   |
| 5    | Deepseek  | 0.883   |

The 0.072 gap between Anthropic and Deepseek represents roughly 3-4 additional errors per 100-word paragraph. RAL narrowed this gap: Mistral with RAL (0.940 avg) approached Google's raw quality (0.938). A model at a fraction of the per-token cost, augmented with a 72-term glossary, matched the overall quality of a more expensive model running without one.

What this means in production#

The quality gap between raw LLM output and production-ready localization is a context problem — and it compounds. After ten releases without RAL, three different wrong translations of "provider" coexist across the product.

RAL breaks this pattern. The glossary is persistent — it applies to every request, regardless of what changed. The 72-term glossary that reduced errors by 16.6-44.6% in our study is not a one-time improvement. It is a consistency layer across every translation request over the lifetime of the product.

Two findings for teams shipping LLM translations: first, holistic quality scores cannot detect terminology-level problems. GEMBA-DA — the WMT23 winning method — scored raw and RAL-augmented translations within 0.0007-0.0178 of each other. MQM counted 16.6-44.6% fewer terminology errors. If you evaluate at page level with a single score, you are not seeing the full picture.

Second, the fix is simpler than the problem suggests. A domain glossary injected at inference time reduced terminology errors across every provider we tested. The model that translates best (Anthropic, MQM 0.955) still improved. The model with the highest baseline error rate (Deepseek, MQM 0.883) improved most.

RAL is to localization what RAG is to generation: the engineering layer between the model and production.

Next steps#

  • Introducing Lingo.dev v1.0: the localization engineering platform built around RAL
  • Localization engines: configure models, glossaries, and brand voice per locale
