Every RAG-based localization pipeline has the same blind spot

Veronica Prilutskaya, CPO & Co-Founder · Published in about 12 hours · 10 min read

If a localization pipeline uses retrieval augmented generation to inject glossary terms into the model's context window, it has a retrieval recall problem that has never been measured.

The pattern is universal: embed the input text, cosine-search a term bank, inject top-k results into the prompt. The output is grammatically correct. The terminology is wrong. The error is invisible unless someone speaks both languages and knows the glossary.

We built this naive version first. Then we measured retrieval recall against production glossaries – and it turned out the system was missing the majority of applicable terms on real payloads.

Technique: Retrieval augmented localization (RAL) – context enrichment at inference time
Core fix: N-gram decomposition before embedding, not sentence-level embedding
Retrieval modes: 3 (skip / preload / vector search), selected per-request by glossary cardinality
Threshold calibration: Continuous, weekly, against per-locale-pair quality scores
Terminology error reduction: 17–45% across five LLM providers (controlled study, 42,000+ quality judgments)
Scoring: Independent cross-model evaluation, asynchronous, per-request

Why do sentence embeddings miss glossary terms?

A glossary term is 1–3 words. "Localization engine." "Access token." "Deployment pipeline."

Input text is a JSON object with values ranging from two words (a button label) to two hundred words (a product description). When the full string "Configure the localization engine for production deployment" is embedded, the resulting vector captures the semantic meaning of the sentence – something about configuration and production systems. The glossary-relevant phrase "localization engine" dissolves into the sentence-level representation.

Cosine similarity between that sentence vector and the glossary entry "localization engine" lands in the 0.6–0.7 range. Below retrieval threshold. The term exists in the input. The retrieval system misses it.

The issue is granularity: sentence-level representations querying phrase-level targets. The embedding model faithfully represents the meaning of the sentence as a whole. Constituent terminology occupies no independent region of the vector space.
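
The dilution effect can be illustrated with a deliberately crude stand-in for an embedding model. The sketch below assumes a bag-of-words count vector (not a learned dense embedding, and not what the pipeline uses); `bow_vector` and `cosine` are hypothetical helper names. Real models behave differently in detail, but the direction of the effect is the same: the more unrelated words share the vector, the lower the similarity to a short phrase.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (an assumption for illustration)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sentence = "Configure the localization engine for production deployment"
term = "localization engine"

# Whole sentence vs. glossary term: the two shared words are diluted by five others.
sentence_score = cosine(bow_vector(sentence), bow_vector(term))

# Phrase-level query vs. glossary term: nothing dilutes the match.
phrase_score = cosine(bow_vector("localization engine"), bow_vector(term))
```

In this toy representation the sentence scores 2/√14 ≈ 0.53 against the term, while the standalone phrase scores 1.0.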

We found this out the hard way. On production payloads – nested JSON objects with 20–50 keys, values of varying length – sentence-level retrieval was missing the majority of applicable glossary terms. The localization request completed fine. The output read fluently. But "localization engine" was becoming "translation tool" – grammatically valid, semantically adjacent, terminologically wrong. And the pipeline reported success.

How does n-gram decomposition fix glossary retrieval?

The fix turned out to be decomposing input into phrase-level units before embedding. Every string value becomes a set of overlapping n-gram windows:

```text
Input: "Configure the localization engine for production"

1-grams: [configure, the, localization, engine, for, production]
2-grams: [configure the, the localization, localization engine,
          engine for, for production]
3-grams: [configure the localization, the localization engine,
          localization engine for, engine for production]
```

Each n-gram becomes an independent retrieval query. "Localization engine" queries the glossary as a standalone phrase – and finds its match at high similarity.
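
A minimal sketch of the windowing step, assuming whitespace tokenization and lowercasing (the post does not specify the tokenizer); `ngram_windows` is a hypothetical name.

```python
def ngram_windows(sentence: str, max_n: int = 3) -> dict[int, list[str]]:
    """Return overlapping n-gram windows for n = 1..max_n, keyed by n."""
    tokens = sentence.lower().split()
    return {
        n: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(1, max_n + 1)
    }


windows = ngram_windows("Configure the localization engine for production")
# windows[2] contains "localization engine" as a standalone query.
```

For the six-word input this produces 6 one-grams, 5 two-grams, and 4 three-grams, matching the listing above.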

The decomposition pipeline:

  1. Recursively extract all string values from nested JSON structures
  2. Split into sentences, strip HTML and markup annotations
  3. Normalize whitespace, remove enclosing quotes, unescape formatting
  4. Generate overlapping 1-gram, 2-gram, and 3-gram phrases from each sentence
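
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production implementation: the regex-based tag stripping and sentence splitting, the punctuation trimming, and the deduplication into a set are all simplifications chosen for this example, and `extract_strings`, `clean_sentences`, and `phrases` are hypothetical names.

```python
import html
import re
from typing import Any, Iterator

TAG_RE = re.compile(r"<[^>]+>")               # naive markup stripper (assumption)
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")    # naive sentence splitter (assumption)


def extract_strings(node: Any) -> Iterator[str]:
    """Step 1: walk nested dicts/lists and yield every string value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from extract_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from extract_strings(item)


def clean_sentences(text: str) -> list[str]:
    """Steps 2-3: strip tags, unescape entities, split, normalize whitespace and quotes."""
    text = html.unescape(TAG_RE.sub(" ", text))
    sentences = []
    for raw in SENTENCE_RE.split(text):
        sentence = " ".join(raw.split()).strip("\"'")
        if sentence:
            sentences.append(sentence)
    return sentences


def phrases(payload: Any, max_n: int = 3) -> set[str]:
    """Step 4: overlapping 1..max_n-gram phrases, deduplicated across the payload."""
    out: set[str] = set()
    for text in extract_strings(payload):
        for sentence in clean_sentences(text):
            tokens = [t.strip(".,;:!?()").lower() for t in sentence.split()]
            tokens = [t for t in tokens if t]
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    out.add(" ".join(tokens[i:i + n]))
    return out


payload = {
    "cta": {"label": "Get <b>access token</b>"},
    "steps": ["Configure the localization engine. Deploy &amp; verify."],
}
result = phrases(payload)
```

Because n-grams are generated per sentence, no phrase spans a sentence boundary: "localization engine" and "access token" are produced, but "engine deploy" is not.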

A 50-word paragraph yields approximately 150 n-grams (a single 50-word sentence gives 50 + 49 + 48 = 147). A typical API payload with 20 keys yields 1,000–3,000 searchable phrases. Each phrase is embedded independently, and each embedding runs a nearest-neighbor query against the glossary's vector index.

We measured the difference on the same production payloads that exposed the original problem. Glossary terms now match regardless of the sentence context surrounding them – a 2-word term buried in a 200-word product description retrieves with the same recall as a standalone label.

How does adaptive retrieval work for different glossary sizes?

N-gram decomposition with batch embedding is the correct approach for large glossaries. For small ones, it turned out to be computationally wasteful.

A localization engine configured with 8 glossary terms resolves faster with direct injection – one database query, deterministic, sub-millisecond. A localization engine with 2,000 terms requires vector search – context window limits and relevance dilution make full injection impossible.

Three retrieval modes operate per-request, selected based on glossary cardinality for the locale pair:

| Mode | Condition | Behavior |
| --- | --- | --- |
| Skip | Zero matching items | No embedding, no search, no injection |
| Preload | Below cardinality threshold | Single database query loads all matching items; direct injection |
| Search | Above cardinality threshold | Full n-gram decomposition → batch embedding → vector nearest-neighbor search |
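
The mode selection amounts to a small dispatch on glossary size. A minimal sketch, assuming a hypothetical `select_mode` function; the threshold value of 100 used below is made up for illustration (the post does not publish the real one), and treating a glossary exactly at the threshold as "search" is also an assumption.

```python
from enum import Enum


class RetrievalMode(Enum):
    SKIP = "skip"
    PRELOAD = "preload"
    SEARCH = "search"


def select_mode(glossary_size: int, cardinality_threshold: int) -> RetrievalMode:
    """Pick a retrieval mode from the number of glossary items for the locale pair."""
    if glossary_size == 0:
        return RetrievalMode.SKIP       # nothing to retrieve or inject
    if glossary_size < cardinality_threshold:
        return RetrievalMode.PRELOAD    # load everything, inject directly
    return RetrievalMode.SEARCH         # n-gram decomposition + vector search


HYPOTHETICAL_THRESHOLD = 100  # illustrative only, not the production value

small = select_mode(8, HYPOTHETICAL_THRESHOLD)
large = select_mode(2000, HYPOTHETICAL_THRESHOLD)
empty = select_mode(0, HYPOTHETICAL_THRESHOLD)
```

Keeping the threshold as a parameter rather than a constant reflects the point made below: it is expected to move as telemetry changes.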

The cardinality threshold that separates preload from search is derived from latency profiling across production traffic and adjusted as embedding model performance, glossary size distributions, and infrastructure characteristics shift. The initial value we shipped lasted approximately three weeks before telemetry indicated it should move. It has been adjusted multiple times since – we discovered that the optimal threshold drifts as engines accumulate glossary terms and embedding model characteristics evolve between provider updates.

Retrieval latency scales with glossary complexity, not payload size. A localization engine with 10 terms resolves in single-digit milliseconds regardless of input length. A localization engine with 500 terms uses the full decomposition pipeline but resolves within the latency budget of a durable background workflow step.

How is the similarity threshold calibrated for glossary retrieval?

Each n-gram embedding queries the vector index for nearest neighbors above a similarity threshold. Matches below the threshold are discarded as noise.

The threshold determines retrieval precision and recall simultaneously:

  • Too permissive: unrelated terms leak into the prompt. The model sees glossary context that does not apply to the input and occasionally follows it – producing output that uses terminology from an unrelated domain.
  • Too strict: legitimate variant phrasings and morphological forms get excluded. "Deploying" fails to match the glossary entry for "deploy." Recall drops.
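
The trade-off can be made concrete with a small filter over candidate matches. This is a sketch only: the `Match` type, the `filter_matches` helper, and every similarity score below are invented for illustration and are not measurements from the system.

```python
from typing import NamedTuple


class Match(NamedTuple):
    ngram: str         # query phrase from the input
    term: str          # glossary entry returned by nearest-neighbor search
    similarity: float  # cosine similarity (illustrative values below)


def filter_matches(candidates: list[Match], threshold: float) -> list[Match]:
    """Keep the best-scoring candidate per glossary term if it clears the threshold."""
    best: dict[str, Match] = {}
    for m in candidates:
        if m.similarity < threshold:
            continue  # discarded as noise
        if m.term not in best or m.similarity > best[m.term].similarity:
            best[m.term] = m
    return sorted(best.values(), key=lambda m: m.similarity, reverse=True)


candidates = [
    Match("localization engine", "localization engine", 0.98),
    Match("deploying", "deploy", 0.81),             # legitimate morphological variant
    Match("engine for", "search engine", 0.62),     # unrelated term
]

strict = filter_matches(candidates, 0.90)    # drops the "deploy" variant
loose = filter_matches(candidates, 0.60)     # lets "search engine" leak in
balanced = filter_matches(candidates, 0.75)  # keeps both legitimate matches
```

With these made-up scores, the strict setting loses recall, the loose setting loses precision, and only the middle setting returns exactly the two applicable terms.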

We found that the right threshold varies by locale pair. English→German retrieval has different similarity distributions than English→Japanese, where morphological distance between source n-grams and glossary entries differs structurally. A single global threshold was producing inconsistent recall across the locale pairs we measured.

The threshold is now calibrated continuously against per-locale-pair quality scores from an independent scoring pipeline. When the scoring system detects an increase in glossary non-adherence (terms present in input but absent from output), retrieval recall has degraded and the threshold is loosened. When scoring detects the model applying irrelevant terminology, false-positive injection has increased and the threshold is tightened.

This calibration runs weekly. It has to – embedding model behavior shifts between provider updates, glossary distributions change as teams add terms, and input text characteristics evolve as products grow.
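
One way to picture a calibration pass is a bounded step per locale pair, driven by the two scoring signals described above. The sketch assumes hypothetical names (`LocalePairScores`, `recalibrate`) and made-up step size, tolerance, bounds, and starting thresholds; the post does not describe the actual update rule.

```python
from dataclasses import dataclass


@dataclass
class LocalePairScores:
    non_adherence_delta: float    # week-over-week change: terms in input but missing from output
    irrelevant_term_delta: float  # week-over-week change: irrelevant terminology applied


def recalibrate(threshold: float, scores: LocalePairScores,
                step: float = 0.02, tolerance: float = 0.01,
                lo: float = 0.50, hi: float = 0.95) -> float:
    """Nudge the similarity threshold one step, favoring the stronger signal."""
    if (scores.non_adherence_delta > tolerance
            and scores.non_adherence_delta >= scores.irrelevant_term_delta):
        threshold -= step  # recall has degraded: loosen
    elif scores.irrelevant_term_delta > tolerance:
        threshold += step  # false-positive injection has increased: tighten
    return round(min(hi, max(lo, threshold)), 4)


# Hypothetical per-locale-pair state and one week of scoring deltas.
thresholds = {("en", "de"): 0.80, ("en", "ja"): 0.70}
weekly = {
    ("en", "de"): LocalePairScores(non_adherence_delta=0.04, irrelevant_term_delta=0.00),
    ("en", "ja"): LocalePairScores(non_adherence_delta=0.00, irrelevant_term_delta=0.03),
}
updated = {pair: recalibrate(t, weekly[pair]) for pair, t in thresholds.items()}
```

In this example the English→German threshold is loosened and the English→Japanese threshold is tightened, each independently of the other.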

How are retrieved glossary terms injected into the localization model?

Retrieved glossary items split into two constraint classes with different enforcement behavior in the model's system prompt:

Non-translatable terms – source-language strings that must appear unchanged in the target output. Brand names, technical identifiers, product names. The model preserves these verbatim.

Custom translations – source→target mappings that override the model's own judgment. "Localization engine" must become "moteur de localisation." The model treats these as non-negotiable lexical constraints.

Both classes are injected into the system prompt as rules with explicit precedence over the model's default behavior. The prompt hierarchy enforces glossary compliance above the model's linguistic preferences.
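
A rough sketch of how the two classes might be rendered into a system prompt. The wording of the rules, the `GlossaryRules` type, and the `build_glossary_prompt` function are assumptions for illustration; the post does not show the actual prompt.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryRules:
    non_translatable: list[str] = field(default_factory=list)
    custom_translations: dict[str, str] = field(default_factory=dict)


def build_glossary_prompt(rules: GlossaryRules, target_locale: str) -> str:
    """Render both constraint classes as explicit, separately labeled rules."""
    lines = [
        f"You are localizing content into {target_locale}.",
        "The glossary rules below take precedence over your default wording.",
    ]
    if rules.non_translatable:
        lines.append("Keep these terms exactly as written in the source:")
        lines += [f'- "{term}"' for term in rules.non_translatable]
    if rules.custom_translations:
        lines.append("Always use these exact translations:")
        lines += [f'- "{src}" -> "{tgt}"'
                  for src, tgt in rules.custom_translations.items()]
    return "\n".join(lines)


prompt = build_glossary_prompt(
    GlossaryRules(
        non_translatable=["Lingo.dev"],
        custom_translations={"localization engine": "moteur de localisation"},
    ),
    target_locale="fr",
)
```

Keeping the two classes under separate headings in the prompt mirrors the separate checks applied at scoring time.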

The distinction matters at scoring time: the independent scoring model checks whether non-translatables were preserved unchanged and whether custom translations were applied exactly. Two verification criteria for two constraint types. We discovered early that conflating them into a single "glossary" category made scoring unreliable – a term preserved verbatim when it should have been translated (or vice versa) would score as correct under a unified check.
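
The failure mode of a unified check is easy to reproduce with plain string matching. The sketch below is a deterministic stand-in, not the scoring model described here; the function names and the example French output are hypothetical.

```python
def check_non_translatables(source: str, output: str, terms: list[str]) -> dict[str, bool]:
    """A term present in the source passes only if it appears unchanged in the output."""
    return {t: t in output for t in terms if t in source}


def check_custom_translations(source: str, output: str,
                              mapping: dict[str, str]) -> dict[str, bool]:
    """A mapped source term passes only if its required target appears in the output."""
    return {
        src: tgt.lower() in output.lower()
        for src, tgt in mapping.items()
        if src.lower() in source.lower()
    }


source = "Configure the localization engine in Lingo.dev"
# The model wrongly left "localization engine" untranslated.
output = "Configurez le localization engine dans Lingo.dev"

preserved = check_non_translatables(source, output, ["Lingo.dev"])
translated = check_custom_translations(
    source, output, {"localization engine": "moteur de localisation"}
)

# A unified "does the glossary term appear in the output?" check would pass both.
unified = all(t in output for t in ["Lingo.dev", "localization engine"])
```

Here the unified check reports success, while the split checks correctly flag that the custom translation was not applied.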

How do you validate localization quality in languages you don't speak?

The entire retrieval and localization pipeline can execute without error and produce terminologically incorrect output. A missed glossary term produces no error signal. A misapplied custom translation returns a 200. The pipeline succeeds. The output is wrong.

This is the localization observability gap that most teams never close.

Retrieval is coupled with independent asynchronous scoring. After a localization request completes, separate scoring models evaluate the output against the localization engine's configuration:

  • Glossary adherence – were non-translatable terms preserved? Were custom translations applied exactly?
  • Instruction adherence – were locale-specific rules followed?
  • Custom scoring criteria – per-engine quality dimensions defined by the localization team

The scoring models run on different infrastructure than the localization model. They operate asynchronously in background workflows, triggered after every request that passes through a localization engine with scoring enabled. One model localizes; a different model scores. Cross-model evaluation removes the self-grading problem.

Scoring results feed back into retrieval calibration:

  1. Scoring detects glossary non-adherence trending upward for a locale pair
  2. Investigation reveals retrieval recall has dropped – the threshold has drifted relative to the current glossary distribution
  3. Threshold is adjusted; recall recovers; adherence scores stabilize
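
Step 1 of this loop needs some definition of "trending upward". A minimal sketch, assuming per-request scores are aggregated into a weekly non-adherence rate per locale pair and compared across two adjacent windows; the window length, the minimum increase, the sample rates, and the `pairs_trending_up` name are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def pairs_trending_up(records, window: int = 3, min_increase: float = 0.02) -> list[str]:
    """Flag locale pairs whose recent mean non-adherence rose versus the prior window.

    records: iterable of (locale_pair, week_index, non_adherence_rate).
    """
    by_pair = defaultdict(list)
    for pair, _week, rate in sorted(records, key=lambda r: r[1]):
        by_pair[pair].append(rate)

    flagged = []
    for pair, rates in by_pair.items():
        if len(rates) < 2 * window:
            continue  # not enough history to compare two windows
        previous, recent = rates[-2 * window:-window], rates[-window:]
        if mean(recent) - mean(previous) >= min_increase:
            flagged.append(pair)
    return flagged


# Made-up weekly rates for two locale pairs.
records = [
    ("en-de", 1, 0.03), ("en-de", 2, 0.03), ("en-de", 3, 0.04),
    ("en-de", 4, 0.06), ("en-de", 5, 0.07), ("en-de", 6, 0.08),
    ("en-fr", 1, 0.05), ("en-fr", 2, 0.04), ("en-fr", 3, 0.05),
    ("en-fr", 4, 0.05), ("en-fr", 5, 0.04), ("en-fr", 6, 0.05),
]
flagged = pairs_trending_up(records)
```

With these sample rates only "en-de" is flagged, which would be the cue to investigate retrieval recall for that pair.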

The loop is what makes the system self-correcting. Scoring creates the observability that retrieval alone lacks. Without it, teams are shipping localized content into languages they do not speak, with no signal on whether the glossary they built is actually being applied.

Why does retrieval recall compound over time?

Every localization request that correctly applies glossary terms reinforces terminology consistency across the product. Every request that misses a term introduces drift – one surface says "localization engine," another says "localization tool," a third says "localization module." Across 30 locales and weekly releases, these inconsistencies compound.

The difference between high and low retrieval recall is not a per-request quality delta. It is a compounding consistency mechanism. High recall means the glossary is enforced uniformly across every surface, every locale, every release. Low recall means the glossary only occasionally fires – structurally equivalent to having no glossary, just slower to degrade.
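
A back-of-the-envelope model shows why small recall differences matter at scale. It assumes, purely for illustration, that each rendering of a term (say, one string across 30 locales) is an independent event in which the glossary term is applied with probability equal to retrieval recall. Real misses are likely correlated, and the recall values below are hypothetical.

```python
def all_surfaces_consistent(recall: float, surfaces: int) -> float:
    """Probability that every one of `surfaces` independent renderings uses the glossary term."""
    return recall ** surfaces


high = all_surfaces_consistent(0.99, 30)  # roughly 0.74
low = all_surfaces_consistent(0.90, 30)   # roughly 0.04
```

Under this simplified model, a 9-point drop in per-request recall takes the chance of full consistency across 30 renderings from about three in four to about one in twenty-five.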

What this means for localization engineering

The retrieval problem described here is not specific to one implementation. It is structural to any system that attempts glossary-aware localization using embedding-based search. The granularity mismatch between sentence-level input representations and phrase-level glossary targets exists regardless of which embedding model, which vector database, or which LLM generates the output.

Teams building localization automation face a choice: accept sentence-level retrieval with its invisible recall gap, or build the decomposition and calibration infrastructure that closes it. The second path requires three systems – n-gram decomposition, adaptive retrieval, and a scoring loop that feeds back into threshold management. Each system has its own operational cadence: decomposition logic evolves as input formats change, retrieval thresholds shift as glossaries grow, and scoring criteria are refined as localization teams learn what dimensions matter for their content.

Retrieval augmented localization at production quality is an ongoing engineering practice – a system that is built, instrumented, observed, and tuned continuously. The localization engineering discipline emerging around this work reflects the operational reality: localization infrastructure requires the same continuous attention that backend services, CI/CD pipelines, and observability stacks demand.


Next steps

  • RAL research – Controlled study: 42,000+ quality judgments, 17–45% terminology error reduction
  • Localization engines – Configure glossary, brand voice, model chains, and AI reviewers
  • The Localization API – The async API that runs this pipeline behind a single POST

FAQ

What is retrieval augmented localization (RAL)?
Retrieval augmented localization enriches each localization request with glossary terms, brand voice rules, and locale-specific instructions at inference time – the same retrieve-inject pattern behind RAG, applied to localization. In a controlled study across five LLM providers and five European languages, RAL reduced terminology errors by 17–45% compared to the same models without context enrichment.

Why does sentence-level embedding miss glossary terms?
Glossary terms are typically 1–3 words. When embedded as part of a full sentence, they dissolve into the sentence-level semantic vector. The embedding captures the meaning of the sentence as a whole – "localization engine" inside "Configure the localization engine for production" does not independently register. Cosine similarity between the sentence vector and the glossary entry falls below the retrieval threshold.

How does n-gram decomposition improve retrieval recall?
Instead of embedding full input strings, the system decomposes text into overlapping 1-gram, 2-gram, and 3-gram phrases before embedding. Each phrase becomes an independent retrieval query. A 2-word glossary term buried in a 200-word paragraph matches at the same recall as a standalone label – because it is queried independently of its surrounding context.

How many retrieval modes does the system use?
Three. Skip (zero glossary items – no retrieval needed), preload (below a cardinality threshold – load all items directly), and vector search (above threshold – full n-gram decomposition and embedding). The mode is selected per-request based on glossary cardinality for the specific locale pair.

How is the similarity threshold maintained?
The threshold is calibrated weekly against per-locale-pair quality scores from an independent scoring pipeline. When glossary non-adherence trends upward, the threshold is loosened to improve recall. When irrelevant terms leak into prompts, the threshold is tightened. Different locale pairs require different thresholds due to varying morphological distances.

How does cross-model scoring work for localization quality?
After each localization request completes, a separate model – running on different infrastructure – evaluates whether glossary terms were correctly applied, whether locale-specific instructions were followed, and whether custom quality criteria were met. One model localizes; a different model scores. This removes self-grading bias and creates the observability that retrieval alone lacks.

What happens when glossary retrieval recall is low?
Low retrieval recall means the glossary fires inconsistently – one surface gets the correct term, another does not. Across 30+ locales and weekly releases, these inconsistencies compound into terminology drift. The glossary exists but is not enforced. Over months, this is structurally equivalent to having no glossary.

What is the localization observability gap?
A localization pipeline can execute without error and produce terminologically incorrect output. Missed glossary terms produce no error signal – the API returns 200, the translation is grammatically valid. The observability gap is the space between "pipeline succeeded" and "terminology is correct." Independent scoring closes this gap by measuring glossary adherence on every request.

© 2026 Lingo.dev (Replexica, Inc).