Scorers#
Scorers are automated quality checks that evaluate translations produced by your localization engine. After each translation request, selected scorers run an independent LLM evaluation against your custom criteria — producing pass/fail verdicts or percentage scores, automatically, without manual review.
How it works#
When the localization engine completes a translation request, it checks which scorers match the request's locale pair. Each matching scorer with a passing sampling check is queued for asynchronous evaluation — scoring never blocks the translation response.
The scoring LLM receives the source text, translated output, locale pair, and your custom instruction. It returns a structured result: either a boolean pass/fail or a percentage score, with reasoning for imperfect results.
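The structured result can be pictured as a small discriminated union. This is an illustrative sketch only; the field names (`kind`, `passed`, `score`, `reasoning`) are assumptions, not the documented API.

```typescript
// Illustrative shape of a scoring result (field names are assumptions,
// not the actual API).
type ScorerResult =
  | { kind: "boolean"; passed: boolean; reasoning: string | null }
  | { kind: "percentage"; score: number; reasoning: string | null }
  | { kind: "na"; reasoning: null };

// A failing boolean check carries a one-sentence explanation:
const example: ScorerResult = {
  kind: "boolean",
  passed: false,
  reasoning: "The <strong> tag was dropped from the translation.",
};
```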
Scorer types#
Boolean scorers#
Return a binary verdict: pass or fail. Use these for rules that are either met or not.
Examples:
- "Does the translation preserve all HTML tags and attributes?"
- "Are pluralization rules applied correctly for the target language?"
- "Does the translation use formal address (Sie) in German?"
Results are aggregated as pass rates — 75% means 3 out of 4 evaluated translations passed.
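The pass-rate aggregation described above can be sketched as a small helper (hypothetical, not part of the product):

```typescript
// Aggregate boolean verdicts into a pass rate:
// 3 passes out of 4 evaluations -> 75%.
function passRate(verdicts: boolean[]): number {
  if (verdicts.length === 0) return 0;
  const passed = verdicts.filter(Boolean).length;
  return Math.round((passed / verdicts.length) * 100);
}
```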
Percentage scorers#
Return a score from 0 to 100. Use these for quality dimensions that exist on a spectrum.
Examples:
- "Rate the naturalness of the translation for a native speaker (0–100)"
- "Score how well the translation preserves the original tone and intent (0–100)"
- "Evaluate grammatical correctness on a scale of 0–100"
Results are aggregated as averages across the evaluation period.
Scorer configuration#
| Field | Description |
|---|---|
| Name | A label identifying the scorer (e.g., "Pluralization check") |
| Instruction | The evaluation criteria, written in natural language |
| Type | boolean (pass/fail) or percentage (0–100) |
| Source locale | The source locale to match, or * for any |
| Target locale | The target locale to match, or * for any |
| Provider / Model | The LLM used for evaluation (independent of the translation model) |
| Sampling | Percentage of requests to evaluate (0–100%) |
| Allow N/A | Whether the scorer can return "not applicable" for irrelevant pairs |
| Enabled | Toggle scoring on or off without deleting the configuration |
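Putting the fields together, a scorer configuration might look like the following. The property names mirror the table but are illustrative; the exact keys in the product may differ.

```typescript
// Hypothetical scorer configuration assembled from the fields above.
// Property names are illustrative, not the exact API.
const pluralizationScorer = {
  name: "Pluralization check",
  instruction:
    "Are pluralization rules applied correctly for the target language?",
  type: "boolean",
  sourceLocale: "en",
  targetLocale: "*",       // match any target locale
  provider: "anthropic",
  model: "claude-sonnet",  // evaluation model, independent of the translation model
  sampling: 25,            // evaluate 25% of matching requests
  allowsNA: true,
  enabled: true,
};
```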
Writing scorer instructions#
The instruction field is the core of a scorer. It tells the evaluation LLM exactly what to check. Write it as a specific, testable criterion.
Good instructions#
Boolean:
Check whether all HTML tags in the source text are preserved
exactly in the translation. Tags must not be added, removed,
modified, or reordered. Pass if all tags are preserved, fail
if any tag is missing or altered.
Percentage:

Rate the fluency of the translation on a scale of 0-100.
100 means a native speaker would find it completely natural.
0 means it reads like machine output. Deduct points for
awkward phrasing, unnatural word order, or overly literal
constructions.
What makes a good instruction#
- Specific criteria — define exactly what pass/fail means, or what 0 and 100 represent
- Observable outcomes — the LLM should be able to evaluate by reading the text, not guessing intent
- One concern per scorer — split multi-dimensional quality checks into separate scorers
Locale matching#
Scorers match translation requests by source and target locale. Wildcard * matches any locale.
| Source locale | Target locale | Matches |
|---|---|---|
| en | de | Only English → German translations |
| en | * | Any translation from English |
| * | ja | Any translation into Japanese |
| * | * | All translations |
A single translation request can trigger multiple scorers if several match its locale pair.
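The wildcard matching rule is simple enough to sketch directly (a hypothetical helper for illustration):

```typescript
// Wildcard locale matching as described above: "*" matches any locale.
// Both the source and target patterns must match for the scorer to run.
function matchesLocalePair(
  scorer: { sourceLocale: string; targetLocale: string },
  request: { sourceLocale: string; targetLocale: string },
): boolean {
  const ok = (pattern: string, locale: string) =>
    pattern === "*" || pattern === locale;
  return (
    ok(scorer.sourceLocale, request.sourceLocale) &&
    ok(scorer.targetLocale, request.targetLocale)
  );
}
```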
Sampling#
Not every translation needs to be scored. The sampling rate controls what percentage of matching requests get evaluated.
| Sampling | Behavior |
|---|---|
| 100% | Every matching request is scored (thorough but higher cost) |
| 50% | Roughly half of matching requests are scored |
| 10% | One in ten — useful for high-volume engines where trends matter more than individual scores |
| 0% | Scorer is effectively paused without disabling it |
Sampling is applied at request time using a random check. Over a sufficient volume of requests, the actual evaluation rate converges to the configured percentage.
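The per-request random check amounts to a single comparison, sketched here for illustration:

```typescript
// Per-request sampling: score the request when a uniform random draw
// in [0, 100) falls below the configured rate.
function shouldScore(samplingPercent: number): boolean {
  return Math.random() * 100 < samplingPercent;
}
```

At 100% every request passes the check, at 0% none do, and intermediate rates converge to the configured percentage over volume.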
N/A support#
When allowsNA is enabled, the scoring LLM can return "not applicable" instead of a score. This is useful for scorers whose criteria don't apply to every locale pair.
Example: A scorer checking formal address conventions returns N/A for English → English translations (English has no formal/informal distinction), but returns a score for English → German.
N/A results are excluded from averages and pass rates in reporting — they don't pull scores down or inflate them.
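The exclusion rule can be sketched as an averaging helper where `null` stands for an N/A result (a hypothetical example, not product code):

```typescript
// Average percentage scores, excluding N/A results (null) so they
// neither raise nor lower the aggregate.
function averageScore(results: (number | null)[]): number | null {
  const scored = results.filter((r): r is number => r !== null);
  if (scored.length === 0) return null;
  return scored.reduce((a, b) => a + b, 0) / scored.length;
}
```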
Reasoning#
Scorers provide reasoning for imperfect results to help you understand what went wrong:
- Perfect score (pass or 100%) — reasoning is null (nothing to explain)
- N/A — reasoning is null
- Imperfect score — a brief one-sentence explanation
This keeps the scoring results actionable: when a translation fails a check, the reasoning tells you why without manual investigation.
Scoring model#
Each scorer has its own LLM provider and model configuration, independent of the translation model. This separation is intentional — the model that produces the translation should not be the same model that evaluates it.
Model independence
Using a different model for scoring than for translation provides an independent assessment. If GPT-4o produces the translation, evaluating with Claude Sonnet gives you a second opinion rather than self-assessment.
Scorer reports#
Scoring results are visualized in the dashboard under the scorer reports section, showing:
- Pass rates over time — for boolean scorers, plotted as daily percentages
- Average scores over time — for percentage scorers, plotted as daily averages
- Per-locale-pair breakdown — see how each source → target pair performs independently
- Aggregate view — combine all locale pairs into a single trend line
Scorer reports complement the volume-focused Reports — together they give you a complete picture of both throughput and quality.
Managing scorers via MCP#
If you use the Lingo.dev MCP server, your AI coding assistant can create and configure scorers directly:
"Create a boolean scorer for all locale pairs that checks
whether HTML tags are preserved in translations."
"Add a percentage scorer for English to German that rates
translation fluency on a 0-100 scale, sampling 50% of requests."