Debug Localization Quality

When a localization comes out wrong, the MCP server gives your AI assistant access to the full observability stack - request logs, scorer verdicts, glossary matching reports, and instruction review results. Debug quality without leaving the conversation.

Request logs#

Every localization request produces a log entry with the full execution context: which model handled it, input and output tokens, duration, whether a fallback was triggered, and the complete input/output data.

"Show me the last request log for the German engine"

The assistant retrieves the log and can answer follow-up questions: "Did it use the fallback model?" "How many tokens did it consume?" "What was the raw output?"

What each log contains#

Field	What it tells you
Provider / model	Which LLM handled the request
Input / output data	Exact input sent and localization received
Input / output tokens	Token consumption
Duration	Processing time in milliseconds
Used fallback	Whether the primary model failed and fallback kicked in
Status	`success`, `error`, or `in_progress`
Error text	Error detail when status is `error`
Trigger type	Whether the request came from API, CLI, CI, playground, or integration

AI Reviewer verdicts#

Each request log links to scorer run logs - the independent AI Reviewer evaluations that ran after the localization was produced.

"Did the last German localization pass all scorers?"

The assistant retrieves scorer run logs for a given request and reports each scorer's verdict: pass/fail (boolean scorers) or percentage score, along with the reasoning the reviewer produced.

Scorer run log fields#

Field	What it tells you
Scorer name	Which AI Reviewer ran
Scorer type	`boolean` (pass/fail) or `percentage` (0-100)
Score result	The verdict and reasoning
Provider / model	Which model performed the review
Duration	How long the review took

Glossary compliance#

"Were all glossary terms applied correctly in that localization?"

The assistant retrieves the glossary review log for a request, showing each matched glossary term, whether it was applied, and the reasoning if it wasn't.

The report includes:

Each source term matched
The expected target localization
Whether the term is a custom localization or non-translatable
Applied or not applied per term
Reasoning when a term wasn't applied
Overall compliance rate

Instruction adherence#

"Did the French localization follow the non-breaking space instruction?"

The assistant retrieves instruction review logs - one entry per instruction that was evaluated against the localization output. Each shows the instruction name, the rule text, and a pass/fail verdict with reasoning.

The debugging workflow#

A typical post-mortem conversation:

"The German localization of 'checkout flow' looks wrong"
"Show me the request log for that" - see what went in and came out
"Did the glossary apply?" - check if 'checkout' was matched and preserved
"What did the scorers say?" - see if any AI Reviewer flagged it
"The glossary term wasn't matched - update it to also cover 'checkout flow'" - fix the root cause

The entire loop happens in one conversation, without opening the dashboard.