Debug Localization Quality

Max Prilutskiy · Updated 1 day ago · 2 min read

When a localization comes out wrong, the MCP server gives your AI assistant access to the full observability stack: request logs, scorer verdicts, glossary matching reports, and instruction review results. You can debug quality without leaving the conversation.

Request logs

Every localization request produces a log entry with the full execution context: which model handled it, input and output tokens, duration, whether a fallback was triggered, and the complete input/output data.

"Show me the last request log for the German engine"

The assistant retrieves the log and can answer follow-up questions: "Did it use the fallback model?" "How many tokens did it consume?" "What was the raw output?"

What each log contains

| Field | What it tells you |
| --- | --- |
| Provider / model | Which LLM handled the request |
| Input / output data | Exact input sent and localization received |
| Input / output tokens | Token consumption |
| Duration | Processing time in milliseconds |
| Used fallback | Whether the primary model failed and fallback kicked in |
| Status | success, error, or in_progress |
| Error text | Error detail when status is error |
| Trigger type | Whether the request came from API, CLI, CI, playground, or integration |
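
To make the fields above concrete, here is a minimal sketch of how one log entry could be modeled and summarized. It is an illustration only: the `RequestLog` class, its attribute names, and the `summarize` helper are hypothetical and are not the MCP server's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestLog:
    """Hypothetical shape of one request log entry (all names are assumptions)."""
    provider: str
    model: str
    input_data: dict
    output_data: dict
    input_tokens: int
    output_tokens: int
    duration_ms: int
    used_fallback: bool
    status: str          # "success", "error", or "in_progress"
    trigger_type: str    # e.g. "api", "cli", "ci", "playground", "integration"
    error_text: Optional[str] = None


def summarize(log: RequestLog) -> str:
    """Answer the usual follow-up questions in a single line."""
    parts = [
        f"{log.provider}/{log.model}",
        f"status={log.status}",
        f"tokens={log.input_tokens + log.output_tokens}",
        f"duration={log.duration_ms}ms",
    ]
    if log.used_fallback:
        parts.append("fallback used")
    if log.status == "error" and log.error_text:
        parts.append(f"error: {log.error_text}")
    return ", ".join(parts)
```

A summary like this covers the same ground as the follow-up questions: which model ran, whether the fallback was used, and how many tokens were consumed.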

AI Reviewer verdicts

Each request log links to its scorer run logs, the independent AI Reviewer evaluations that ran after the localization was produced.

"Did the last German localization pass all scorers?"

The assistant retrieves the scorer run logs for a given request and reports each scorer's verdict: pass/fail for boolean scorers or a 0-100 score for percentage scorers, along with the reasoning the reviewer produced.

Scorer run log fields

| Field | What it tells you |
| --- | --- |
| Scorer name | Which AI Reviewer ran |
| Scorer type | boolean (pass/fail) or percentage (0-100) |
| Score result | The verdict and reasoning |
| Provider / model | Which model performed the review |
| Duration | How long the review took |
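
Because the two scorer types report results differently, a check across all scorers has to branch on the type. The sketch below shows one way to do that. The `ScorerRun` class and its attribute names are hypothetical, and the `min_percentage` cutoff is an illustrative assumption rather than a documented default.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ScorerRun:
    """Hypothetical shape of one scorer run log (all names are assumptions)."""
    scorer_name: str
    scorer_type: str             # "boolean" or "percentage"
    result: Union[bool, float]   # True/False, or a score from 0 to 100
    reasoning: str


def failed_scorers(runs, min_percentage=80.0):
    """Return the runs that did not pass.

    min_percentage is an illustrative cutoff, not a documented default.
    """
    failed = []
    for run in runs:
        if run.scorer_type == "boolean":
            passed = bool(run.result)
        else:
            passed = float(run.result) >= min_percentage
        if not passed:
            failed.append(run)
    return failed
```

The reasoning attached to each failed run is what explains why the reviewer flagged the output.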

Glossary compliance

"Were all glossary terms applied correctly in that localization?"

The assistant retrieves the glossary review log for a request, showing each matched glossary term, whether it was applied, and the reasoning if it wasn't.

The report includes:

  • Each source term matched
  • The expected target localization
  • Whether the term is a custom localization or non-translatable
  • Applied or not applied per term
  • Reasoning when a term wasn't applied
  • Overall compliance rate
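
The page does not say how the overall compliance rate is calculated. A simple reading is the share of matched terms that were applied, which the sketch below assumes. The `TermResult` class, its attribute names, and the treatment of an empty report are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermResult:
    """Hypothetical per-term entry in a glossary review (names are assumptions)."""
    source_term: str
    expected_target: str
    kind: str                       # "custom" or "non_translatable"
    applied: bool
    reasoning: Optional[str] = None  # filled in when the term was not applied


def compliance_rate(results):
    """Share of matched terms that were applied, as a percentage."""
    if not results:
        # Assumption: a request with no matched terms counts as fully compliant.
        return 100.0
    applied = sum(1 for r in results if r.applied)
    return 100.0 * applied / len(results)
```

Under this reading, a report with four matched terms and one not applied has a compliance rate of 75%.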

Instruction adherence

"Did the French localization follow the non-breaking space instruction?"

The assistant retrieves the instruction review logs, one entry per instruction that was evaluated against the localization output. Each entry shows the instruction name, the rule text, and a pass/fail verdict with reasoning.

The debugging workflow

A typical post-mortem conversation:

  1. "The German localization of 'checkout flow' looks wrong"
  2. "Show me the request log for that" - see what went in and came out
  3. "Did the glossary apply?" - check if 'checkout' was matched and preserved
  4. "What did the scorers say?" - see if any AI Reviewer flagged it
  5. "The glossary term wasn't matched - update it to also cover 'checkout flow'" - fix the root cause

The entire loop happens in one conversation, without opening the dashboard.
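
The order of checks in that conversation (request log first, then glossary, then scorers) can be sketched as a small triage function. This is only an illustration of the reasoning: the dictionary keys (`status`, `error_text`, `used_fallback`, `source_term`, `applied`, `scorer_name`, `passed`) are hypothetical and do not come from the MCP server's responses.

```python
def triage(request_log, glossary_terms, scorer_runs):
    """Collect findings in the same order as the post-mortem conversation.

    All dictionary keys used here are assumptions made for illustration.
    """
    findings = []

    # Step 2: what went in and what came out.
    if request_log.get("status") == "error":
        findings.append("request failed: " + request_log.get("error_text", "no detail"))
    if request_log.get("used_fallback"):
        findings.append("fallback model handled the request")

    # Step 3: did the glossary apply?
    for term in glossary_terms:
        if not term["applied"]:
            findings.append("glossary term not applied: " + term["source_term"])

    # Step 4: did any AI Reviewer flag the output?
    for run in scorer_runs:
        if not run["passed"]:
            findings.append("scorer flagged output: " + run["scorer_name"])

    return findings
```

An empty result means none of the three checks found a problem, so the issue would need a closer look at the raw input and output data.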
