Run the same content through two engine configurations to evaluate a change before committing to it.
The workflow
"Compare our production engine against the staging engine on these 5 strings for Japanese"
What happens:
- The assistant localizes the content through both engines
- It presents the results in a side-by-side table
- It highlights the differences, for example: "The staging engine applies the new glossary term for 'onboarding' (オンボーディング) while production still uses the descriptive localization (導入手続き)"
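The steps above can be approximated in code. This is a minimal Python sketch, not the assistant's actual implementation: `Engine`, `compare_engines`, `format_table`, and the two stand-in engines are hypothetical names, and the canned outputs simply reuse the 'onboarding' example from the list.

```python
from typing import Callable, Dict, List

# Hypothetical engine interface: (source_text, target_locale) -> localized text.
Engine = Callable[[str, str], str]


def compare_engines(strings: List[str], locale: str,
                    engine_a: Engine, engine_b: Engine) -> List[Dict[str, object]]:
    """Run every string through both engines and record whether the outputs differ."""
    rows = []
    for source in strings:
        out_a = engine_a(source, locale)
        out_b = engine_b(source, locale)
        rows.append({"source": source, "a": out_a, "b": out_b,
                     "differs": out_a != out_b})
    return rows


def format_table(rows, label_a: str = "production", label_b: str = "staging") -> str:
    """Render the comparison rows as a pipe-delimited side-by-side table."""
    lines = [f"| Source | {label_a} | {label_b} | Differs |",
             "| --- | --- | --- | --- |"]
    for row in rows:
        flag = "yes" if row["differs"] else "no"
        lines.append(f"| {row['source']} | {row['a']} | {row['b']} | {flag} |")
    return "\n".join(lines)


# Stand-in engines with canned outputs, for illustration only.
def production_engine(text: str, locale: str) -> str:
    return {"Onboarding": "導入手続き", "Settings": "設定"}.get(text, text)


def staging_engine(text: str, locale: str) -> str:
    return {"Onboarding": "オンボーディング", "Settings": "設定"}.get(text, text)


rows = compare_engines(["Onboarding", "Settings"], "ja",
                       production_engine, staging_engine)
print(format_table(rows))
```

An exact string comparison only flags that two outputs differ. Explaining why they differ (a new glossary term, a different register) still takes the assistant or a human reviewer.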
When to use this
- After tuning: verify that the change improved output before promoting it
- Evaluating model changes: keep the same configuration and swap only the primary model
- Testing glossary impact: compare output with and without the new terms
- Comparing engines for different use cases: for example, marketing vs. technical content
Example comparisons
Before/after a tune
"Localize 'Welcome to your new workspace' to German through engine A and engine B"
This shows whether the updated engine still preserves the glossary entry for "workspace".
Model evaluation
"I switched the Japanese model from GPT-4.1 to Claude Sonnet. Compare outputs for these 10 UI strings."
The side-by-side view shows which model handles short UI strings better and which handles longer descriptions better in your specific domain.
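One way to see where two models diverge is to group the compared strings by source length. The sketch below is an assumption about how such a breakdown could be done, not a feature described here: `divergence_by_length` and the four-word threshold are hypothetical, and the outputs are opaque placeholder tokens rather than real localizations.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def divergence_by_length(pairs: List[Tuple[str, str, str]],
                         short_max_words: int = 4) -> Dict[str, Tuple[int, int]]:
    """Count differing outputs per length bucket.

    pairs holds (source, output_a, output_b) tuples.
    Returns {bucket: (differing_count, total_count)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for source, out_a, out_b in pairs:
        # Bucket by word count of the source string (threshold is an assumption).
        bucket = "short" if len(source.split()) <= short_max_words else "long"
        counts[bucket][1] += 1
        if out_a != out_b:
            counts[bucket][0] += 1
    return {bucket: (diff, total) for bucket, (diff, total) in counts.items()}


# Placeholder data: the outputs are tokens, not real model output.
pairs = [
    ("Save", "out-1", "out-1"),
    ("Cancel upload", "out-2", "out-2b"),
    ("Your changes will be saved automatically every five minutes",
     "out-3", "out-3b"),
]
summary = divergence_by_length(pairs)
print(summary)
```

The counts only show where the two models disagree. Deciding which output is better in each bucket remains a review step.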
Glossary depth testing
"Compare the engine with our full 200-term glossary against a fresh engine with no glossary on these legal strings"
This quantifies how much the glossary contributes to output quality for a specific content type.
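A simple proxy for the glossary's contribution is term adherence: of the glossary terms that occur in the source strings, how many appear in their required target form in the output. The following is a rough sketch under that assumption. `glossary_adherence` is a hypothetical helper, and the one-term glossary and German outputs are made-up sample data that reuse the "workspace" example rather than real engine output.

```python
from typing import Dict, List


def glossary_adherence(sources: List[str], outputs: List[str],
                       glossary: Dict[str, str]) -> float:
    """Fraction of applicable glossary terms whose target form appears in the output."""
    applicable = 0
    honored = 0
    for source, output in zip(sources, outputs):
        for term, target in glossary.items():
            # A term is applicable when it occurs in the source (case-insensitive).
            if term.lower() in source.lower():
                applicable += 1
                # The target form must appear exactly (case-sensitive) in the output.
                if target in output:
                    honored += 1
    return honored / applicable if applicable else 0.0


# Made-up sample data, for illustration only.
glossary = {"workspace": "Workspace"}
sources = ["Welcome to your new workspace", "Delete this workspace?"]
with_glossary = ["Willkommen in Ihrem neuen Workspace",
                 "Diesen Workspace löschen?"]
without_glossary = ["Willkommen in Ihrem neuen Arbeitsbereich",
                    "Diesen Arbeitsbereich löschen?"]

rate_with = glossary_adherence(sources, with_glossary, glossary)
rate_without = glossary_adherence(sources, without_glossary, glossary)
print(f"with glossary: {rate_with:.0%}, without: {rate_without:.0%}")
```

Term adherence measures only whether glossary terms were applied. It says nothing about fluency or accuracy, so it complements the side-by-side review rather than replacing it.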
