Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
2025-10-27
Summary
This paper investigates whether powerful AI models, specifically large reasoning models, can be used to automatically assess the quality of machine translation systems, such as Google Translate. These models are good at 'thinking through' problems, and the researchers wanted to see if that ability could be applied to judging how well a translation captures the meaning of the original text.
What's the problem?
Currently, evaluating machine translation is tricky. While there are automatic metrics, they aren't always good at capturing nuanced errors, and human evaluation is accurate but slow and expensive. The researchers found that when these large reasoning models were used as judges, they often struggled: they tended to overcomplicate simple translations, got confused easily, and generally gave overly positive scores, meaning they weren't accurately identifying problems with the translations.
What's the solution?
To fix this, the researchers 'trained' the large reasoning models to think more like humans when evaluating translations. They did this by showing the models examples of how people actually think through the process of judging a translation – breaking down the steps and reasoning. This helped the models learn to focus on the important aspects and avoid getting bogged down in unnecessary details, effectively making them more efficient and accurate judges.
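The idea of training on human-like thinking trajectories can be sketched as assembling supervised fine-tuning pairs, where each target walks through the errors step by step before giving a score. The function name, field names, trajectory format, and 0-100 scale below are illustrative assumptions for a minimal sketch, not the paper's actual data format:

```python
# Minimal sketch (hypothetical format): build a fine-tuning example that pairs
# an MT-evaluation prompt with a concise, human-like thinking trajectory.
# All names and the scoring scale are assumptions for illustration only.

def build_training_example(source, translation, errors, score):
    """Turn an annotated translation into a (prompt, target) pair for SFT."""
    prompt = (
        "Evaluate the translation quality.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Think step by step, then give a score from 0 to 100."
    )
    # A short, structured trajectory: note each error span and its severity,
    # then conclude with the score -- mirroring how a human annotator reasons
    # instead of producing a long, meandering chain of thought.
    steps = [f"- '{span}' is a {severity} error: {note}"
             for span, severity, note in errors]
    thinking = "<think>\n" + "\n".join(steps) + "\n</think>"
    target = f"{thinking}\nScore: {score}"
    return {"prompt": prompt, "target": target}

example = build_training_example(
    source="Er hat den Zug verpasst.",
    translation="He missed the bus.",
    errors=[("bus", "major", "mistranslation of 'Zug' (train)")],
    score=60,
)
print(example["target"])
```

Training on many such compact, structured trajectories is what nudges the model toward shorter, more focused reasoning at evaluation time.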
Why it matters?
This research is important because it shows a path towards creating a faster, cheaper, and more reliable way to automatically evaluate machine translation. If we can get AI to accurately judge translations, it will help improve the quality of translation systems and make communication across languages easier. The fact that they significantly reduced the amount of 'thinking' the models needed to do while *improving* accuracy is a big step forward in making this practical.
Abstract
Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on the WMT24 Metrics benchmarks demonstrate that this approach reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.