QE4PE: Word-level Quality Estimation for Human Post-Editing
Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza
2025-03-06
Summary
This paper presents QE4PE, a large-scale study of how word-level quality estimation, which highlights potentially erroneous words in machine-translated text, affects the speed and quality of human post-editing.
What's the problem?
Machine translations often contain mistakes, and while word-level quality estimation tools can flag these errors automatically, little is known about whether they actually help professional editors work faster or produce better translations. This leaves a gap between how accurate these tools are on benchmarks and how useful they are in real-world editing tasks.
What's the solution?
The researchers studied how word-level error highlighting impacts human editing by testing it with 42 professional editors working across two translation directions. They compared four ways of presenting error highlights, including predictions from a supervised QE model, highlights derived from the translation model's own uncertainty, and highlights made by humans. They measured how these setups affected editing speed and quality, finding that the text's domain, the language, and the editor's working speed were critical in determining how helpful the highlights were.
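To make the uncertainty-based idea concrete, here is a minimal sketch of one such heuristic: flag target tokens to which the MT model itself assigns low probability. The model name, example sentences, and threshold below are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of an uncertainty-based word-level QE heuristic:
# highlight target tokens whose model probability falls below a threshold.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative model choice; the study used a different state-of-the-art NMT system.
model_name = "Helsinki-NLP/opus-mt-en-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

source = "The committee approved the proposal unanimously."
translation = "Il comitato ha approvato la proposta all'unanimità."

enc = tokenizer(source, return_tensors="pt")
labels = tokenizer(text_target=translation, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Passing labels makes the model score the given translation token by token.
    out = model(**enc, labels=labels)

# Log-probability the model assigns to each token of the translation.
logprobs = torch.log_softmax(out.logits, dim=-1)
token_lp = logprobs[0].gather(-1, labels[0].unsqueeze(-1)).squeeze(-1)

THRESHOLD = -2.5  # assumed cutoff; in practice tuned on held-out data
for tok, lp in zip(tokenizer.convert_ids_to_tokens(labels[0]), token_lp):
    flag = "  <-- highlight as potential error" if lp.item() < THRESHOLD else ""
    print(f"{tok:>15}  {lp.item():6.2f}{flag}")
```

Real word-level QE systems are more sophisticated, for instance supervised models trained on human error annotations, but the principle of turning per-token scores into error-span highlights is the same.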
Why does it matter?
This matters because it bridges the gap between AI tools and professional workflows, showing under what conditions error highlighting actually makes editing more efficient. By improving these tools, such findings could lead to faster, higher-quality translations in industries like publishing, business, and international communication.
Abstract
Word-level quality estimation (QE) detects erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.