Pearmut: Human Evaluation of Translation Made Trivial
Vilém Zouhar, Tom Kocmi
2026-01-08
Summary
This paper introduces Pearmut, a new platform designed to make it much easier to get real people to evaluate how well computer programs understand and generate different languages, with a particular focus on machine translation.
What's the problem?
Evaluating how well a computer program performs tasks like translation is usually done by humans, which is considered the most accurate method. However, getting human feedback is really difficult and time-consuming because it requires a lot of setup, technical expertise, and ongoing management. Because of this, developers often rely on automatic metrics, which aren't always as reliable.
What's the solution?
The researchers created Pearmut, a platform that simplifies the entire process of human evaluation. It supports common evaluation methods and allows for creating new ones, provides important context to the evaluators, includes checks to ensure quality feedback, and even helps decide which people should evaluate which translations. It aims to make human evaluation as straightforward as using automatic evaluation tools.
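To illustrate the kind of quality check mentioned above, one common trick is to slip a deliberately corrupted translation into the evaluation and confirm that annotators rate it poorly. The sketch below is a hypothetical example of this idea, not Pearmut's actual code or API; all names are invented for illustration.

```python
import random

def make_attention_check(reference_translation, rng=None):
    """Corrupt a reference translation by shuffling its words.

    Hypothetical quality-control item: an annotator who rates this
    obviously broken output highly is likely not paying attention.
    """
    rng = rng or random.Random(0)
    words = reference_translation.split()
    rng.shuffle(words)
    return " ".join(words)

def passes_check(score, threshold=40):
    """Attention-check items should receive low scores (0-100 scale)."""
    return score < threshold

# Example: corrupt a reference sentence and verify a low rating passes.
corrupted = make_attention_check("the quick brown fox jumps over the lazy dog")
print(corrupted)           # same words, scrambled order
print(passes_check(15))    # a low score on the broken item -> True
```

The exact corruption strategy (word shuffling, sentence swapping, off-topic text) varies between evaluation campaigns; the key is that the expected judgment is known in advance.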
Why it matters?
Pearmut makes reliable human evaluation a practical part of developing and improving language models. Instead of being a rare event, it can become a regular step in the process, leading to better and more accurate AI systems that can truly understand and communicate in multiple languages.
Abstract
Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped and substituted with automatic metrics, because it is notoriously complex and slow to set up with existing tools, carrying substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is also extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
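The abstract mentions both static and active learning-based assignment strategies. As a minimal sketch of the active-learning idea (not Pearmut's actual implementation; all names here are hypothetical), an assigner might route annotators toward the segments whose collected scores are most uncertain:

```python
import statistics

def next_segment(scores_by_segment, min_votes=2):
    """Pick the next segment to send to an annotator.

    scores_by_segment: dict mapping segment id -> list of 0-100 scores
    collected so far. Segments with fewer than `min_votes` scores are
    served first; after that, the segment whose scores disagree most
    (highest variance) is the most informative to re-annotate.
    """
    under_annotated = [s for s, v in scores_by_segment.items() if len(v) < min_votes]
    if under_annotated:
        return under_annotated[0]
    return max(scores_by_segment,
               key=lambda s: statistics.pvariance(scores_by_segment[s]))

scores = {
    "seg1": [80, 82],   # annotators agree
    "seg2": [30, 90],   # annotators disagree -> most informative
    "seg3": [55, 60],
}
print(next_segment(scores))  # -> "seg2"
```

A static strategy, by contrast, would fix the annotator-to-segment mapping up front; the active variant adapts as judgments arrive, spending annotation budget where it reduces uncertainty most.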