DiscoX: Benchmarking Discourse-Level Translation Tasks in Expert Domains
Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang
2025-11-17
Summary
This paper focuses on the difficulty of accurately evaluating how well machines translate complex, specialized texts, specifically from Chinese to English.
What's the problem?
Current evaluation methods are not good at judging translation quality at the level of a whole document: whether the text as a whole is coherent and whether it uses the correct technical terms, especially in specialized fields like science or engineering. Most automatic evaluation tools only check whether individual sentences are accurate and read naturally, so they miss larger problems with how the whole text flows and whether it faithfully conveys expert knowledge.
What's the solution?
The researchers created a new, challenging test set called DiscoX, which contains 200 professionally curated Chinese texts from 7 expert domains, averaging more than 1,700 tokens each. They also developed Metric-S, an automatic evaluation tool that scores translations without needing a human reference translation, assessing accuracy, how naturally the text reads, and whether the style and terminology suit the subject matter. Metric-S matches human judgments more closely than existing tools.
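To make the evaluation idea concrete, here is a minimal sketch of a reference-free, rubric-based judge in the spirit of Metric-S. The paper does not publish Metric-S's prompt or scoring pipeline here, so the rubric wording, model choice, and JSON output format below are illustrative assumptions rather than the actual implementation.

```python
# Minimal sketch of a reference-free, rubric-based translation judge.
# The rubric text, dimensions, and model are assumptions, not Metric-S itself.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """You are an expert Chinese-English translation evaluator.
Given a Chinese source document and its English translation (no reference
translation is available), score the translation from 1 (poor) to 5
(excellent) on each dimension:
- accuracy: fidelity to the source, including technical terminology
- fluency: grammatical, natural English that is coherent across the document
- appropriateness: register and terminology suited to the expert domain
Respond with JSON only, e.g. {"accuracy": 4, "fluency": 5, "appropriateness": 3}."""

def judge_translation(source_zh: str, translation_en: str,
                      model: str = "gpt-4o") -> dict:
    """Return per-dimension scores for one source/translation pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"SOURCE (Chinese):\n{source_zh}\n\n"
                f"TRANSLATION (English):\n{translation_en}")},
        ],
        temperature=0.0,  # deterministic scoring for reproducibility
    )
    return json.loads(response.choices[0].message.content)

# Example: collapse the three dimensions into one document-level score.
scores = judge_translation(source_zh="...", translation_en="...")
overall = sum(scores.values()) / len(scores)
```

Because no reference translation is required, a judge like this can score long professional documents for which no single "gold" translation exists, which is exactly the setting DiscoX targets.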
Why it matters?
This work is important because it shows that even the best current AI translation systems still have a long way to go before they can produce translations that are as good as those created by human experts in specialized fields. The new test set and evaluation tool will help researchers develop and improve machine translation systems for professional use, ultimately making it easier to share knowledge across languages.
Abstract
The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally curated texts from 7 domains, with an average length exceeding 1,700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.
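The abstract's claim that Metric-S shows "strong consistency with human judgments" is the kind of meta-evaluation usually quantified by correlating metric scores with human ratings over the same translations. A minimal sketch, assuming Spearman rank correlation (the specific statistic is an assumption, and the data below is hypothetical):

```python
# Sketch of metric-human agreement measurement; the choice of Spearman's rho
# and all numbers here are illustrative assumptions, not the paper's results.
from scipy.stats import spearmanr

human_scores  = [4.5, 3.0, 2.5, 4.0, 3.5]   # hypothetical expert ratings
metric_scores = [4.2, 3.1, 2.8, 4.4, 3.3]   # hypothetical Metric-S outputs

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman correlation: {rho:.3f} (p={p_value:.3f})")
```

A rank correlation is a natural choice here because it rewards a metric for ordering translations the same way human experts do, even if the absolute score scales differ.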