DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie

2025-10-15

Summary

This paper investigates how well large language models (LLMs) translate web novels, a type of writing that has unique challenges compared to other texts.

What's the problem?

Current methods for judging machine translation quality don't capture what makes a good web novel translation: they focus on word-for-word accuracy rather than on preserving the story's feel, its cultural references, and how characters come across. Existing benchmarks don't adequately assess whether a translation reads naturally and stays true to the original story's spirit.

What's the solution?

The researchers created DITING, an evaluation framework built specifically for web novel translation. It assesses six key areas: how idioms are handled, how words with multiple meanings are resolved, whether genre-specific terminology is localized appropriately, whether tense stays consistent, whether dropped pronouns (common in Chinese) are correctly recovered, and whether the translation is culturally safe. They also developed AgentEval, an automatic assessment method in which multiple reasoning agents mimic how human experts would debate and judge translation quality, and MetricAlign, a meta-evaluation dataset for comparing different evaluation metrics. Using these resources, they tested fourteen LLMs, including models trained in China and models trained elsewhere.
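To make the "expert deliberation" idea concrete, here is a minimal sketch of a multi-agent scorer in the spirit of AgentEval. Everything here is an assumption for illustration: the real framework uses LLM-based agents, whereas this toy version represents each agent as fixed per-dimension scores (0-10) and models debate as repeated revision toward the panel consensus. Only the six dimension names come from the paper.

```python
# Illustrative multi-agent deliberation scorer (NOT the paper's implementation).
# Assumption: each agent holds per-dimension scores and, in each debate round,
# moves halfway toward the panel mean, mimicking experts converging on consensus.
from statistics import mean

# The six DITING evaluation dimensions, as named in the paper.
DIMENSIONS = [
    "idiom_translation", "lexical_ambiguity", "terminology_localization",
    "tense_consistency", "zero_pronoun_resolution", "cultural_safety",
]

def deliberate(agent_scores, rounds=2):
    """Run a fixed number of debate rounds; return per-dimension consensus
    scores and an overall quality score for one translated sentence pair."""
    scores = [dict(s) for s in agent_scores]  # don't mutate the inputs
    for _ in range(rounds):
        consensus = {d: mean(s[d] for s in scores) for d in DIMENSIONS}
        # Each agent revises its score toward the current consensus.
        scores = [{d: (s[d] + consensus[d]) / 2 for d in DIMENSIONS}
                  for s in scores]
    final = {d: round(mean(s[d] for s in scores), 2) for d in DIMENSIONS}
    overall = round(mean(final.values()), 2)
    return final, overall

# Example: three hypothetical agents judging one sentence pair.
agents = [
    {d: 8 for d in DIMENSIONS},
    {d: 6 for d in DIMENSIONS},
    {d: 7 for d in DIMENSIONS},
]
final, overall = deliberate(agents)
print(overall)  # -> 7.0 (agents converge on the panel mean)
```

The design point this sketch captures is that the final score emerges from interaction between judges rather than from lexical overlap with a reference, which is why such methods can correlate better with human judgments.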

Why it matters?

This work is important because it provides a better way to evaluate machine translation for web novels, a genre with a large and growing readership. The evaluation shows that Chinese-trained LLMs outperform larger models trained elsewhere on this task, with DeepSeek-V3 delivering the most faithful and stylistically coherent translations. The publicly released resources will support future research in this area and ultimately lead to better translations for readers.

Abstract

Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.