DocReward: A Document Reward Model for Structuring and Stylizing

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

2025-10-14

Summary

This paper introduces DocReward, a new system designed to automatically assess the quality of documents, focusing on how they *look* rather than just what they *say*. It aims to help computer programs create better-looking and more professional documents.

What's the problem?

Current AI systems that write documents automatically are good at making the text grammatically correct and coherent, but they often ignore formatting, layout, and overall visual appeal. These visual elements are crucial for making a document easy to read and engaging. The main issue is that there were no good reward models to 'teach' these AI systems what makes a document look professional, because building a system that can judge visual quality is hard.

What's the solution?

The researchers created DocReward, which is essentially a scoring system for documents. They built a dataset of over 117,000 document pairs, where each pair contains a 'professional' and an 'unprofessional' version of the same content. This allowed DocReward to learn which features contribute to a polished look. The system was trained to consistently assign the higher score to the more professional-looking document in each pair. They also created a benchmark for testing how well reward models perform, using rankings from human evaluators as the ground truth.
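Training a model to "pick the more professional document in a pair" uses the Bradley-Terry loss mentioned in the abstract: the model assigns each document a scalar score, and the loss penalizes pairs where the unprofessional version is scored too close to (or above) the professional one. Here is a minimal, numerically stable sketch of that pairwise objective; the function name and the scalar inputs are illustrative stand-ins for the reward model's actual outputs, not the paper's implementation.

```python
import math

def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred document outranks the rejected one.

    Under the Bradley-Terry model, P(preferred beats rejected) =
    sigmoid(score_preferred - score_rejected). The loss is -log of that
    probability, written here in softplus form for numerical stability.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# A wide positive margin (professional doc scored higher) gives a small loss;
# an inverted margin gives a large one, pushing the scores apart.
low_loss = bradley_terry_loss(2.0, 0.5)   # correct ranking, confident
high_loss = bradley_terry_loss(0.5, 2.0)  # inverted ranking
```

At equal scores the loss is log 2, so any pair the model cannot separate still contributes gradient, which is what drives the scores of the professional and unprofessional versions apart during training.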

Why it matters?

This work is important because it allows AI to create documents that aren't just accurate, but also visually appealing and easy to use. DocReward outperforms existing models like GPT-4o and GPT-5 at judging document quality (by 30.6 and 19.4 percentage points in accuracy, respectively), and when used to guide generation it produces documents that people prefer over those created by other systems. This could have a big impact on report writing, marketing, and any other area where clear, professional documents are essential.

Abstract

Recent advances in agentic workflows have enabled the automation of tasks such as professional document generation. However, they primarily focus on textual quality, neglecting visual structure and style, which are crucial for readability and engagement. This gap arises mainly from the absence of suitable reward models to guide agentic workflows toward producing documents with stronger structural and stylistic quality. To address this, we propose DocReward, a document reward model that evaluates documents based on their structure and style. We construct a multi-domain dataset DocPair of 117K paired documents, covering 32 domains and 267 document types, each including a high- and low-professionalism document with identical content but different structure and style. This enables the model to evaluate professionalism comprehensively, and in a textual-quality-agnostic way. DocReward is trained using the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking. To assess the performance of reward models, we create a test dataset containing document bundles ranked by well-educated human evaluators. Notably, DocReward outperforms GPT-4o and GPT-5 in accuracy by 30.6 and 19.4 percentage points, respectively, demonstrating its superiority over baselines. In an extrinsic evaluation of document generation, DocReward achieves a significantly higher win rate of 60.8%, compared to GPT-5's 37.7% win rate, demonstrating its utility in guiding generation agents toward producing human-preferred documents.