Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, Letian Ni, Yuyang Cheng, Qiguang Chen, Jingzhe Ding, Shengda Long, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Ge Zhang, Wenhao Huang, Wanxiang Che
2025-10-17
Summary
This paper investigates how well AI models can understand what makes creative writing *good*, beyond just whether it's factually correct. It finds that current methods struggle with subjective qualities like creativity and style, and instead focus on identifying obvious errors.
What's the problem?
AI models are really good at learning preferences when it's easy to tell if something is right or wrong. However, when judging creative writing, where 'good' is more about style and feeling than facts, these models perform surprisingly poorly. They basically fail to grasp what humans actually prefer in writing, even when the options are already confirmed to be factually accurate and of similar length.
What's the solution?
The researchers created a new dataset called WritingPreferenceBench, containing 1,800 examples of human preferences for different pieces of writing across eight genres, in both English and Chinese. They then tested different types of AI models on this dataset. They found that models that *explain their reasoning* before making a judgment (generative reward models) did much better than models that just directly score a preferred option (sequence-based reward models). They also observed that performance varied greatly depending on the writing genre, and that larger models didn't consistently perform better than smaller ones.
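The core evaluation described above can be sketched as pairwise preference accuracy: a reward model assigns each response a scalar score, and a pair counts as correct when the human-preferred response scores higher than the rejected one. The sketch below is a hypothetical toy illustration, not the paper's code; the `score` function is a stand-in proxy (vocabulary variety), whereas a real sequence-based reward model would return a learned scalar and a generative reward model would first emit a reasoning chain.

```python
# Minimal sketch of pairwise preference-accuracy evaluation, in the
# spirit of WritingPreferenceBench. The scoring function and the data
# below are hypothetical stand-ins, not from the paper.

def score(text: str) -> float:
    """Hypothetical reward model: a toy proxy that rewards vocabulary
    variety (type/token ratio). A real sequence-based RM would output
    a learned scalar instead."""
    words = text.lower().split()
    return len(set(words)) / max(len(words), 1)

def preference_accuracy(pairs):
    """Fraction of pairs where the model scores the human-preferred
    response above the rejected one."""
    correct = sum(score(chosen) > score(rejected)
                  for chosen, rejected in pairs)
    return correct / len(pairs)

# Toy preference pairs: (human-preferred, rejected), roughly matched
# in length, mirroring the benchmark's length-controlled setup.
pairs = [
    ("the sea kept its slow silver secrets",
     "the sea was big and the sea was wet"),
    ("rain stitched the window with thin light",
     "rain fell on the window again again again"),
]
print(f"accuracy: {preference_accuracy(pairs):.2f}")  # prints "accuracy: 1.00"
```

The benchmark's central finding is that a score assigned *after* an explicit reasoning chain (generative RM) separates such pairs far more reliably than a direct scalar classifier does.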
Why it matters?
This research highlights a major limitation of current AI training techniques, specifically Reinforcement Learning from Human Feedback (RLHF). It shows that simply training models to mimic human choices isn't enough when dealing with subjective qualities. To build AI that can truly understand and generate high-quality creative content, we need to focus on getting models to reason about *why* something is good, not just *that* it is.
Abstract
Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models (the standard architecture for RLHF) achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.