LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
2025-07-07
Summary
This paper introduces LitBench, a new benchmark and dataset designed to reliably evaluate the creative writing abilities of large language models. It compares several automated judging approaches against human-labeled story preferences, including zero-shot LLM judges, Bradley-Terry reward models trained on pairwise human preferences, and generative (LLM-based) reward models.
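For context on the Bradley-Terry approach named above (written here in standard preference-modeling notation, not lifted from the paper itself): a learned scalar reward $r_\theta$ turns a pairwise comparison between two stories $a$ and $b$ written for the same prompt $x$ into a preference probability,

$$
P(a \succ b \mid x) = \sigma\!\left(r_\theta(x, a) - r_\theta(x, b)\right),
$$

where $\sigma$ is the logistic function; the reward model is trained so that the story humans preferred receives the higher score.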
What's the problem?
The problem is that automatically judging creative writing is hard: quality is subjective, and existing evaluation methods are inconsistent or poorly aligned with human readers, which makes it difficult to fairly compare different models' writing.
What's the solution?
The researchers built LitBench from human-labeled comparisons of creative stories, providing both a held-out test set for benchmarking automated judges and a training corpus of preference pairs. Using that data, they trained dedicated reward models, both Bradley-Terry and generative, and benchmarked them alongside zero-shot LLM judges; the trained reward models agreed with human preferences more often than common off-the-shelf judges did. A minimal training sketch follows below.
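The sketch below shows one way to train a Bradley-Terry reward model on pairwise story preferences, in the spirit of what the paper describes. The backbone model name, the `train_step` helper, and the example pair are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: Bradley-Terry reward-model training on preference pairs.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # placeholder backbone (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(prompt: str, story: str) -> torch.Tensor:
    """Scalar reward r_theta(prompt, story) from the single-logit head."""
    inputs = tokenizer(prompt, story, truncation=True, max_length=512,
                       return_tensors="pt")
    return model(**inputs).logits.squeeze(-1)

def train_step(prompt: str, chosen: str, rejected: str) -> float:
    """One Bradley-Terry update: push the human-preferred story's score
    above the rejected one's via -log sigmoid(r_chosen - r_rejected)."""
    loss = -F.logsigmoid(score(prompt, chosen) - score(prompt, rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage on a single preference pair:
# loss = train_step("Write a story about a lighthouse keeper.",
#                   chosen="The lamp had not failed in forty years...",
#                   rejected="Once there was a lighthouse. The end.")
```

At evaluation time, the learned `score` function can rank two candidate stories for the same prompt, and accuracy is measured as how often that ranking matches the human label.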
Why it matters?
This matters because it improves how we measure AI creativity, enabling the development of language models that produce more engaging stories, poems, and other creative texts, with applications in art and education.
Abstract
LitBench is a standardized benchmark and paired dataset of human-labeled story comparisons for evaluating creative writing by LLMs. It is used to benchmark zero-shot LLM judges and to train Bradley-Terry and generative reward models, and the trained reward models show higher agreement with human preferences than off-the-shelf judges.