DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Shijian Ma, Yunqi Huang, Yan Lin

2025-12-25

Summary

This paper introduces a new way to test how well AI models can write continuations of drama scripts, focusing on whether the writing feels natural and makes sense.

What's the problem?

Currently, there aren't good tests to see if a computer can actually write a good continuation of a play script. Existing tests don't check for things like keeping characters acting the same way, making the story move forward logically, or maintaining the overall structure of a drama. Basically, it's hard to tell if a computer is writing something that *feels* like a real play continuation.

What's the solution?

The researchers created a benchmark called DramaBench. This benchmark looks at six specific qualities in the computer-generated script continuations: following standard play formatting, making the story progress efficiently, keeping characters consistent, showing believable emotions, ensuring logical consistency, and handling conflicts well. They used a mix of automated checks and human evaluation to make sure the results were fair and reliable. They then tested eight state-of-the-art language models on 1,103 play scripts.
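To make the six-dimension idea concrete, here is a minimal sketch of how per-dimension scores might be stored and aggregated. This is an illustration only: the class name, field names, and the unweighted-mean aggregation are assumptions for the example, not the paper's actual scoring code (the paper reports dimensions separately precisely so that feedback stays dimension-specific).

```python
from dataclasses import dataclass, fields
from statistics import mean

@dataclass
class DimensionScores:
    """One score per DramaBench-style dimension, normalized to [0, 1].
    Field names are illustrative, not taken from the paper's code."""
    format_standards: float
    narrative_efficiency: float
    character_consistency: float
    emotional_depth: float
    logic_consistency: float
    conflict_handling: float

    def as_dict(self) -> dict[str, float]:
        # Collect all six dimension scores by field name.
        return {f.name: getattr(self, f.name) for f in fields(self)}

def overall_score(scores: DimensionScores) -> float:
    """Unweighted mean across the six dimensions (a simplification;
    keeping the per-dimension breakdown is what makes the feedback
    actionable)."""
    return mean(scores.as_dict().values())

# Hypothetical scores for one generated continuation.
continuation = DimensionScores(0.9, 0.7, 0.8, 0.6, 0.85, 0.75)
print(round(overall_score(continuation), 3))  # 0.767
```

The point of keeping the scores in separate fields rather than collapsing them immediately is the same as the benchmark's: a low `character_consistency` with a high `format_standards` tells a developer something a single aggregate number cannot.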

Why it matters?

This work is important because it provides a much more thorough and objective way to evaluate how good computers are at creative writing, specifically in the realm of drama. It doesn't just produce a single score; it tells developers *how* their models can improve in specific areas like character consistency or plot progression, ultimately pushing the field of AI-assisted creative writing forward.

Abstract

Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.