Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie
2026-03-10
Summary
This paper investigates a failure mode of large language models (LLMs): when they write long stories, they often forget details they have already established, leading to inconsistencies.
What's the problem?
LLMs are getting really good at writing long pieces of text, like stories, but they struggle to keep everything straight. They might change a character's name halfway through, or say something happened on Tuesday when it actually happened on Monday. Current ways of testing these models focus on how *good* the story is, not whether it makes sense from beginning to end, so these consistency issues haven't been thoroughly examined.
What's the solution?
The researchers created a new benchmark called ConStory-Bench. It gives LLMs 2,000 story prompts across four task scenarios and then checks whether the stories they write contradict themselves, using a taxonomy of five error categories broken down into 19 fine-grained kinds of inconsistency. They also built ConStory-Checker, an automated pipeline that finds these contradictions and points to the exact passages of the story where they happen. They then evaluated several LLMs with this benchmark to see where and why these errors occur.
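The key idea behind the checker is that every detected contradiction is grounded in explicit textual evidence rather than reported as a bare verdict. A minimal illustrative sketch of what such a grounded judgment could look like (the field names, category labels, and toy story below are hypothetical, not the paper's actual schema):

```python
from dataclasses import dataclass

# Hypothetical top-level labels; the paper defines five error categories
# with 19 subtypes, but does not use these exact names.
CATEGORIES = {"factual", "temporal", "character", "world_rule", "plot"}

@dataclass
class Contradiction:
    category: str          # one of the five top-level error categories
    subtype: str           # fine-grained inconsistency type
    claim_span: tuple      # (start, end) character offsets of the first statement
    conflict_span: tuple   # (start, end) offsets of the contradicting statement
    explanation: str       # why the two spans conflict

def evidence(story: str, c: Contradiction) -> tuple:
    """Return the two text excerpts a judgment is grounded in."""
    return (story[slice(*c.claim_span)], story[slice(*c.conflict_span)])

story = "Mara was born on Monday. ... Mara celebrated her Tuesday birth."
c = Contradiction(
    category="temporal",
    subtype="date_conflict",
    claim_span=(0, 24),
    conflict_span=(29, 63),
    explanation="Birth day stated as Monday, later as Tuesday.",
)
print(evidence(story, c))
# → ('Mara was born on Monday.', 'Mara celebrated her Tuesday birth.')
```

Tying each error to concrete spans is what lets the evaluation be audited: a reader can check the two excerpts directly instead of trusting an opaque score.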
Why it matters?
Understanding where and why LLMs make these consistency errors is crucial for improving them. The research found that errors are most common with facts and timelines, tend to appear in the middle of stories, and happen more often in text segments the model finds less predictable (higher token-level entropy). This knowledge can help developers build LLMs that tell longer, more coherent stories.
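The "less predictable" finding refers to token-level entropy: given a model's next-token probability distribution, its Shannon entropy measures how uncertain the model is about the continuation. A minimal sketch with toy distributions (not the paper's actual measurement pipeline):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A peaked distribution (predictable continuation) vs. a flat one (unpredictable).
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]

print(token_entropy(flat))                             # → 2.0 bits
print(token_entropy(peaked) < token_entropy(flat))     # → True
```

Averaging this quantity over the tokens of a story segment gives a simple "unpredictability" score for that segment, which is the kind of signal the paper correlates with where consistency errors appear.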
Abstract
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.