Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
2024-09-09

Summary
This paper introduces Spinning the Golden Thread (SGT), a new benchmark for evaluating how well language models can generate long pieces of text.
What's the problem?
Current tests for long-context language models, such as the "Needle-in-a-Haystack" test, focus on finding specific information within long input texts, but they do not measure how well these models can create coherent and meaningful long-form text. That ability matters for tasks like writing stories or design proposals, where the quality of the generated text is what counts.
What's the solution?
To address this issue, the authors created the SGT benchmark, which prompts language models to generate long-form text that must include specific events or constraints and then checks whether those elements actually appear. They evaluated ten long-context language models across four scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Their findings showed that while these models did well on Needle-in-a-Haystack-style tests, none performed satisfactorily on SGT, and performance dropped further as the required generation length increased, indicating that current models often fail to produce high-quality, coherent long-form text that follows instructions.
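To give a rough sense of the kind of check SGT performs, the sketch below scores a generation by how many required events it mentions. This is only an illustration, not the authors' evaluation code: the function name `event_coverage` and the simple keyword-matching rule are assumptions, and the actual benchmark may use more sophisticated matching.

```python
# Illustrative sketch only: checks whether required events/constraints
# appear in a model's long-form output. Names and the matching rule are
# hypothetical and not taken from the SGT paper or its codebase.

def event_coverage(generated_text: str, required_events: list[str]) -> float:
    """Return the fraction of required events mentioned in the text.

    Assumes each event can be detected by a case-insensitive substring
    match; a real evaluator might instead use an LLM judge or semantic
    similarity to decide whether an event was incorporated.
    """
    text = generated_text.lower()
    hits = sum(1 for event in required_events if event.lower() in text)
    return hits / len(required_events) if required_events else 0.0


if __name__ == "__main__":
    # Toy example: a story that was asked to include three specific events.
    story = "... the knight found the silver key and crossed the frozen river ..."
    events = ["silver key", "frozen river", "dragon's oath"]
    print(f"Event coverage: {event_coverage(story, events):.2f}")  # 0.67
```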
Why it matters?
This research is important because it highlights the need for better evaluation methods for language models, especially as they are increasingly used in creative and professional writing. By improving how we assess these models, we can develop more effective tools for generating high-quality text that meets user needs.
Abstract
The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information ("needle") within large text sequences ("haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the Spinning the Golden Thread benchmark, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.