CLIPPER: Compression enables long-context synthetic data generation
Chau Minh Pham, Yapei Chang, Mohit Iyyer
2025-02-21
Summary
This paper introduces CLIPPER, a new way to create high-quality synthetic data for training AI models to understand and reason about long pieces of text, like books. It's like teaching a computer to read a book and answer questions about it more accurately.
What's the problem?
AI developers need lots of training data for their models, but creating good-quality data for tasks that involve understanding long texts, like books, is really hard. When they try to generate this data directly from the book's full text, it often ends up low quality or riddled with unrealistic artifacts.
What's the solution?
The researchers created CLIPPER, which first compresses a book into chapter outlines and an overall summary. It then uses these compressed representations, instead of the full book text, to generate realistic claims about the book along with step-by-step reasoning for verifying them. This produces better-quality data than generating claims directly from the raw text. Using CLIPPER, they built a dataset of 19,000 book-related claims and used it to fine-tune AI models.
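The two-stage pipeline described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's actual code: `call_llm` is a stand-in for a real LLM API call, and the prompt wording is invented for clarity.

```python
# Hypothetical sketch of CLIPPER's compress-then-generate pipeline.
# `call_llm` is a placeholder; a real system would query an LLM API here.

def call_llm(prompt: str) -> str:
    """Stand-in for a language-model call; returns a canned string."""
    first_line = prompt.splitlines()[0][:50]
    return f"[model output for: {first_line}]"

def compress_book(chapters: list[str]) -> tuple[list[str], str]:
    """Stage 1: compress each chapter into an outline, then summarize the book."""
    outlines = [call_llm(f"Outline this chapter:\n{ch}") for ch in chapters]
    book_summary = call_llm(
        "Summarize the book given these chapter outlines:\n" + "\n".join(outlines)
    )
    return outlines, book_summary

def generate_claim(outlines: list[str], book_summary: str) -> str:
    """Stage 2: use the compressed representations (not the raw book text)
    to generate a claim plus chain-of-thought reasoning."""
    prompt = (
        "Book summary:\n" + book_summary + "\n\n"
        "Chapter outlines:\n" + "\n".join(outlines) + "\n\n"
        "Write one TRUE or FALSE claim about the book and a chain-of-thought "
        "explaining how to verify it against the text."
    )
    return call_llm(prompt)

chapters = ["Chapter 1 text...", "Chapter 2 text..."]
outlines, summary = compress_book(chapters)
claim = generate_claim(outlines, summary)
```

The key design choice this sketch reflects is that the claim-generation prompt never sees the raw book text, only the much shorter outlines and summary, which is what keeps the generated claims grounded and free of length-induced artifacts.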
Why it matters?
This matters because it helps AI get much better at understanding and answering questions about long texts like books. Models fine-tuned on CLIPPER's data jumped from 28% to 76% accuracy on the authors' test set of book-related claims. This could lead to AI assistants that are much better at tasks requiring complex information from long documents, like studying literature or doing research.
Abstract
LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).