A Controlled Study on Long Context Extension and Generalization in LLMs
Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
2024-09-19

Summary
This paper presents a controlled study of methods for extending language models to longer contexts, that is, letting them take more text into account at once when understanding or generating a response.
What's the problem?
Language models need full document contexts to understand text and learn from in-context examples effectively. Training models directly on long contexts is expensive, so many methods have been proposed for extending existing models instead. Because these methods are built on different base models and training data, there has been no fair way to compare them, leaving it unclear how well extended models actually perform on long inputs.
What's the solution?
The researchers ran a controlled study: every context-extension method was applied to the same base models with the same extension data and evaluated under a standardized protocol. They found that perplexity remains a useful general-purpose indicator of performance even on long-context tasks. They also found that current approximate-attention methods systematically underperform on long-context tasks, while exact fine-tuning-based methods work well within the context length they were extended to, though extrapolating beyond that length remains difficult.
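To make the perplexity finding concrete, here is a minimal sketch of how long-context perplexity might be measured. This is an assumed setup for illustration, not the paper's actual evaluation harness: the model name, window length, and non-overlapping chunking scheme are all placeholder choices.

```python
# Sketch: score a long document with a causal LM by feeding it in windows of a
# fixed context length and averaging token-level negative log-likelihood.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
CONTEXT_LEN = 4096                       # evaluation window length (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def long_context_perplexity(text: str) -> float:
    """Perplexity of `text`, computed over non-overlapping windows."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_nll, total_targets = 0.0, 0
    with torch.no_grad():
        for start in range(0, ids.size(0), CONTEXT_LEN):
            window = ids[start:start + CONTEXT_LEN].unsqueeze(0)
            if window.size(1) < 2:   # need at least one next-token target
                break
            out = model(window, labels=window)   # loss = mean NLL over window
            n_targets = window.size(1) - 1
            total_nll += out.loss.item() * n_targets
            total_targets += n_targets
    return math.exp(total_nll / total_targets)

print(long_context_perplexity(open("long_document.txt").read()))
```

A lower perplexity means the model assigns higher probability to the held-out text; the paper's finding is that this simple quantity tracks downstream long-context performance better than one might expect.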
Why it matters?
This research is significant because it clarifies how to effectively extend the capabilities of language models, which is crucial for tasks that require understanding large amounts of text, such as summarizing documents or answering questions about lengthy articles. By releasing the code, models, and checkpoints as open source, the authors also make it easier for others to reproduce and build on these results.
Abstract
Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
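As an illustration of what an "exact" extension method can look like (the specific methods evaluated are not named in this summary, so this is a hedged example rather than the paper's recipe), the sketch below shows linear position interpolation for rotary position embeddings (RoPE): positions in the extended window are rescaled so they stay within the range the model saw during pretraining, after which the model is typically fine-tuned on long sequences.

```python
# Illustrative sketch, not necessarily a method from the paper: linear position
# interpolation for RoPE. A `scale` < 1 compresses position indices so an
# extended context reuses the rotation angles learned at the original length.
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Rotation angles for RoPE; returns a (seq_len, head_dim // 2) tensor."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale   # rescaled positions
    return torch.outer(positions, inv_freq)

# Extending a model pretrained at 4k tokens to 16k: compress positions by 4x so
# the largest angle matches what the model saw at 4k (values here are assumptions).
pretrained_len, extended_len = 4096, 16384
angles = rope_angles(extended_len, head_dim=128,
                     scale=pretrained_len / extended_len)
```

The abstract's finding is that methods of this fine-tuned, exact-attention flavor tend to work well up to the length they were extended to, while performance beyond that length, and with approximate-attention alternatives, remains unreliable.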