LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Zhiyuan Hu, Yuliang Liu, Jinman Zhao, Suyuchen Wang, Yan Wang, Wei Shen, Qing Gu, Anh Tuan Luu, See-Kiong Ng, Zhiwei Jiang, Bryan Hooi
2024-09-04

Summary
This paper introduces LongRecipe, a new training method designed to help large language models (LLMs) handle much longer pieces of text without requiring large amounts of additional compute.
What's the problem?
Large language models often struggle with tasks that involve long contexts because their context window, the span of text they can attend to at once, is limited by how they were pretrained. While it is possible to extend this window after pretraining, doing so usually demands a great deal of computing power and resources, making it impractical for many applications.
What's the solution?
LongRecipe introduces an efficient training strategy that lets LLMs generalize to longer sequences by combining impactful token analysis, position index transformation, and training optimizations. These techniques let the model simulate long-text inputs while training on much shorter sequences, using only about 30% of the target context window size. Experiments show that LongRecipe can extend a model's effective context window from 8,000 to 128,000 tokens while cutting training costs by more than 85% compared with full-sequence training.
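To make the "position index transformation" idea concrete, below is a minimal sketch of the general concept: a short training chunk is assigned position indices that span a much larger target window, so the model is exposed to long-range positions without processing a full-length sequence. This is a hypothetical, PoSE-style illustration (the function name and the two-chunk skip scheme are assumptions), not LongRecipe's exact algorithm.

```python
import torch

def transform_position_ids(seq_len: int, target_len: int, generator=None) -> torch.Tensor:
    """Map a short training sequence of `seq_len` tokens onto position indices
    drawn from a much larger window of `target_len` positions.

    Hypothetical sketch of the general idea behind position index
    transformation; not the paper's exact procedure.
    """
    assert seq_len <= target_len
    # Split the short sequence into two chunks and insert a random gap between
    # them, so the position indices cover the full target window even though
    # the model only attends over `seq_len` real tokens.
    split = torch.randint(1, seq_len, (1,), generator=generator).item()
    skip = torch.randint(0, target_len - seq_len + 1, (1,), generator=generator).item()

    first = torch.arange(0, split)
    second = torch.arange(split + skip, split + skip + (seq_len - split))
    return torch.cat([first, second])


# Example: a 2,048-token training chunk "pretends" to live in a 16,384-token window.
pos_ids = transform_position_ids(seq_len=2048, target_len=16384)
print(pos_ids.min().item(), pos_ids.max().item())  # indices can reach up to ~16k
```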
Why it matters?
This research is important because it makes it easier for AI models to understand and generate longer texts, which is crucial for tasks like summarizing books or analyzing lengthy documents. By improving how LLMs handle long contexts, LongRecipe can enhance various applications in education, research, and content creation.
Abstract
Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce **LongRecipe**, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full-sequence training. Furthermore, LongRecipe also preserves the original LLM's capabilities in general tasks. Ultimately, *we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80GB of memory.* Our code is released at the [link](https://github.com/zhiyuanhubj/LongRecipe).
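As a rough back-of-the-envelope check on the resource claim (this calculation is ours, not from the paper), the quadratic self-attention term alone already shrinks by roughly this much when training sequences are only 30% of the target window:

```python
# Back-of-the-envelope sketch (not from the paper): compare the quadratic
# self-attention cost of training directly on the full 128k target window
# versus training on sequences that are only 30% of that length, while
# position indices still cover the full window.
target_len = 128_000
train_len = int(0.3 * target_len)  # 38,400 tokens per training sequence

full_attn_cost = target_len ** 2   # O(L^2) attention term at 128k
short_attn_cost = train_len ** 2   # same term at 30% of the window

reduction = 1 - short_attn_cost / full_attn_cost
print(f"attention-term reduction: {reduction:.0%}")  # ~91%, in line with the reported >85% savings
```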