LongRoPE2: Near-Lossless LLM Context Window Scaling

Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang

2025-02-28

Summary

This paper introduces LongRoPE2, a new method that helps large language models (LLMs) understand and work with much longer pieces of text without losing their ability to handle shorter texts.

What's the problem?

Current LLMs are good at understanding short pieces of text, but they struggle with very long ones. When researchers try to stretch a model's context window beyond the length it was trained on, the positional information goes out of distribution, so the model gets confused and doesn't perform as well as it should.

What's the solution?

The researchers came up with LongRoPE2, which does three main things. First, they figured out why LLMs have trouble with long texts: the higher dimensions of the model's positional encoding (RoPE) don't get enough training, so long positions fall out of distribution. Second, they created a way to rescale these positional encodings, using an evolutionary search guided by 'needle-driven' perplexity, which measures how well the model retrieves a specific fact hidden deep inside a long text. Third, they trained the model on a mix of short and long context windows so it works well with both at the same time. They tested this on two different LLMs and found that it worked remarkably well, allowing one model to handle texts up to 128,000 tokens long while still being great at short texts.
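To make the rescaling idea concrete, here is a minimal sketch of how RoPE frequencies could be divided by per-dimension scale factors so that long positions stay in the range the model saw during training. This is an illustration, not the paper's implementation: the function names are invented here, and the scale factors are just placeholders for values the paper finds via evolutionary search.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE: one rotation frequency per pair of dimensions,
    # decaying geometrically from 1 down to base**(-1).
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def rescaled_frequencies(head_dim, scale_factors, base=10000.0):
    # LongRoPE-style rescaling: divide each frequency by a per-dimension
    # factor so rotations at long positions stay in-distribution.
    # In the paper these factors come from an evolutionary search;
    # here they are simply taken as given (hypothetical values).
    return rope_frequencies(head_dim, base) / np.asarray(scale_factors)

def rotate(x, position, freqs):
    # Apply the rotary embedding to one head vector at one position.
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out
```

With a uniform factor s, a token at position s*p under the rescaled frequencies gets the same rotation as a token at position p under the original ones; the paper's contribution is finding non-uniform, per-dimension factors that work better than a single global scale.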

Why it matters?

This matters because it could make AI language models much more useful for tasks that involve long documents, like analyzing books or long reports. It's also impressive because LongRoPE2 achieves this using far less training data than other methods, about 80 times fewer tokens than Meta's approach, which could make it easier and cheaper to create more capable AI models. This could lead to better AI assistants, more efficient research tools, and improved language understanding in various applications.

Abstract

LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
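The "needle-driven" perplexity mentioned in the abstract can be sketched as follows: rather than averaging loss over every token in a long document (which mostly rewards local fluency), score only the tokens of a short "needle" fact planted deep in the context, so the metric reflects genuine long-range retrieval. The function below is a hypothetical illustration of that idea, not the paper's code; the inputs are assumed to be per-token losses already produced by a model.

```python
import math

def needle_perplexity(token_losses, needle_positions):
    # token_losses: per-token negative log-likelihoods from the model
    # needle_positions: indices of the planted needle's tokens
    # Perplexity is exp of the mean loss, restricted to the needle.
    needle_losses = [token_losses[i] for i in needle_positions]
    return math.exp(sum(needle_losses) / len(needle_losses))
```

In the paper, a score like this guides the evolutionary search over RoPE rescaling factors: candidate factors that let the model recover the planted needle at long distances get lower (better) perplexity.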