Why Does the Effective Context Length of LLMs Fall Short?
Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong
2024-10-25

Summary
This paper investigates why large language models (LLMs) often fail to make full use of their context windows when processing long texts, despite advances that allow them to accept much longer inputs.
What's the problem?
Even though LLMs can be trained to work with very long inputs (such as 128,000 tokens), many open-source models effectively use only about half of that length. As a result, they may fail to gather and use important information from the more distant parts of a long text, leading to less accurate or incomplete responses.
What's the solution?
The authors identify that the issue arises from how these models were trained: small relative distances between tokens appear far more often during training than large ones, so the largest relative positions are undertrained. To fix this, they introduce a method called STRING (ShifTed Rotary position embeddING), which shifts well-trained position indices to overwrite the rarely trained ones at inference time. This adjustment helps the model better access and use distant information without any additional training. Their experiments show that STRING significantly improves the performance of models like Llama 3.1 and Qwen2 on tasks that require understanding long contexts.
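To make the shifting idea concrete, here is a minimal sketch of how a shifted relative-position matrix could be built. This is an illustration of the general idea rather than the authors' exact implementation; the function name and the `shift` and `local_window` parameters are assumptions chosen for the example.

```python
import numpy as np

def shifted_relative_positions(seq_len: int, shift: int, local_window: int) -> np.ndarray:
    """Illustrative shifted relative-position matrix (not the paper's exact code).

    Standard RoPE uses rel[i, j] = i - j, so the largest (rarely trained)
    positions appear when a query attends to very distant keys. Here, any
    relative position >= `shift` is overwritten with a smaller, frequently
    trained one, while a diagonal band of width `local_window` keeps the
    original positions so nearby tokens stay correctly ordered.
    """
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    rel = i - j                      # standard causal relative positions

    # Map undertrained distant positions back into the well-trained range,
    # restarting just past the protected local window.
    shifted = np.where(rel >= shift, rel - shift + local_window, rel)

    # Preserve the original positions inside the local window.
    out = np.where(rel < local_window, rel, shifted)

    # The upper triangle (rel < 0) is masked by causal attention anyway.
    return np.where(rel >= 0, out, 0)

# Example: a 12-token sequence where positions >= 8 are overwritten.
print(shifted_relative_positions(seq_len=12, shift=8, local_window=2))
```

After the shift, the largest relative position in this toy matrix is `seq_len - shift + local_window - 1` rather than `seq_len - 1`, so distant tokens are encoded with position values the model has actually seen often during training.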
Why it matters?
This research is important because it helps improve how AI models understand and generate text based on long inputs. By enhancing their ability to utilize their full context length, these models can provide more accurate answers and better support tasks that involve complex information, such as summarizing articles or answering detailed questions.
Abstract
Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs' pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama 3.1 70B and Qwen2 72B, by over 10 points on the popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.
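The left-skewed distribution the abstract refers to can be seen with a quick count: in causal attention over a sequence of length L, the relative position p = i - j occurs for exactly L - p query/key pairs, so small positions dominate and the largest ones are seen only a handful of times. A short sanity check (a toy illustration, not code from the paper):

```python
from collections import Counter

L = 8  # training length for the toy example

# Count how often each relative position i - j occurs under a causal mask.
counts = Counter(i - j for i in range(L) for j in range(i + 1))
print(sorted(counts.items()))
# [(0, 8), (1, 7), (2, 6), ..., (7, 1)] -- position 0 occurs L times,
# position L - 1 only once, so distant positions are barely trained.
```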