Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty

2024-09-27

Summary

This paper introduces GemFilter, a method that speeds up large language models (LLMs) on long inputs by using the model's own early layers to spot the small set of input tokens that actually matter for a query. By filtering out everything else before full processing, it can shrink the input by up to 1000x while maintaining answer quality.

What's the problem?

LLMs can now handle very long inputs, such as entire documents, but pushing every token through every layer is slow and consumes a lot of GPU memory. In most long-context queries, only a small fraction of the input is actually relevant to the question, yet standard attention pays the full computational cost for all of it. The challenge is to identify and keep only the relevant tokens without hurting the quality of the model's answers.

What's the solution?

The researchers observed that an LLM can already identify the input tokens relevant to a query in its early layers, before it begins generating an answer. Building on this insight, their algorithm, GemFilter, runs only the early layers of the model, uses the attention in a chosen "filter" layer to select the most relevant input tokens, and then runs the full model on just that compressed input. The method is training-free and applies broadly across different LLMs, and because the selected tokens can be read directly, humans can inspect exactly what the model kept. A rough sketch of the idea is given below.
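
Here is a minimal, hypothetical Python sketch of this two-pass idea, assuming a HuggingFace decoder-only model. It is not the authors' implementation (see their repository for that): the model name, the filter-layer index, and the number of kept tokens are illustrative assumptions, and a real implementation would stop the first pass at the filter layer rather than running the whole model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model choice
FILTER_LAYER = 13   # hypothetical early-layer index; the paper tunes this per model
KEEP_TOKENS = 1024  # length of the compressed context (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# Eager attention so that per-layer attention maps are actually returned.
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, attn_implementation="eager"
)

def filtered_generate(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Pass 1: forward the full input once and collect attention maps.
    # (For efficiency, one would run only the first FILTER_LAYER + 1 layers.)
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # Relevance signal: how much the final query token attends to each
    # input token at the early filter layer, averaged over heads.
    attn = out.attentions[FILTER_LAYER][0]   # (heads, seq, seq)
    scores = attn[:, -1, :].mean(dim=0)      # (seq,)
    # Keep the top-scoring tokens, restored to their original order.
    k = min(KEEP_TOKENS, scores.numel())
    keep = scores.topk(k).indices.sort().values
    compressed = inputs["input_ids"][:, keep]
    # Pass 2: run full generation on the much shorter compressed input.
    gen_ids = model.generate(compressed, max_new_tokens=max_new_tokens)
    return tokenizer.decode(gen_ids[0], skip_special_tokens=True)
```

The key design point is that the expensive full-depth computation only ever sees the compressed input; the full-length input is touched once, and only by the early layers.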

Why it matters?

This research matters because it makes long-context LLMs cheaper and faster to run: the authors report a 2.4x speedup and about a 30% reduction in GPU memory use compared to state-of-the-art methods such as SnapKV and H2O. Because the approach requires no additional training and reveals exactly which input tokens were selected, it is easy to apply to existing models and makes their behavior more interpretable. This could make long-document applications, such as question answering over large files, far more practical to deploy.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4x speedup and 30% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention and SnapKV, and demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at https://github.com/SalesforceAIResearch/GemFilter.
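
As a rough illustration of where the memory savings come from: the KV cache that dominates long-context GPU memory grows linearly with the number of tokens kept. The arithmetic below uses assumed shape numbers for a generic 7B-class model, not figures from the paper; the end-to-end saving the authors report (30%) is smaller than this raw ratio because the early filter layers still process the full input once.

```python
# Back-of-the-envelope KV-cache arithmetic (illustrative assumptions only).
LAYERS, HEADS, HEAD_DIM = 32, 32, 128  # assumed generic 7B-class model shape
BYTES = 2                              # fp16

def kv_cache_bytes(seq_len: int) -> int:
    # 2x for keys and values, stored per layer, per head, per token.
    return 2 * LAYERS * HEADS * HEAD_DIM * seq_len * BYTES

full_ctx, filtered_ctx = 120_000, 120  # ~1000x token reduction, as in the title
print(f"full context:     {kv_cache_bytes(full_ctx) / 2**30:6.2f} GiB")    # ~58.6 GiB
print(f"filtered context: {kv_cache_bytes(filtered_ctx) / 2**30:6.4f} GiB")  # ~0.06 GiB
```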