Scaling Laws for Linear Complexity Language Models
Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong
2024-06-25
Summary
This paper presents scaling laws for linear complexity language models, a class of models designed to process language more efficiently than standard transformers. The study examines how predictably these models improve as their parameter count and training data grow, compared to traditional transformer-based models.
What's the problem?
As large language models become more common, there's a need to understand how well these linear complexity models can scale up in size and performance. Many existing models are limited in their ability to handle larger datasets or more complex tasks, which raises questions about their effectiveness in real-world applications.
What's the solution?
The researchers examined three linear architectures: TNL, a linear attention model whose decay (how quickly older information fades from the model's state) is fixed and data-independent; HGRN2, a linear RNN whose decay adapts to the input data; and cosFormer2, a linear attention model with no decay at all. They compared these against LLaMA, a conventional transformer with softmax attention, training all models on a corpus of 300 billion tokens. By evaluating their performance across various downstream tasks, they found that linear complexity models scale similarly to traditional transformer models while also showing better linguistic proficiency and knowledge retention.
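The three decay regimes above can be illustrated with a minimal sketch of causal linear attention as a recurrence. This is a simplified toy, not the papers' actual implementations: real TNL, HGRN2, and cosFormer2 use multi-head, gated, and feature-mapped variants, and the function name and signature here are hypothetical.

```python
import numpy as np

def linear_attention(q, k, v, decay=None):
    """Causal linear attention via the recurrent state update
        S_t = g_t * S_{t-1} + k_t v_t^T,    o_t = S_t^T q_t.

    decay=None        -> no decay (g_t = 1), roughly the cosFormer2 regime
    decay=float       -> fixed, data-independent decay, roughly the TNL regime
    decay=array of T  -> per-token, data-dependent decay, roughly the HGRN2 regime
    (Illustrative simplifications only, not the papers' code.)
    """
    T, d = q.shape
    S = np.zeros((d, d))            # running key-value memory state
    out = np.empty_like(v)
    for t in range(T):
        if decay is None:
            g = 1.0                 # no decay: state accumulates forever
        elif np.isscalar(decay):
            g = decay               # data-independent: same fade every step
        else:
            g = decay[t]            # data-dependent: fade chosen per token
        S = g * S + np.outer(k[t], v[t])  # fade old state, add new key-value pair
        out[t] = S.T @ q[t]               # read out the state with the query
    return out
```

Because the state `S` is a fixed d-by-d matrix updated once per token, the cost is linear in sequence length, which is the efficiency property these architectures share.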
Why it matters?
This research is important because it helps establish a foundation for using linear complexity models in practical applications. By demonstrating that these models can perform well at larger scales, it opens up new possibilities for developing more efficient AI systems that can handle complex language tasks without requiring as much computational power.
Abstract
The interest in linear complexity models for large language models is on the rise, although their scaling capacity remains uncertain. In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behaviors of three efficient linear architectures. These include TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a baseline architecture for softmax attention for comparison. These models were trained with six variants, ranging from 70M to 7B parameters on a 300B-token corpus, and evaluated with a total of 1,376 intermediate checkpoints on various downstream tasks. These tasks include validation loss, commonsense reasoning, and information retrieval and generation. The study reveals that existing linear complexity language models exhibit similar scaling capabilities as conventional transformer-based models while also demonstrating superior linguistic proficiency and knowledge retention.