Taipan: Efficient and Expressive State Space Language Models with Selective Attention
Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen
2024-10-25

Summary
This paper introduces Taipan, a new type of language model designed to efficiently process long sequences of text while maintaining high performance in understanding and generating language.
What's the problem?
Large language models (LLMs) built on Transformers struggle with long texts because attention compares every token with every other token: training compute grows quadratically with input length, and the memory needed during inference grows linearly with the context. This makes very long sequences expensive to handle. Existing state space models manage memory more efficiently (constant memory during inference), but they often perform poorly when the task requires retrieving information from far back in the text.
What's the solution?
Taipan combines the strengths of the Mamba-2 state space model with Selective Attention Layers (SALs). SALs identify the tokens that genuinely need long-range interactions, filter out less important features, and then refine those tokens' representations with an attention module, while the remaining tokens flow through the efficient Mamba-2 pathway. By constraining this attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens without giving up Mamba's efficiency.
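To make the selective-attention idea concrete, below is a minimal PyTorch sketch. It is not the paper's implementation: the linear gate, the fixed top-k budget, and the way attention outputs are added back to the sequence are illustrative assumptions; only the high-level idea (score tokens, attend over a limited subset, augment those tokens' representations) follows the description above.

```python
# Minimal sketch of a selective-attention block (illustrative, not Taipan's exact design).
# Assumptions: a linear gate scores token importance, a fixed top-k "attention budget"
# selects tokens, and standard multi-head attention runs only over the selected tokens.
import torch
import torch.nn as nn


class SelectiveAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, budget: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)          # per-token importance score
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.budget = budget                       # max tokens given full attention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. hidden states from a Mamba-2 block
        scores = self.gate(x).squeeze(-1)          # (batch, seq_len)
        k = min(self.budget, x.size(1))
        top_idx = scores.topk(k, dim=-1).indices   # indices of "important" tokens
        top_idx, _ = top_idx.sort(dim=-1)          # keep original token order

        # Gather the selected tokens and attend only among them.
        idx = top_idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = x.gather(1, idx)                # (batch, k, d_model)
        attn_out, _ = self.attn(selected, selected, selected)

        # Add the attention-augmented representations back at the selected
        # positions; all other tokens pass through unchanged.
        return x.scatter_add(1, idx, attn_out)


# Usage: augment a 1,024-token sequence while attending over only 128 tokens.
if __name__ == "__main__":
    layer = SelectiveAttentionSketch(d_model=256, n_heads=4, budget=128)
    h = torch.randn(2, 1024, 256)
    print(layer(h).shape)  # torch.Size([2, 1024, 256])
```

Because attention is computed over at most `budget` tokens rather than the full sequence, the cost of this block stays roughly constant as the context grows, which is the efficiency argument behind constraining the attention budget.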
Why it matters?
This research is significant because it addresses a major limitation in how language models process long sequences, which is crucial for applications like summarizing articles, answering questions based on lengthy documents, and more. By improving the ability of models to understand and generate language over extended contexts, Taipan could enhance various fields such as education, content creation, and data analysis.
Abstract
Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.