Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
2024-06-14

Summary
This paper introduces Samba, a new model designed to efficiently handle very long sequences of data, like text, by combining two advanced techniques: Mamba, a selective State Space Model, and Sliding Window Attention.
What's the problem?
Modeling long sequences of data has been challenging because attention-based methods scale quadratically in computation with sequence length, while recurrent approaches struggle to generalize to inputs longer than those seen during training. This makes it hard for models to learn from and generate long texts effectively, which is important for many applications like chatbots and language translation.
What's the solution?
Samba solves this problem with a hybrid approach that combines the strengths of Mamba and Sliding Window Attention: the Mamba layers compress long-range information into recurrent hidden states, while the attention layers precisely recall important details from the recent context. The authors trained a 3.8-billion-parameter Samba model on 3.2 trillion tokens; although it was trained on sequences of only 4K tokens, it extrapolates to context lengths of 256K tokens with perfect memory recall and shows improved token predictions up to 1M tokens. It also processes long inputs much faster than comparable Transformer models, with up to a 3.73x throughput gain on 128K-token prompts.
Why it matters?
This research is important because it provides a more efficient way to work with long sequences in language modeling. By improving how AI models handle extensive data, Samba can enhance applications in natural language processing, making them faster and more effective. This could lead to better user experiences in technologies like virtual assistants and automated writing tools.
Abstract
Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either quadratic computation complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K-length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba.
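The abstract describes Samba as a layer-wise interleaving of Mamba (a selective SSM) with sliding-window attention. The sketch below is a minimal PyTorch illustration of that hybrid layering only, not the authors' implementation: `SimpleSSM` is a toy gated recurrence standing in for Mamba's selective scan, `SlidingWindowAttention` is ordinary multi-head attention restricted by a band mask, and the block ordering, MLPs, normalization, and dimensions are all simplifying assumptions for readability.

```python
# Minimal sketch of a Samba-style hybrid stack: a recurrent (SSM-like) layer
# interleaved with sliding-window attention. Illustrative assumptions only;
# the real Samba uses Mamba's selective scan and the configuration in the repo.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSM(nn.Module):
    """Toy stand-in for a selective SSM layer: a gated linear recurrence."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Linear(d_model, d_model)   # input-dependent "selection"
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (B, T, D)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay(x))           # per-token decay in (0, 1)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(x.size(1)):                 # sequential scan, O(T) time
            h = a[:, t] * h + (1 - a[:, t]) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1) * F.silu(gate))


class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to a fixed-size local window."""
    def __init__(self, d_model, n_heads, window):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x):                          # x: (B, T, D)
        T = x.size(1)
        i = torch.arange(T, device=x.device)
        # True = position may NOT be attended: future tokens or beyond the window.
        mask = (i[None, :] > i[:, None]) | (i[:, None] - i[None, :] >= self.window)
        y, _ = self.attn(x, x, x, attn_mask=mask)
        return y


class SambaStyleBlock(nn.Module):
    """One hybrid unit: SSM layer, then SWA layer, each with residual + MLP."""
    def __init__(self, d_model, n_heads, window):
        super().__init__()
        self.ssm = SimpleSSM(d_model)
        self.swa = SlidingWindowAttention(d_model, n_heads, window)
        self.mlp1 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model))
        self.mlp2 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x):
        x = x + self.ssm(self.norms[0](x))
        x = x + self.mlp1(self.norms[1](x))
        x = x + self.swa(self.norms[2](x))
        x = x + self.mlp2(self.norms[3](x))
        return x


if __name__ == "__main__":
    block = SambaStyleBlock(d_model=64, n_heads=4, window=16)
    x = torch.randn(2, 128, 64)                    # (batch, seq_len, dim)
    print(block(x).shape)                          # torch.Size([2, 128, 64])
```

The property the sketch tries to preserve is the division of labor the paper describes: the recurrent layer carries compressed long-range state in linear time, while the attention layer only ever looks back over a fixed window, so the overall cost stays linear in sequence length rather than quadratic.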