The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao

2024-08-28

Summary

This paper shows how to make large language models cheaper to run by distilling large pretrained Transformer models into more efficient hybrid linear RNN (Recurrent Neural Network) models without losing their effectiveness.

What's the problem?

Large Transformer models are powerful for language tasks but expensive to run, because the cost of attention (both compute and the key-value cache kept in memory) grows with the length of the sequence being generated. This makes them hard to deploy in real-world applications, especially on devices with limited resources, where inference efficiency matters.

What's the solution?

The authors propose a distillation method that transfers the knowledge of a large Transformer model into a hybrid model built on Mamba, a linear RNN architecture. They initialize the linear RNN layers by reusing the linear projection weights from the original model's attention layers, so the hybrid model, which keeps only a quarter of the attention layers, performs comparably on chat benchmarks while using fewer resources at inference time (see the sketch below). They also introduce a hardware-aware speculative decoding algorithm that speeds up how quickly Mamba and hybrid models can generate responses.
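To make the weight-reuse idea concrete, here is a minimal PyTorch-style sketch. It is only an illustration of the general principle (copying attention's linear projections into the corresponding projections of a new linear RNN layer); the class and attribute names (`AttentionBlock`, `MambaLayer`, `q_proj`, and so on) are hypothetical placeholders, not the paper's actual code, and the real Mamba layer has additional components this sketch omits.

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Stand-in for a pretrained attention block with standard Q/K/V/O projections."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)


class MambaLayer(nn.Module):
    """Simplified linear-RNN layer with projections playing roles analogous to Q/K/V/O."""

    def __init__(self, d_model: int):
        super().__init__()
        self.c_proj = nn.Linear(d_model, d_model, bias=False)    # plays a query-like role
        self.b_proj = nn.Linear(d_model, d_model, bias=False)    # plays a key-like role
        self.x_proj = nn.Linear(d_model, d_model, bias=False)    # plays a value-like role
        self.out_proj = nn.Linear(d_model, d_model, bias=False)  # output projection


def init_from_attention(mamba: MambaLayer, attn: AttentionBlock) -> None:
    """Copy the attention layer's linear projection weights into the new layer,
    so distillation starts from the pretrained weights instead of random init."""
    with torch.no_grad():
        mamba.c_proj.weight.copy_(attn.q_proj.weight)
        mamba.b_proj.weight.copy_(attn.k_proj.weight)
        mamba.x_proj.weight.copy_(attn.v_proj.weight)
        mamba.out_proj.weight.copy_(attn.o_proj.weight)


# Usage: replace an attention block with a linear-RNN layer initialized from it.
d_model = 64
attn = AttentionBlock(d_model)
mamba = MambaLayer(d_model)
init_from_attention(mamba, attn)
```

The point of the initialization is that the new layer starts out close to the behavior of the attention layer it replaces, so distillation can recover the original model's quality with far less training than training a linear RNN from scratch.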

Why it matters?

This research is important because it shows a way to make advanced language models more accessible and efficient, allowing them to be used in more applications without needing powerful hardware. By improving how these models work, we can enhance various technologies like chatbots and virtual assistants, making them faster and more responsive.

Abstract

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.
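For readers unfamiliar with speculative decoding, the sketch below shows a generic greedy variant of the idea: a cheap draft model proposes a few tokens, and the large target model verifies them in a single forward pass, keeping the longest agreeing prefix. This is only a simplified illustration of the general technique, not the paper's hardware-aware algorithm; `target` and `draft` are hypothetical callables that map a token sequence to logits, and the loop assumes batch size 1.

```python
import torch


@torch.no_grad()
def speculative_step(target, draft, tokens: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One step of greedy speculative decoding (simplified sketch, batch size 1)."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    drafted = tokens
    for _ in range(k):
        next_tok = draft(drafted)[:, -1].argmax(dim=-1, keepdim=True)
        drafted = torch.cat([drafted, next_tok], dim=-1)

    # 2. Verify all k candidates with a single forward pass of the target model.
    target_preds = target(drafted).argmax(dim=-1)  # target's next-token choice at each position

    # 3. Accept drafted tokens until the first disagreement, then take the
    #    target model's own token at that position.
    n_ctx = tokens.shape[1]
    accepted = tokens
    for i in range(k):
        proposed = drafted[:, n_ctx + i]
        verified = target_preds[:, n_ctx + i - 1]
        accepted = torch.cat([accepted, verified.unsqueeze(-1)], dim=-1)
        if not torch.equal(proposed, verified):
            break
    return accepted
```

The speedup comes from the fact that the expensive target model processes the drafted tokens in one pass instead of one at a time, while the accepted output is identical to what greedy decoding with the target model alone would produce.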