Hymba: A Hybrid-head Architecture for Small Language Models
Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
2024-11-22
Summary
This paper introduces Hymba, a family of small language models that combines two complementary sequence-processing mechanisms, attention and state space models, in parallel to improve both efficiency and performance.
What's the problem?
Existing small language models must trade off detailed information recall against efficient processing. They typically rely on either attention mechanisms, which excel at recalling specific details but are costly in memory and compute, or state space models (SSMs), which summarize context efficiently but can lose fine-grained details. Committing to one mechanism alone limits effectiveness across tasks.
What's the solution?
Hymba addresses this with a hybrid-head architecture that runs attention heads and SSM heads in parallel within each layer: the attention heads preserve high-resolution recall of specific tokens, while the SSM heads efficiently summarize the surrounding context. Hymba also introduces learnable meta tokens, prepended to every prompt, that store important information and relieve the attention heads of the burden of attending to unimportant tokens. Memory usage is further reduced through cross-layer key-value sharing and partial sliding window attention, shrinking the cache and making the model more efficient overall.
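The parallel-head idea can be sketched in a few lines of plain Python. This is a deliberately toy illustration on scalar sequences, not Hymba's actual implementation: `attention_head` does causal softmax attention, `ssm_head` is a simple linear recurrence standing in for a state space model, and `hybrid_head` averages the two outputs (the real model uses multi-dimensional embeddings, normalization, and learned fusion weights; the equal-weight averaging here is an assumption for clarity).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_head(x):
    # Causal self-attention over a scalar sequence: each position
    # attends to itself and all earlier positions (high-resolution recall).
    out = []
    for t in range(len(x)):
        scores = [x[t] * x[j] for j in range(t + 1)]
        w = softmax(scores)
        out.append(sum(w[j] * x[j] for j in range(t + 1)))
    return out

def ssm_head(x, a=0.9, b=0.1):
    # Linear recurrence h_t = a*h_{t-1} + b*x_t: a constant-memory
    # running summary of the context (the SSM role in Hymba).
    h, out = 0.0, []
    for xt in x:
        h = a * h + b * xt
        out.append(h)
    return out

def hybrid_head(x):
    # Parallel fusion: both heads see the same input, and their outputs
    # are combined (equal weighting here is an illustrative assumption).
    att, ssm = attention_head(x), ssm_head(x)
    return [(a_t + s_t) / 2 for a_t, s_t in zip(att, ssm)]
```

Note the key structural point: the attention head's cost grows with the sequence length, while the SSM head carries only a single running state, which is why mixing the two in parallel lets the model keep recall quality with a much smaller cache.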
Why it matters?
This research is significant because it enhances the capabilities of small language models, allowing them to perform better on a wider range of tasks while using less memory. By improving how these models process and recall information, Hymba can lead to advancements in AI applications such as chatbots, translation services, and other areas where understanding language is crucial.
Abstract
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.