Effective Distillation to Hybrid xLSTM Architectures
Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
2026-03-17
Summary
This paper focuses on making large language models, which are usually very computationally expensive, more efficient without losing their ability to perform well. The authors aim to create smaller, faster models that can match the performance of the larger, more complex ones.
What's the problem?
Currently, when researchers try to distill large language models built on 'quadratic attention' (whose compute cost grows with the square of the input length) into more manageable sub-quadratic forms, the resulting models consistently perform worse than the original, larger models on various tasks. It is difficult to create a smaller, faster model that retains all the knowledge and capabilities of its bigger counterpart.
What's the solution?
The researchers developed a new method for 'distilling' knowledge from a large language model into a smaller one built on a different architecture called xLSTM. Their key innovation is a merging stage, in which several separately distilled xLSTM 'experts' are combined into a single, more capable model. They tested this approach with models from the Llama, Qwen, and Olmo families and found that their distilled xLSTM models often performed almost as well as, and sometimes even better than, the original models.
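The merging stage can be pictured as combining matching parameters across the distilled experts. The paper's exact merging procedure is not described here, so the sketch below uses one common, hypothetical approach: a weighted average of each named parameter tensor across experts (the function name and uniform default weights are assumptions, not the authors' method).

```python
def merge_experts(expert_states, weights=None):
    """Combine several expert models into one by per-parameter weighted averaging.

    expert_states: list of dicts mapping parameter name -> list of floats.
    weights: optional per-expert mixing coefficients (defaults to uniform).
    This is an illustrative stand-in for the paper's merging stage.
    """
    n = len(expert_states)
    if weights is None:
        weights = [1.0 / n] * n  # assume equal contribution from each expert
    merged = {}
    for name in expert_states[0]:
        length = len(expert_states[0][name])
        merged[name] = [
            sum(w * state[name][i] for w, state in zip(weights, expert_states))
            for i in range(length)
        ]
    return merged
```

In a real pipeline the dicts would be framework state dicts of weight tensors rather than plain lists, and the mixing weights could be tuned per layer.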
Why it matters?
This work is important because it represents a significant step towards creating large language models that are more energy-efficient and cheaper to use. If we can build models that perform just as well with fewer resources, it will make this powerful technology more accessible and sustainable for a wider range of applications.
Abstract
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
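The abstract defines lossless distillation via tolerance-corrected Win-and-Tie rates between student and teacher. One plausible reading, sketched below as an assumption rather than the paper's exact definition, is the fraction of tasks on which the student's score is at least the teacher's score minus a small tolerance:

```python
def win_and_tie_rate(student_scores, teacher_scores, tolerance=0.01):
    """Fraction of tasks where the student matches or beats the teacher,
    up to an allowed tolerance.

    A task counts as a win-or-tie when student >= teacher - tolerance.
    This interpretation of 'tolerance-corrected' is an assumption.
    """
    assert len(student_scores) == len(teacher_scores)
    wins = sum(
        1 for s, t in zip(student_scores, teacher_scores) if s >= t - tolerance
    )
    return wins / len(student_scores)
```

Under this reading, a rate of 1.0 over a task set would correspond to the paper's notion of lossless distillation on that set.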