Liger: Linearizing Large Language Models to Gated Recurrent Structures
Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
2025-03-04
Summary
This paper introduces Liger, a new method for making large AI language models run faster and more efficiently by converting their structure into a recurrent form, without losing their ability to understand and generate language.
What's the problem?
Current AI language models are very powerful but can be slow and use a lot of memory. Faster alternatives based on linear recurrence exist, but training them from scratch is expensive and risky. Existing methods for converting pretrained models into these faster forms often add extra modules that need a lot of training, and they overlook the gating mechanisms that make the best linear recurrent models work well.
What's the solution?
The researchers created Liger, which converts existing Transformer-based models into a faster type called gated linear recurrent models. Instead of adding new components, Liger repurposes the original model's pretrained key projection weights to build the gating mechanisms that control how information flows through the recurrent state. It then uses lightweight fine-tuning with Low-Rank Adaptation (LoRA) to bring the converted model's performance back to near that of the original. The researchers also introduce Liger Attention, an intra-layer hybrid that mixes linear recurrent attention with standard softmax attention to further balance efficiency and quality.
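To make the core idea concrete, below is a minimal sketch (assumed PyTorch, not the authors' released code) of one gated linear recurrent step in which the gate is derived from the pretrained key projection rather than from a newly trained module. The sigmoid gate, per-channel decay, and tensor shapes are illustrative assumptions.

import torch

def gated_linear_recurrent_step(x_t, S_prev, W_q, W_k, W_v):
    # x_t:    (d_model,)        hidden state of the current token
    # S_prev: (d_head, d_head)  recurrent state carried from the previous token
    # W_q, W_k, W_v: (d_head, d_model) projection weights reused from the LLM
    q_t = W_q @ x_t                          # query (reused, not retrained)
    k_t = W_k @ x_t                          # key   (reused, not retrained)
    v_t = W_v @ x_t                          # value (reused, not retrained)

    # Gate built from the same key projection: no extra parameters introduced.
    g_t = torch.sigmoid(k_t)                 # per-channel forget gate in (0, 1)

    # Gated state update and constant-memory readout.
    S_t = g_t.unsqueeze(-1) * S_prev + torch.outer(k_t, v_t)
    o_t = S_t.T @ q_t                        # output for this token
    return o_t, S_t

Because the state S_t has a fixed size, memory during generation stays constant regardless of sequence length, which is the efficiency gain the paper targets.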
Why it matters?
This matters because it could make AI language models much faster and more efficient without losing their capabilities. This means AI could be used in more places, like on smartphones or other devices with limited processing power. It also shows a way to improve AI models without having to build them from scratch, which could save time and resources in AI development.
Abstract
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism that recovers 93% of the Transformer-based LLM's performance using only 0.02% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
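For readers curious how an intra-layer hybrid along the lines of Liger Attention might look in code, below is a hedged sketch that mixes a linear-attention branch (global context in linear time) with a sliding-window softmax branch (exact local attention) inside the same layer. The ELU-based feature map, window size, and mixing weight alpha are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def hybrid_attention(q, k, v, window=64, alpha=0.5):
    # q, k, v: (seq_len, d_head) for a single attention head.
    n, d = q.shape

    # Branch 1: causal linear attention with a non-negative feature map.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv_cum = torch.cumsum(phi_k.unsqueeze(-1) * v.unsqueeze(1), dim=0)  # (n, d, d)
    z_cum = torch.cumsum(phi_k, dim=0)                                  # (n, d)
    num = torch.einsum('nd,nde->ne', phi_q, kv_cum)
    den = (phi_q * z_cum).sum(-1, keepdim=True).clamp(min=1e-6)
    linear_out = num / den

    # Branch 2: causal sliding-window softmax attention (exact, local).
    idx = torch.arange(n)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)
    local = (idx.unsqueeze(1) - idx.unsqueeze(0)) < window
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~(causal & local), float('-inf'))
    window_out = torch.softmax(scores, dim=-1) @ v

    # Intra-layer mix of the global linear branch and the local softmax branch.
    return alpha * window_out + (1 - alpha) * linear_out

The design intuition is that the softmax branch preserves precise short-range modeling while the linear branch keeps long-range context at linear cost; how the two are combined in the actual method is specified in the paper and repository linked above.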