
Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu

2024-09-12


Summary

This paper talks about Gated Slot Attention (GSA), a new method designed to improve the efficiency of language models in processing sequences by enhancing how they manage memory.

What's the problem?

Current linear attention models, while faster and capable of parallel training, still struggle with recall-intensive tasks, i.e., tasks that require retrieving information seen earlier in the sequence. They also demand substantial resources to train from scratch, making them less practical for many applications.

What's the solution?

The authors propose GSA, which builds on an existing method called Attention with Bounded-Memory Control (ABC). GSA adds a gating mechanism that manages memory more effectively, allowing the model to focus on important information while keeping its recurrent state small. This improves both training speed and performance on tasks that require recalling information. Because GSA retains the softmax operation, it also makes it easier to fine-tune pre-trained Transformers into recurrent models without extensive retraining.
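To make the mechanism concrete, here is a minimal NumPy sketch of one recurrent (inference-time) step in the style the abstract describes: a small set of memory slots with a per-slot forget gate, read out through a softmax over the slots. The function name `gsa_step`, the shapes, and the sigmoid gate are illustrative assumptions, not the paper's exact implementation (which uses a hardware-efficient parallel training algorithm).

```python
import numpy as np

def gsa_step(q, k, v, alpha, K_tilde, V_tilde):
    """One recurrent step of a Gated Slot Attention-style update (sketch).

    q, k, v  : (d,) query/key/value for the current token
    alpha    : (m,) per-slot forget gate in (0, 1)
    K_tilde  : (m, d) slot-key memory
    V_tilde  : (m, d) slot-value memory
    """
    # Adaptive forgetting: each slot decays by its gate and writes the
    # new token's key/value with the complementary weight.
    K_tilde = alpha[:, None] * K_tilde + (1 - alpha)[:, None] * k
    V_tilde = alpha[:, None] * V_tilde + (1 - alpha)[:, None] * v

    # Context-aware memory reading: a softmax over the m slots links
    # the two gated-linear-attention layers.
    scores = K_tilde @ q                      # (m,) slot logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over slots
    o = V_tilde.T @ weights                   # (d,) output
    return o, K_tilde, V_tilde

# Toy usage: process a few tokens with m = 4 slots of dimension d = 8.
rng = np.random.default_rng(0)
d, m = 8, 4
K, V = np.zeros((m, d)), np.zeros((m, d))
for _ in range(5):
    q, k, v = rng.normal(size=(3, d))
    alpha = 1 / (1 + np.exp(-rng.normal(size=m)))  # sigmoid gate
    o, K, V = gsa_step(q, k, v, alpha, K, V)
```

The key efficiency point is visible in the shapes: the state is two small `(m, d)` matrices regardless of sequence length, so each step costs O(m·d) time and memory instead of growing with the context, while the gate `alpha` lets the model keep or overwrite each slot adaptively.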

Why it matters?

This research is important because it makes language models more efficient and capable of handling complex tasks that require remembering details over time. By improving how these models work, GSA can lead to better performance in applications like natural language processing, making AI systems more useful and accessible.

Abstract

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.