
Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, Ali Hatamizadeh

2024-12-10


Summary

This paper introduces Gated Delta Networks, a new type of model that improves how linear transformers work by combining two techniques: gating, which controls what the model keeps in memory, and the delta update rule, which makes precise changes to what is stored.

What's the problem?

Linear transformers are efficient alternatives to standard transformers, but they often struggle with tasks that require remembering long sequences or retrieving specific information from earlier in the input. They need better ways to manage memory: quickly forgetting irrelevant information while accurately updating the details that matter.

What's the solution?

The authors introduce Gated DeltaNet, which combines gating with the delta update rule. Gating lets the model rapidly erase memories it no longer needs, while the delta rule lets it make targeted updates to specific stored values, which helps on complex tasks like language modeling, reasoning, and retrieval. The authors also develop a parallel training algorithm so the gated delta rule runs efficiently on modern hardware, and they build hybrid versions that mix Gated DeltaNet layers with sliding window attention or Mamba2 layers for even better performance.
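To make the idea concrete, here is a minimal NumPy sketch of one recurrent step of a gated delta update, based only on the description above. The variable names, shapes, and the simple scalar gate are illustrative assumptions, not the authors' implementation (which uses a parallel, hardware-optimized form).

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One illustrative recurrent step of a gated delta update.

    S:     (d_k, d_v) associative memory matrix (the model's state)
    k:     (d_k,) key vector, assumed unit-norm
    v:     (d_v,) value vector
    alpha: scalar gate in [0, 1]  -- decays (erases) the old memory
    beta:  scalar in [0, 1]       -- strength of the delta-rule write
    """
    # Gating: uniformly decay everything stored so far.
    S = alpha * S
    # Delta rule: read the value currently stored under key k ...
    v_old = k @ S
    # ... and move it toward the new value v (a targeted update).
    return S + beta * np.outer(k, v - v_old)

# Tiny usage example on random data
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
for _ in range(5):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)
    v = rng.normal(size=d_v)
    S = gated_delta_step(S, k, v, alpha=0.9, beta=0.5)

# Querying the memory with a key returns the value estimate stored under it.
print(k @ S)
```

In this sketch, alpha near 0 wipes the whole memory at that step, while beta controls how strongly the value stored under key k is overwritten; the combination is what gives both fast erasure and precise edits.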

Why it matters?

This research is important because it improves how efficiently AI models can understand and process long inputs. By improving memory management, Gated Delta Networks can lead to better performance in applications such as natural language processing, where keeping track of context and retrieving the right information is crucial.

Abstract

Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
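Written out as an equation, one natural way to combine the two mechanisms the abstract describes is the following update (this exact form and its shape conventions are my reading of the abstract, not a quotation from the paper):

$$
S_t = \alpha_t\, S_{t-1}\!\left(I - \beta_t\, k_t k_t^{\top}\right) + \beta_t\, v_t k_t^{\top},
$$

where S_t is the memory state, k_t and v_t are the key and value at step t, and alpha_t in [0, 1] is the gate. Roughly speaking, setting alpha_t = 1 leaves a pure delta-rule update (DeltaNet-style), while dropping the delta correction term leaves a gated outer-product update similar to Mamba2's.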