Forgetting Transformer: Softmax Attention with a Forget Gate

Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville

2025-03-09

Summary

This paper introduces a new type of AI model called the Forgetting Transformer (FoX), which adds a 'forget gate' to regular Transformer models to help them handle long pieces of text more efficiently.

What's the problem?

Regular Transformer models, which are used in many AI language tasks, struggle with processing very long texts because they have to attend to everything they've seen before. This can make them slow and use up a lot of computer memory.

What's the solution?

The researchers created FoX by adding a forget gate to the Transformer's attention mechanism. This gate helps the model decide which parts of the text are important to remember and which can be fade away. They also made sure FoX works with existing fast attention algorithms (like FlashAttention) and doesn't need special position markers. Additionally, they created a 'Pro' block design that borrows ideas from recurrent AI models and makes both FoX and regular Transformers work even better.

Why it matters?

This matters because it could make AI language models much better at handling long texts, like entire books or long conversations, without slowing down or needing huge amounts of memory. This could lead to more efficient and capable AI assistants, better translation of long documents, and improved performance on many language-related tasks.

Abstract

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
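The abstract's key idea, down-weighting unnormalized attention scores in a data-dependent way, can be illustrated with a toy sketch. The version below is a minimal single-head NumPy illustration, not the paper's FlashAttention-compatible implementation: it assumes each position has a forget-gate value `f[t]` in (0, 1) (in the actual model this would be a learned sigmoid of the input), and down-weights the score between query `i` and key `j` by the product of the gates between them, applied in log space before the softmax.

```python
import numpy as np

def forgetting_attention(q, k, v, f):
    """Toy sketch of Forgetting Attention (single head, causal).

    q, k, v: (T, d) query/key/value matrices.
    f: (T,) forget-gate values in (0, 1); in the real model these
       would be data-dependent (e.g. a sigmoid of the input).
    The score for query i attending to key j (j <= i) is down-weighted
    by f[j+1] * ... * f[i], added as a bias in log space.
    """
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)                # unnormalized scores
    c = np.cumsum(np.log(f))                     # c[i] = sum_{l<=i} log f[l]
    bias = c[:, None] - c[None, :]               # sum_{l=j+1}^{i} log f[l]
    mask = np.tril(np.ones((T, T), dtype=bool))  # causal mask: j <= i
    logits = np.where(mask, logits + bias, -np.inf)
    # row-wise softmax over keys
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With all gates equal to 1 the bias vanishes and this reduces to standard causal softmax attention; gates below 1 make distant tokens contribute less, which is the "forgetting" behavior, and because the bias replaces explicit position information, no positional embeddings are needed.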