
MoM: Linear Sequence Modeling with Mixture-of-Memories

Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng

2025-02-20


Summary

This paper introduces Mixture-of-Memories (MoM), a new way to make AI models better at remembering and processing long sequences of information. It's like giving a computer multiple notebooks to write in instead of just one, so it can keep track of more information without getting confused.

What's the problem?

Current AI models that process sequences of data, like text or time series, often struggle with remembering important information from earlier in the sequence. They try to squeeze all the information into one fixed-size memory, which can lead to forgetting or mixing up details, especially in tasks that require recalling specific information from a long time ago.

What's the solution?

The researchers created MoM, which uses multiple separate memory states instead of just one. It's inspired by how the human brain manages memories. MoM has a special 'router' that decides which memory to store each piece of information in, kind of like organizing notes into different folders. This helps the AI remember more and avoid mixing up information. Even though it uses multiple memories, MoM is still efficient to train and use.
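To make the routing idea concrete, here is a minimal NumPy sketch of a Mixture-of-Memories-style step. All names, shapes, and the top-1 routing rule are illustrative assumptions, not the paper's exact design: a router scores each token, the token's key/value outer product is written into only the selected memory, and a query later reads from the memory the router picks.

```python
# Illustrative sketch of Mixture-of-Memories routing (assumed design, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

d = 8        # hidden size (assumed)
n_mem = 4    # number of independent memory states (assumed)
seq_len = 6

# Router: a linear layer scoring each token against the memories (assumed).
W_router = rng.normal(size=(d, n_mem))
# Key/value projections, as in linear attention (assumed).
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

# Each memory is its own d x d matrix updated by an outer-product rule.
memories = np.zeros((n_mem, d, d))

tokens = rng.normal(size=(seq_len, d))

for x in tokens:
    scores = x @ W_router
    m = int(np.argmax(scores))        # route this token to its top-1 memory
    k, v = x @ W_k, x @ W_v
    memories[m] += np.outer(k, v)     # write only into the chosen memory

# Reading: a query recalls from the memory the router selects for it.
q = tokens[-1]
m = int(np.argmax(q @ W_router))
output = q @ memories[m]
print(output.shape)  # (8,)
```

Because each token touches only one (or a few) of the memories, tokens stored in different memories can't overwrite each other, which is the "avoiding interference" intuition in the analogy above.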

Why it matters?

This matters because it could make AI much better at tasks that require remembering and using information from long sequences, like understanding long documents or conversations. MoM performs better than other similar models and even comes close to more complex models in some tasks. This could lead to more efficient and capable AI systems for things like language translation, summarization, or any task where remembering context over long periods is important.


Abstract

Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain linear complexity during training and constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
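The "constant complexity during inference" claim in the abstract comes from the recurrent form of linear attention: the memory is a fixed-size matrix updated once per token, so the state never grows with sequence length (unlike a Transformer's key-value cache). A toy illustration, with assumed shapes and update rule:

```python
# Toy illustration (assumed, not the paper's code) of a linear-recurrent memory:
# the state stays one fixed d x d matrix no matter how many tokens are streamed.
import numpy as np

rng = np.random.default_rng(1)
d = 4
state = np.zeros((d, d))  # fixed-size memory state

for t in range(1000):     # stream 1000 tokens
    k = rng.normal(size=d)
    v = rng.normal(size=d)
    state += np.outer(k, v)   # O(d^2) update per token, independent of t

print(state.shape)  # still (4, 4) after any number of tokens
```

MoM keeps this property because each of its memories is updated by the same kind of constant-cost rule; using several of them multiplies the state size by a constant, not by the sequence length.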