UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao
2025-08-27
Summary
This paper introduces UltraMemV2, a new way to build powerful AI models that need far less memory access during inference. It is an improvement on previous 'memory-layer' designs and aims to compete with the popular 'Mixture of Experts' approach.
What's the problem?
Current large AI models, especially those using 'Mixture of Experts', perform very well but require a lot of memory access during inference, which slows things down. An alternative, the 'memory-layer' architecture, needs far less memory access, but earlier versions like UltraMem only matched smaller 2-expert Mixture of Experts models and fell well short of the state-of-the-art 8-expert configurations, particularly on complex tasks.
What's the solution?
The researchers redesigned the memory-layer architecture with several key changes: they added memory layers to every transformer block, simplified value expansion to a single linear projection, borrowed FFN-style value processing from a related design called PEER, initialized the model's parameters in a principled way, and rebalanced how much computation happens in the memory layers versus the regular feed-forward layers. Together, these changes make up UltraMemV2; a rough sketch of the core idea appears below.
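The paper itself does not include code, but a minimal sketch can make the ingredients concrete. The module below is an illustrative, PEER-style memory layer in PyTorch: a token's query scores a large key table, the top-k slots are retrieved, and each retrieved value is a tiny single-neuron FFN whose outputs are gated and summed. The class name, dimensions, and the plain top-k retrieval are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayerSketch(nn.Module):
    """Illustrative memory layer: retrieve top-k 'value' slots per token and
    process them with tiny FFN-style experts (PEER-like). Not the authors' code."""

    def __init__(self, d_model=1024, num_values=65536, top_k=16):
        super().__init__()
        self.top_k = top_k
        # Query projection and a large table of keys (one key per value slot).
        self.query = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(num_values, d_model) * d_model ** -0.5)
        # PEER-style single-neuron experts: each slot holds a (down, up) pair,
        # so "value expansion" is just a linear projection per retrieved slot.
        self.down = nn.Embedding(num_values, d_model)
        self.up = nn.Embedding(num_values, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        q = self.query(x)                       # (B, S, D)
        scores = q @ self.keys.t()              # (B, S, num_values)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gate = F.softmax(top_scores, dim=-1)    # (B, S, k)
        w_down = self.down(top_idx)             # (B, S, k, D)
        w_up = self.up(top_idx)                 # (B, S, k, D)
        # Each retrieved expert: scalar activation, then an up-projection.
        h = F.gelu((x.unsqueeze(-2) * w_down).sum(-1))    # (B, S, k)
        out = ((gate * h).unsqueeze(-1) * w_up).sum(-2)   # (B, S, D)
        return out
```

In UltraMemV2, a layer like this would sit alongside the dense FFN in every transformer block, with the memory-to-FFN compute ratio rebalanced as described above.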
Why it matters?
This work is important because it shows that memory-layer architectures can now perform on par with leading Mixture of Experts models while requiring far less memory access. That means faster, more efficient large models, especially for tasks that involve remembering long contexts or learning from in-context examples. It also suggests that how much of the model is activated per token (activation density) matters more than simply how many sparse parameters are available in total.
Abstract
While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very low memory access, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters out of 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
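As a rough illustration of the headline configuration: if activation density is read as the fraction of total parameters activated per token (an assumption; the paper may define it differently), the numbers quoted in the abstract work out to about 2%.

```python
# Assuming activation density = activated parameters / total parameters
# (the paper's exact definition may differ); figures taken from the abstract.
activated = 2.5e9      # 2.5B activated parameters
total = 120e9          # 120B total parameters
density = activated / total
print(f"activation density ~ {density:.3%}")   # ~ 2.083%
```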