Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun

2025-07-15

Summary

This paper introduces Mixture-of-Recursions (MoR), a new kind of transformer that lets the model decide how much processing each part of the input needs. Instead of a fixed stack of distinct layers, MoR shares one set of layers and applies it a different number of times to different tokens.

What's the problem?

Traditional transformers give every part of the input the same amount of processing. That is wasteful: simple tokens don't need much computation, while complex ones benefit from deeper processing, so treating them uniformly costs extra compute and memory.

What's the solution?

The researchers designed MoR to selectively reuse the same stack of layers, applying it a different number of times to each token depending on that token's complexity. A router inside the model decides whether a token should go through the shared layers again for deeper understanding or stop early to save resources. This cuts memory use, speeds up processing, and improves performance without making the model bigger.
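
To make the routing idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: the class name `MixtureOfRecursions`, the one-shot sigmoid router, the 0.5 gating threshold, and `max_depth=3` are all illustrative assumptions, and the paper's actual routing strategies are more refined.

```python
import torch
import torch.nn as nn

class MixtureOfRecursions(nn.Module):
    """Sketch of the MoR idea: one parameter-shared block applied up to
    max_depth times, with a learned router deciding per token whether
    to keep recursing."""

    def __init__(self, d_model: int = 512, max_depth: int = 3):
        super().__init__()
        # A single transformer block whose weights are reused at every depth.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        # Router produces one gate score per token per recursion step.
        self.router = nn.Linear(d_model, max_depth)
        self.max_depth = max_depth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        gates = torch.sigmoid(self.router(x))  # (batch, seq_len, max_depth)
        h = x
        for depth in range(self.max_depth):
            gate = gates[..., depth : depth + 1]  # (batch, seq_len, 1)
            # Hard 0.5 threshold for readability; real training would need
            # a soft or straight-through gate to stay differentiable.
            active = (gate > 0.5).float()
            updated = self.shared_block(h)        # same weights every pass
            # Active tokens take another recursion; the rest pass through.
            h = active * updated + (1.0 - active) * h
        return h

# Example: route a batch of 2 sequences of 16 tokens through the model.
out = MixtureOfRecursions(d_model=512)(torch.randn(2, 16, 512))
```

The per-token gating mask is what lets easy tokens exit after one or two passes while hard tokens keep reusing the shared weights, which is where the compute and memory savings come from.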

Why it matters?

This matters because MoR makes language models far more deliberate about where they spend their computation: smaller models can match the performance of bigger ones while using less compute and memory, which helps make AI faster and more accessible.

Abstract

Mixture-of-Recursions (MoR) combines parameter sharing and adaptive computation in a Recursive Transformer to improve efficiency, reduce memory usage, and enhance performance across different model scales.