Scaling Latent Reasoning via Looped Language Models

Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou

2025-10-30

Summary

This paper introduces Ouro, a new type of language model that improves reasoning abilities by changing *how* it learns, rather than just making it bigger. It's a step away from models that 'think' by generating text step-by-step after they've already been trained.

What's the problem?

Current large language models (LLMs) mostly rely on techniques like 'chain-of-thought' where they generate text to show their reasoning. This happens *after* the model is initially trained and doesn't fully utilize all the information the model learned during its original training phase. Essentially, they're good at remembering facts, but not necessarily at skillfully *using* those facts to solve problems.

What's the solution?

The researchers created 'LoopLM,' which builds reasoning directly into the pre-training process. Instead of reasoning in words, LoopLM repeatedly refines its internal representations in a hidden 'latent space.' They also developed an entropy-regularized training objective that lets the model learn how 'deep' to think on each input, and trained on a massive amount of text: 7.7 trillion tokens. The resulting models, at just 1.4B and 2.6B parameters, match much larger models (up to 12B parameters) on a wide range of benchmarks.
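The core idea above can be sketched in a few lines: one shared block is applied in a loop, and a learned distribution over loop steps decides how deep to think, with an entropy term to keep that distribution from collapsing. This is a minimal illustrative sketch, not the paper's actual architecture; the `shared_block` stand-in, the `w_exit` head, and all names here are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# One shared set of weights, reused at every loop step (the "looped" part):
# real LoopLM reuses a full transformer stack, not a single linear map.
W = rng.normal(0, 0.1, (d_model, d_model))
# Hypothetical exit head scoring "stop reasoning here" at each step.
w_exit = rng.normal(0, 0.1, d_model)

def shared_block(h):
    # Stand-in for a transformer block: residual update with a nonlinearity.
    return h + np.tanh(h @ W)

def looped_forward(h, max_loops=4):
    """Apply the same block repeatedly, collecting one exit logit per step."""
    exit_logits = []
    for _ in range(max_loops):
        h = shared_block(h)
        exit_logits.append(h @ w_exit)
    # Softmax over steps = a learned distribution over reasoning depth.
    logits = np.array(exit_logits)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return h, p

def entropy(p):
    # Entropy regularizer (sketch): rewarding high entropy discourages the
    # model from always committing to one fixed depth during training.
    return -(p * np.log(p)).sum()

h0 = rng.normal(0, 1, d_model)
h_final, depth_probs = looped_forward(h0)
depth_entropy = entropy(depth_probs)
```

In this toy version, `depth_probs` plays the role of the learned depth allocation: at inference time one could sample or threshold it to stop looping early on easy inputs, trading compute for accuracy without generating any text.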

Why it matters?

This work is important because it shows a new path for improving LLMs. Instead of just scaling up model size or adding reasoning steps *after* training, Ouro demonstrates that building reasoning into the core learning process can lead to more effective knowledge manipulation and better overall performance. It suggests that focusing on *how* models learn, not just *how much* they learn, is key to creating truly intelligent systems.

Abstract

Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance, matching the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our models can be found at: http://ouro-llm.github.io.