
Parallel Test-Time Scaling for Latent Reasoning Models

Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li

2025-10-13


Summary

This paper explores how to get better answers from advanced language models that don't write out their step-by-step thinking in words, but instead reason internally with continuous numerical representations. The idea is to run several reasoning attempts in parallel at inference time and pick the best one, making these models more reliable on complex problems.

What's the problem?

Large language models are powerful but often stumble on difficult problems. A common remedy, called 'test-time scaling,' samples multiple lines of reasoning in parallel and then combines the results. This technique, however, relies on generating explicit text-based reasoning steps. Newer 'latent reasoning' models instead represent intermediate reasoning as continuous vectors, which is more efficient, but it has been unclear whether test-time scaling can work for them at all: there is no natural way to sample multiple distinct 'attempts' in this continuous space, and no probabilistic signal for deciding which attempt is best.
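For token-based models, the 'combine the results' step is often plain majority voting over the final answers of the sampled chains (self-consistency style). A minimal sketch of that baseline, which latent models cannot use directly because their intermediate reasoning never surfaces as comparable text:

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate parallel samples by picking the most frequent final answer."""
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner

# e.g. five sampled reasoning chains ending in these answers:
print(majority_vote(["42", "41", "42", "42", "7"]))  # prints 42
```

This is a generic illustration of parallel test-time scaling in token space, not the paper's method; the paper's contribution is replacing both the sampling and this voting step for continuous latent spaces.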

What's the solution?

The researchers developed a way to apply test-time scaling to these 'latent reasoning' models. They introduce two stochastic strategies for creating multiple reasoning attempts: Monte Carlo Dropout, which randomly deactivates parts of the network during inference, and Additive Gaussian Noise, which perturbs the latent computations with random noise. To decide which attempt is best, they train a 'Latent Reward Model' (LatentRM) with a step-wise contrastive objective, teaching it to score each latent reasoning step so that promising trajectories can be selected and guided.
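The two sampling strategies can be sketched in a toy numpy model. Everything here (the single-matrix `latent_step`, the dropout rate, the noise scale) is an illustrative stand-in, not the paper's architecture; the point is only that injecting randomness at inference time turns one deterministic latent trajectory into many distinct ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_step(h, W):
    """One deterministic latent reasoning step (toy stand-in for the model)."""
    return np.tanh(W @ h)

def mc_dropout_step(h, W, p=0.2, rng=rng):
    """Monte Carlo Dropout: randomly zero hidden units at inference time,
    rescaling the survivors, so repeated runs diverge."""
    mask = rng.random(h.shape) > p
    return latent_step(h * mask / (1 - p), W)

def gaussian_noise_step(h, W, sigma=0.05, rng=rng):
    """Additive Gaussian Noise: perturb the latent state before each step."""
    return latent_step(h + rng.normal(0.0, sigma, size=h.shape), W)

def sample_trajectories(h0, W, n, step_fn, depth=4):
    """Draw n stochastic latent trajectories, each a list of latent states."""
    trajs = []
    for _ in range(n):
        h, states = h0, [h0]
        for _ in range(depth):
            h = step_fn(h, W)
            states.append(h)
        trajs.append(states)
    return trajs
```

Running `sample_trajectories` with either step function yields trajectories that start from the same latent state but end in different places, which is exactly the population of parallel 'attempts' that the reward model then has to rank.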

Why it matters?

This work is important because it shows that test-time scaling isn't limited to models that think in words: latent reasoning models can also trade extra inference compute for better answers. By providing both sampling mechanisms and a reward model for continuous reasoning spaces, it opens a new direction for scalable inference that could yield significant performance gains across a wide range of tasks.

Abstract

Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.
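One plausible reading of the abstract's 'step-wise contrastive objective' is a softmax contrast at each step index: the latent state from a correct trajectory should outscore the same-step states from incorrect ones. The linear scorer, the function names, and the best-of-N selection rule below are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def score(h, w):
    """Toy LatentRM: a linear scorer over one latent step (illustrative only)."""
    return float(w @ h)

def stepwise_contrastive_loss(w, pos_steps, neg_steps):
    """At each step t, softmax cross-entropy pushing the positive
    trajectory's state above every negative trajectory's state."""
    total = 0.0
    for t, h_pos in enumerate(pos_steps):
        logits = np.array([score(h_pos, w)] +
                          [score(neg[t], w) for neg in neg_steps])
        z = np.exp(logits - logits.max())          # stable softmax
        total += -np.log(z[0] / z.sum())           # positive sits at index 0
    return total / len(pos_steps)

def select_best(trajectories, w):
    """Best-of-N aggregation: keep the trajectory whose final step scores highest."""
    return max(trajectories, key=lambda steps: score(steps[-1], w))
```

A scorer trained this way can then drive the aggregation in the sampling pipeline: draw N stochastic latent trajectories, score each, and answer from the top-ranked one.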