TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen
2025-10-21
Summary
This paper focuses on improving how large language models (LLMs) make decisions, specifically when they need to think through a problem step-by-step. It introduces a new method for choosing the best reasoning path out of many possibilities generated by the LLM.
What's the problem?
Currently, a common technique called 'test-time scaling' helps LLMs by letting them generate multiple solutions and then picking the best one. However, this is expensive because it requires another model to judge the quality of each step in the reasoning process. It also fails to fully utilize the information already contained *within* the LLM itself – the latent representations (hidden states) it produces while reasoning.
What's the solution?
The researchers developed a system called TrajSelector. Instead of using a separate, complex model to evaluate each step, TrajSelector uses a smaller, more efficient 'verifier' model (with relatively few parameters) to score the reasoning steps *as they are being generated* by the main LLM. This verifier looks at the LLM’s internal workings – its 'hidden states' – to assess the quality of each step and then combines those scores to pick the best overall reasoning path. Importantly, this system learns directly from data without needing humans to label every single step of the reasoning process.
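The selection loop described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the verifier here is a fixed linear probe with made-up weights standing in for the trained 0.6B verifier, the hidden states are random vectors standing in for the sampler LLM's activations, and mean aggregation of step scores is an assumption.

```python
import math
import random

random.seed(0)

HIDDEN_DIM = 8  # toy stand-in for the sampler LLM's hidden size

# Hypothetical probe weights; a real verifier is a small trained model.
PROBE_WEIGHTS = [0.3, -0.1, 0.2, 0.05, -0.2, 0.1, 0.15, -0.05]

def verifier_score(step_hidden):
    """Map one step's hidden-state vector to a quality score in (0, 1)."""
    z = sum(w * h for w, h in zip(PROBE_WEIGHTS, step_hidden))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

def trajectory_score(step_hiddens):
    """Aggregate per-step scores into one trajectory score
    (mean here; the aggregation rule is an assumption)."""
    scores = [verifier_score(h) for h in step_hiddens]
    return sum(scores) / len(scores)

def best_of_n(trajectories):
    """Return the index of the trajectory with the highest score."""
    return max(range(len(trajectories)),
               key=lambda i: trajectory_score(trajectories[i]))

# Simulate N = 4 sampled trajectories, each a list of reasoning steps,
# where each step is represented by its hidden-state vector.
trajs = [[[random.gauss(0, 1) for _ in range(HIDDEN_DIM)]
          for _ in range(random.randint(2, 5))]
         for _ in range(4)]
print("selected trajectory:", best_of_n(trajs))
```

The key design point is that the verifier never re-reads the generated text; it scores the hidden states the sampler already computed, which is what keeps the extra inference cost low.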
Why it matters?
This work is important because it makes LLMs more accurate and efficient at complex reasoning tasks. TrajSelector achieves better results than existing methods, often by a significant margin, while also being cheaper to run. This means we can get more reliable answers from LLMs without needing massive amounts of computing power, making them more practical for real-world applications.
Abstract
Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates step-wise trajectory quality, then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.