A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava
2025-02-06
Summary
This paper introduces a way to make large language models (LLMs) perform better without making them bigger. It adapts a method called particle-based Monte Carlo to guide how LLMs reason through and solve problems, with especially strong gains on mathematical reasoning tasks.
What's the problem?
Making LLMs bigger and training them on more data is yielding diminishing returns. Existing methods that spend extra computation at inference time treat the task as a search guided by a reward model; because reward models are imperfect approximations, the search can be "tricked" into confidently wrong answers, a failure mode known as reward hacking.
What's the solution?
The researchers instead treat the problem as probabilistic inference. They adapt particle-based Monte Carlo methods (particle filtering): many candidate solutions ("particles") are extended step by step in parallel, each partial solution is weighted by a reward model, and the population is periodically resampled so that promising partial solutions are duplicated and weak ones are dropped. Sampling this way lets the LLM explore a broad set of plausible solutions rather than committing early to whatever the imperfect reward model scores highest.
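To make the propose-weight-resample loop concrete, here is a minimal sketch of generic particle filtering with toy stand-ins for the LLM and reward model. The `propose_step` and `reward` functions below are hypothetical placeholders (a real system would sample a reasoning step from an LLM and score it with a process reward model); this is an illustration of the technique, not the paper's implementation.

```python
import math
import random

def particle_filter(propose_step, reward, n_particles=8, n_steps=4, seed=0):
    """Particle-filtering sketch for step-by-step generation.

    propose_step(trajectory, rng) -> next step (stand-in for LLM sampling)
    reward(trajectory) -> positive score (stand-in for a reward model)
    """
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    for _ in range(n_steps):
        # Extend each partial solution by one proposed step.
        particles = [p + [propose_step(p, rng)] for p in particles]
        # Weight each partial solution by its (approximate) reward.
        weights = [reward(p) for p in particles]
        total = sum(weights)
        # Resample: promising partial solutions get duplicated, weak ones
        # dropped -- sampling the distribution instead of greedily
        # committing to the single highest-reward candidate.
        particles = rng.choices(particles,
                                weights=[w / total for w in weights],
                                k=n_particles)
    # Return the highest-reward completed trajectory.
    return max(particles, key=reward)

# Toy problem: build a sequence of numbers whose sum is close to a target.
target = 10.0

def propose(traj, rng):
    return rng.uniform(0.0, 4.0)

def reward(traj):
    return math.exp(-abs(sum(traj) - target) / 4.0)

best = particle_filter(propose, reward, n_particles=32, n_steps=5, seed=1)
```

Even with a noisy proposal, resampling steers the population toward high-reward trajectories, which is the same mechanism the paper applies to partial LLM reasoning traces.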
Why does it matter?
This matters because it shows that spending extra computation at inference time, rather than training ever-larger models, can close much of the performance gap: with this method, small models match or surpass much larger ones, which could make capable AI more accessible and reduce its environmental impact. It also connects inference-time scaling to the rich literature on probabilistic inference, which could lead to more robust AI systems in the future.
Abstract
Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate than our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1-level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information are available at https://probabilistic-inference-scaling.github.io.