LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou

2024-10-08

Summary

This paper introduces LLaMA-Berry, a new framework designed to improve the mathematical reasoning abilities of large language models (LLMs) by combining search and self-refinement techniques to optimize problem-solving paths.

What's the problem?

Large language models often struggle with complex mathematical reasoning tasks, especially at high levels like Olympiad problems. Current methods typically rely on step-wise or greedy search strategies that can be inefficient and fail to explore the solution space thoroughly, leading to subpar performance on difficult math problems.

What's the solution?

To address these challenges, the authors developed LLaMA-Berry, which combines Monte Carlo Tree Search (MCTS) with a method called Self-Refine, in which the model critiques and rewrites its own solutions (together called SR-MCTS). This lets the model explore different reasoning paths more efficiently than step-wise or greedy search. Candidate solutions are then evaluated with a pairwise reward model: rather than scoring each solution in isolation, it compares solutions against one another, so LLaMA-Berry can better identify the most promising paths to pursue, as sketched in the code below. The framework was tested on various advanced math benchmarks and showed significant improvements over existing methods.
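To make the search-plus-refinement loop concrete, here is a minimal, hypothetical sketch of how an SR-MCTS-style procedure could look. The functions `refine` (self-critique and rewrite) and `score` (reward model) are placeholders for LLM calls and are not the authors' actual interfaces; this is an illustration of the general technique, not the paper's implementation.

```python
import math

class Node:
    """One node in the search tree; each node holds a complete candidate solution."""
    def __init__(self, solution, parent=None):
        self.solution = solution
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated reward

def uct(node, c=1.4):
    # Upper Confidence bound for Trees: trade off exploitation vs. exploration.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def sr_mcts(problem, refine, score, iterations=32):
    # Start from an initial complete answer produced by the model.
    root = Node(solution=refine(problem, draft=None))
    for _ in range(iterations):
        # 1. Selection: descend to a leaf by UCT.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: self-refine the leaf's solution into a new candidate.
        child = Node(refine(problem, draft=node.solution), parent=node)
        node.children.append(child)
        # 3. Evaluation: score the new solution with the reward model.
        reward = score(problem, child.solution)
        # 4. Backpropagation: propagate the reward up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # Return the most-visited candidate as the final answer.
    best = max(root.children, key=lambda n: n.visits, default=root)
    return best.solution
```

The key difference from step-wise search is that every node is a full solution, and expansion means rewriting that solution rather than appending one more reasoning step.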

Why it matters?

This research is important because it enhances the capabilities of language models in solving complex mathematical problems, which can have applications in education and automated problem-solving tools. By improving how these models reason and find solutions, LLaMA-Berry could help students learn math more effectively or assist professionals in fields that require advanced mathematical skills.

Abstract

This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critic and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. A Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24 and AMC23.
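To illustrate how pairwise preferences can be turned into a global ranking, here is a simplified sketch. The `prefers(a, b)` callable stands in for the PPRM and is assumed to return the probability that solution `a` is better than `b`; the tally below is a plain Borda-style count, shown only to convey the aggregation idea, not the paper's exact Enhanced Borda Count procedure.

```python
from itertools import combinations

def rank_solutions(solutions, prefers):
    """Aggregate pairwise preferences into a global ranking (Borda-style sketch)."""
    wins = {s: 0 for s in solutions}
    for a, b in combinations(solutions, 2):
        p = prefers(a, b)        # pairwise preference probability in [0, 1]
        if p >= 0.5:
            wins[a] += 1         # a beats b in this pairing
        else:
            wins[b] += 1
    # More pairwise wins means a higher global rank.
    return sorted(solutions, key=lambda s: wins[s], reverse=True)
```

The advantage of ranking by comparisons rather than absolute scores is that the reward model never has to produce a calibrated number for a single solution, only a judgment about which of two solutions is better.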