Building Math Agents with Multi-Turn Iterative Preference Learning
Wei Xiong, Chengshuai Shi, Jiaming Shen, Aviv Rosenberg, Zhen Qin, Daniele Calandriello, Misha Khalman, Rishabh Joshi, Bilal Piot, Mohammad Saleh, Chi Jin, Tong Zhang, Tianqi Liu
2024-09-06

Summary
This paper introduces a method for improving how language models solve math problems using a technique called multi-turn iterative preference learning.
What's the problem?
While large language models (LLMs) have become good at solving math problems, they often rely on external tools, such as code interpreters, to strengthen their reasoning. Current training approaches mainly focus on generating synthetic data and supervised fine-tuning, and they don't fully address the complexities of problems that require multiple reasoning steps and tool interactions.
What's the solution?
The authors introduce a framework for direct preference learning designed specifically for multi-turn interactions. The model learns from preference feedback over entire multi-turn trajectories, allowing it to better integrate tools like code interpreters while solving math problems. They implemented this framework with two specific methods, multi-turn DPO and multi-turn KTO, and trained and evaluated models using an augmented prompt set built from the GSM8K and MATH datasets. The results showed significant improvements in the models' performance on these benchmarks.
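To give a feel for what a trajectory-level preference loss looks like in practice, here is a minimal sketch, not the authors' implementation: per-token log-probabilities are summed over the model's own tokens across all turns, while prompt tokens and code-interpreter outputs are masked out, and the resulting trajectory log-ratios feed the usual DPO logistic loss. The function names, tensor shapes, masking convention, and `beta` default are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(logits, labels, action_mask):
    # logits: [batch, seq_len, vocab]; labels: [batch, seq_len] token ids;
    # action_mask: [batch, seq_len], 1.0 on tokens the model generated
    # (reasoning text and code), 0.0 on the prompt and on code-interpreter
    # observations. Assumes logits[t] already predicts labels[t].
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * action_mask).sum(dim=-1)


def multi_turn_dpo_loss(policy_chosen, policy_rejected,
                        ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO logistic loss applied to trajectory-level log-probability
    # ratios: each argument is a [batch] tensor of summed log-probs (from
    # sequence_logprob) for the preferred / dispreferred trajectory.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

The key design choice reflected here is that code-interpreter outputs are treated as external feedback from the environment rather than model actions, so they are excluded from the log-ratio terms.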
Why it matters?
This research is important because it enhances the ability of language models to solve complex math problems more effectively. Improving how these models learn from multi-turn interactions could lead to better educational tools and applications that help students understand math concepts more easily.
Abstract
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms are originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model's performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
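To make "optimizes trajectory-level preferences" concrete, the following is a hedged sketch of what a multi-turn DPO objective of this kind looks like: the usual DPO log-sigmoid loss, but with log-probability ratios summed over the model's actions (reasoning steps and code) at every turn of the preferred and dispreferred trajectories, conditioned on histories that include the code interpreter's outputs, which contribute no ratio terms of their own. The notation is an assumption for illustration, not the paper's exact formulation.

```latex
% Sketch of a trajectory-level (multi-turn) DPO objective:
% \tau^{w}, \tau^{l} are the preferred / dispreferred trajectories,
% a_h is the model's action at turn h, s_h the history (including
% code-interpreter observations), and \pi_{\mathrm{ref}} the SFT reference policy.
\mathcal{L}(\theta) = -\,\mathbb{E}_{(\tau^{w},\,\tau^{l})\sim\mathcal{D}}
\log \sigma\!\left(
  \beta \sum_{h=1}^{H} \log
    \frac{\pi_\theta\!\left(a_h^{w}\mid s_h^{w}\right)}
         {\pi_{\mathrm{ref}}\!\left(a_h^{w}\mid s_h^{w}\right)}
  \;-\;
  \beta \sum_{h=1}^{H} \log
    \frac{\pi_\theta\!\left(a_h^{l}\mid s_h^{l}\right)}
         {\pi_{\mathrm{ref}}\!\left(a_h^{l}\mid s_h^{l}\right)}
\right)
```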