
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan

2024-07-02


Summary

This paper introduces Step-Controlled DPO (SCDPO), a method that improves how large language models (LLMs) understand and solve mathematical problems. It uses stepwise error information to strengthen the reasoning abilities of these models.

What's the problem?

Large language models are often used to solve complex math problems, but they can struggle to reason through the steps needed to arrive at the correct answer. Traditional training methods may not effectively help them learn from their mistakes, especially when it comes to pinpointing where their reasoning went wrong.

What's the solution?

To address this issue, the authors developed SCDPO, which automatically generates examples of reasoning that starts going wrong at a specific step of the problem-solving process. By including these 'negative samples' during training, the model learns not only from correct answers but also from the errors made along the way. This helps the model sharpen its reasoning and produce more accurate solutions. The authors tested SCDPO on several models and found that it consistently outperformed standard DPO, achieving high scores on benchmarks like GSM8K and MATH.
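
To make the idea more concrete, here is a minimal Python sketch of how such step-controlled negative samples could be produced. It is an illustration based only on the description above, not the authors' released code: the helper names (`split_steps`, `extract_answer`, `make_negative`) and the `sample_fn` completion interface are assumptions, and a real pipeline would need more robust step splitting and answer checking.

```python
import re

def split_steps(solution: str) -> list[str]:
    """Split a chain-of-thought solution into reasoning steps (here: lines)."""
    return [s for s in solution.split("\n") if s.strip()]

def extract_answer(solution: str) -> str | None:
    """Pull the final numeric answer out of a solution (toy heuristic)."""
    numbers = re.findall(r"-?\d+\.?\d*", solution)
    return numbers[-1] if numbers else None

def make_negative(question: str, correct_solution: str, error_step: int,
                  sample_fn, num_tries: int = 8, temperature: float = 1.0):
    """Return a negative solution whose errors begin at `error_step`,
    or None if no wrong-answer continuation was found.

    `sample_fn(prompt, temperature)` is an assumed interface to the model
    being trained; any LLM completion call would do here.
    """
    steps = split_steps(correct_solution)
    prefix = "\n".join(steps[:error_step])     # keep the correct steps up to the chosen point
    gold = extract_answer(correct_solution)

    for _ in range(num_tries):
        continuation = sample_fn(f"{question}\n{prefix}\n", temperature)
        candidate = f"{prefix}\n{continuation}"
        if extract_answer(candidate) != gold:  # keep only continuations that end in a wrong answer
            return candidate                   # this becomes the 'rejected' sample
    return None
```

Each (question, correct solution, negative) triple then becomes a (prompt, chosen, rejected) preference pair for DPO training, with the error location controlled by `error_step`.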

Why it matters?

This research is important because it enhances how AI models learn to reason mathematically, making them more effective at solving complex problems. By focusing on stepwise errors, SCDPO can help create more reliable AI systems that can assist with educational tools, tutoring, and other applications where accurate mathematical reasoning is crucial.

Abstract

Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method.
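
As background for the DPO training step mentioned above: the abstract describes SCDPO as supplying step-controlled (chosen, rejected) pairs for DPO, whose standard objective (from the original DPO paper, not specific to this work) compares the policy's preference for the chosen solution over the negative relative to a frozen reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here x is the math problem, y_w a correct solution, y_l a negative whose errors begin at the specified step, pi_ref the SFT model, and beta a scaling hyperparameter; SCDPO's contribution lies in how the step-controlled pairs fed into this kind of objective are constructed.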