Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
Yihe Deng, Paul Mineiro
2024-10-30

Summary
This paper introduces Flow-DPO, a new method that improves the mathematical reasoning of Large Language Models (LLMs) by having multiple component models collaborate to build solutions and learn online from the quality of the reasoning traces they produce.
What's the problem?
Generating detailed and accurate reasoning steps for math problems is a big challenge for LLMs. While these models can produce answers, they often struggle to show the logical steps that lead to those answers, which is crucial for understanding and verification.
What's the solution?
To tackle this issue, the authors developed Flow-DPO, in which several component LLMs work together to construct a solution incrementally, communicating and building on each other's outputs in real time to refine the reasoning. The Flow is trained with online Direct Preference Optimization (DPO): for each training example, rollouts are used to generate preference pairs of better and worse reasoning chunks, and the models are updated immediately. This produces higher-quality reasoning traces than direct inference from a single model relying only on its initial capabilities.
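To make the idea concrete, here is a minimal sketch of how preference pairs could be generated from rollouts in an incremental flow. This is not the authors' code: the callable answer_llm (which returns the next reasoning chunk for a problem and partial trace), the "Final answer:" completion convention, and the helper names are all illustrative assumptions.

```python
import random

def is_done(trace: str) -> bool:
    # Illustrative convention: the model marks completion with "Final answer:".
    return "Final answer:" in trace

def is_correct(trace: str, gold_answer: str) -> bool:
    # Naive check of the trace's final answer against the gold answer.
    return is_done(trace) and trace.rsplit("Final answer:", 1)[1].strip() == str(gold_answer)

def rollout(answer_llm, problem, prefix, max_chunks=10):
    """Complete a partial reasoning trace by repeatedly sampling next chunks."""
    trace = prefix
    for _ in range(max_chunks):
        if is_done(trace):
            break
        trace += answer_llm(problem, trace)
    return trace

def generate_dpo_pairs(problem, gold_answer, answer_llm, num_candidates=4, max_steps=10):
    """At each step of the incremental flow, sample candidate next chunks,
    roll each one out to a full solution, and pair a chunk whose rollout is
    correct with one whose rollout is wrong to form a DPO preference pair."""
    prefix, dpo_pairs = "", []
    for _ in range(max_steps):
        candidates = [answer_llm(problem, prefix) for _ in range(num_candidates)]
        scored = [(c, is_correct(rollout(answer_llm, problem, prefix + c), gold_answer))
                  for c in candidates]
        good = [c for c, ok in scored if ok]
        bad = [c for c, ok in scored if not ok]
        if good and bad:
            dpo_pairs.append({"prompt": problem + "\n" + prefix,
                              "chosen": random.choice(good),
                              "rejected": random.choice(bad)})
        prefix += (good or candidates)[0]  # extend the trace, preferring good chunks
        if is_done(prefix):
            break
    return dpo_pairs
```

In an online setup, the pairs collected for each training example would be used to update the models immediately before moving to the next example, rather than being accumulated into a fixed offline dataset.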
Why it matters?
This research is important because it shows a new way to enhance the reasoning abilities of AI models, making them more reliable for tasks that require mathematical understanding. By improving how LLMs explain their reasoning, Flow-DPO could help in educational settings and other areas where clear problem-solving processes are essential.
Abstract
Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.
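For reference, the online learning described here builds on the standard DPO objective (Rafailov et al., 2023); the exact variant used in the paper may differ in details. For a prompt x with a preferred trace y_w and a rejected trace y_l, DPO minimizes

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],

where \pi_\theta is the policy being trained, \pi_{\text{ref}} is a frozen reference model, \beta controls the strength of the preference margin, and \sigma is the logistic function.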