Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
2025-01-24

Summary
This paper introduces Step-KTO, a new way to train AI models to solve math problems more reliably. It's like teaching a computer to show its work, not just give the final answer.
What's the problem?
Current AI models are getting good at solving math problems, but they are usually trained and judged only on whether the final answer is right, not on whether the reasoning behind it holds up. It's like a student who always gets the correct answer but can't explain their reasoning. This makes it hard to tell whether the AI really understands the problem or is just guessing.
What's the solution?
The researchers created Step-KTO, which is like a special training program for AI. It teaches the AI to break math problems into steps and checks whether each step makes sense, not just whether the final answer is right. It does this by giving the AI binary feedback (correct or incorrect) on both its intermediate reasoning steps and its final answer. It's like a teacher who checks your work at each step, not just the final result.
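To make the feedback scheme concrete, here is a minimal Python sketch of how a single sampled solution might be labeled at both levels. This is an illustration under stated assumptions, not the paper's implementation: `judge_step` stands in for whatever process-level checker (for example, a learned process reward model or an automated verifier) decides whether a step is sound, and the exact-match answer check is a simplification.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeedbackExample:
    problem: str
    steps: list[str]          # the model's solution, one reasoning step per entry
    step_labels: list[bool]   # process-level feedback: is each step sound?
    final_correct: bool       # outcome-level feedback: does the answer match?

def collect_feedback(
    problem: str,
    steps: list[str],
    final_answer: str,
    gold_answer: str,
    judge_step: Callable[[str, list[str]], bool],  # hypothetical step checker
) -> FeedbackExample:
    """Attach binary feedback at two granularities: each intermediate step
    is judged in the context of the steps before it (process level), and
    the final answer is checked against the reference (outcome level)."""
    step_labels = [judge_step(problem, steps[: i + 1]) for i in range(len(steps))]
    final_correct = final_answer.strip() == gold_answer.strip()
    return FeedbackExample(problem, steps, step_labels, final_correct)
```

Keeping both signals binary (each step and the final answer is simply correct or incorrect) is what lets the method reuse a KTO-style objective rather than requiring graded reward scores.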
Why does it matter?
This matters because it could make AI much more trustworthy and useful for solving complex problems. If an AI can show its work and explain its reasoning, just like a good student, we can better understand how it thinks and where it might make mistakes. This could be really helpful in fields like science, engineering, or finance, where understanding the process is just as important as getting the right answer. It's a big step towards creating AI that doesn't just memorize answers but actually understands and can explain complex ideas.
Abstract
Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
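The abstract names the ingredients (process-level and outcome-level binary feedback combined in a KTO-style framework) without spelling out the objective, so the following Python sketch shows one plausible shape it could take. It borrows the value function from Kahneman-Tversky Optimization, where the implied reward is the policy-to-reference log-probability ratio; `beta`, `ref_point`, the per-step averaging, and the 50/50 blend of the two feedback levels are illustrative assumptions, not the paper's reported choices (KTO's actual reference point is estimated from a KL term rather than fixed).

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_value(logp_policy: float, logp_ref: float, desirable: bool,
              beta: float = 0.1, ref_point: float = 0.0) -> float:
    """KTO-style value function. The implied reward is the beta-scaled
    log-ratio of policy to reference probabilities; samples judged correct
    are pushed above the reference point, incorrect ones below it."""
    reward = beta * (logp_policy - logp_ref)
    if desirable:
        return sigmoid(reward - ref_point)
    return sigmoid(ref_point - reward)

def step_kto_loss(step_logps: list[tuple[float, float]],
                  step_labels: list[bool],
                  answer_logps: tuple[float, float],
                  answer_label: bool,
                  process_weight: float = 0.5) -> float:
    """Blend process-level values (one per labeled step) with the
    outcome-level value for the final answer. Lower loss means the policy
    assigns more probability than the reference to steps and answers
    judged correct, and less to those judged incorrect."""
    process = sum(
        kto_value(lp, lr, ok) for (lp, lr), ok in zip(step_logps, step_labels)
    ) / max(len(step_labels), 1)
    outcome = kto_value(*answer_logps, answer_label)
    return -(process_weight * process + (1 - process_weight) * outcome)

# Toy call: two labeled steps (second judged wrong) and an incorrect answer.
loss = step_kto_loss(
    step_logps=[(-1.2, -1.5), (-2.0, -1.1)],
    step_labels=[True, False],
    answer_logps=(-0.8, -0.9),
    answer_label=False,
)
```

In a real training loop the log-probabilities would come from the current policy and a frozen reference model, and this loss would be minimized over batches of solutions labeled as in the feedback-collection sketch above.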