Thinking Preference Optimization
Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han
2025-02-20
Summary
This paper introduces ThinkPO (Thinking Preference Optimization), a new method for improving how AI models reason through complex problems. It's designed to make AI think more deeply and give longer, more detailed answers, especially after the models have already been trained on high-quality examples.
What's the problem?
Current methods for improving AI reasoning, like Supervised Fine-Tuning (SFT), eventually hit a limit. Getting new high-quality training data is expensive and hard, and just repeating the training on the same data doesn't keep improving the AI's performance. It's like trying to teach a student to write better essays, but running out of good example essays to show them.
What's the solution?
The researchers created ThinkPO, which works after the initial training. Instead of needing new long, complex reasoning examples, it pairs readily available short, simple answers (treated as "rejected" responses) with long, detailed reasoning (treated as "chosen" responses) for the same question, then applies direct preference optimization so the model learns to favor the longer reasoning. It's like teaching a student to expand on their ideas by showing them both short and long answers, and helping them understand why the longer, more detailed answers are better.
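The core mechanic, preferring long chain-of-thought over short chain-of-thought via direct preference optimization, can be sketched as below. This is a minimal illustration, not the authors' implementation: the example question, the "chosen"/"rejected" texts, and the log-probability values are made-up toy numbers, and a real run would use per-token log-probabilities from a policy model and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    The loss shrinks as the policy assigns relatively more probability
    (versus the frozen reference model) to the chosen response than to
    the rejected one.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-beta * (chosen_margin - rejected_margin)))

# A ThinkPO-style preference pair: long CoT is "chosen", short answer "rejected".
# (Toy texts; in the paper these come from existing reasoning datasets.)
pair = {
    "prompt": "What is 17 * 24?",
    "chosen": "Break it down: 17*24 = 17*20 + 17*4 = 340 + 68 = 408.",
    "rejected": "408.",
}

# Toy sequence log-probabilities (hypothetical, not real model outputs):
loss = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-6.0,
                ref_chosen_logp=-12.0, ref_rejected_logp=-5.5, beta=0.1)
```

Minimizing this loss pushes the policy to raise the probability of the long, detailed response relative to the short one, which is why the fine-tuned model's outputs grow longer without any new long-CoT training data.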
Why does it matter?
This matters because it can make AI systems better at solving complex problems and explaining their reasoning, without needing lots of new, expensive training data. By improving how AI thinks through problems, ThinkPO could lead to more capable AI assistants in fields like math, science, or any area that requires detailed reasoning. It's a step towards making AI not just give answers, but explain its thinking process in a way that's more helpful and understandable to humans.
Abstract
Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning abilities, we can either collect new high-quality long CoT reasoning SFT data or repeatedly train on existing SFT datasets. However, acquiring new long CoT SFT data is costly and limited, while repeated training often results in a performance plateau or decline. To further boost the performance with the SFT data, we propose Thinking Preference Optimization (ThinkPO), a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization to encourage the model to favor longer reasoning outputs. Experiments show that ThinkPO further improves the reasoning performance of SFT-ed models, e.g. it increases math reasoning accuracy of SFT-ed models by 8.6% and output length by 25.9%. Notably, ThinkPO is capable of continually boosting the performance of the publicly distilled SFT model, e.g., increasing the official DeepSeek-R1-Distill-Qwen-7B's performance on MATH500 from 87.4% to 91.2%.