CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
2025-08-25
Summary
This paper focuses on improving the reasoning abilities of large language models (LLMs), which are crucial for many applications. It introduces a new method to fine-tune these models, making them better at complex problem-solving.
What's the problem?
Currently, LLMs are often trained using two main methods: supervised fine-tuning and reinforcement learning. Supervised fine-tuning, while helpful, doesn't always allow the model to generalize well to new situations. Reinforcement learning can improve reasoning, but it's often unstable and can lead to the model performing poorly. Existing methods also either don't fully utilize helpful 'Chain-of-Thought' examples, or they rely too heavily on them, potentially limiting the model's potential.
What's the solution?
The researchers developed a new technique called Contrastive learning with annotated CoT-based Reinforced Fine-Tuning (CARFT). Essentially, they teach the model to build a compact representation, a kind of 'understanding', of each Chain-of-Thought reasoning path. They then use this representation to derive contrastive signals that guide the model during fine-tuning, encouraging it to learn from the annotated examples while also exploring other plausible reasoning paths. This approach stabilizes the training and allows the model to make better use of the available data.
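To make the core idea concrete, here is a minimal sketch of what a contrastive signal over CoT representations could look like. This is an illustrative reconstruction, not the paper's actual implementation: the pooling scheme, the InfoNCE-style loss, and the function names (`cot_embedding`, `contrastive_signal`) are assumptions, and the temperature `tau` is a hypothetical hyperparameter.

```python
import numpy as np

def cot_embedding(hidden_states: np.ndarray) -> np.ndarray:
    # Pool per-token hidden states into a single CoT representation
    # (mean pooling is an assumption; any pooling would fit the sketch).
    v = hidden_states.mean(axis=0)
    return v / np.linalg.norm(v)

def contrastive_signal(sampled: np.ndarray,
                       annotated: np.ndarray,
                       negatives: list[np.ndarray],
                       tau: float = 0.1) -> float:
    # InfoNCE-style loss: pull a sampled reasoning path toward the
    # annotated CoT representation, push it away from unrelated CoTs.
    sims = np.array([sampled @ annotated]
                    + [sampled @ n for n in negatives]) / tau
    sims -= sims.max()  # numerical stability
    return float(-(sims[0] - np.log(np.exp(sims).sum())))
```

A sampled path whose representation is close to the annotated CoT yields a lower loss than a dissimilar one, which is the kind of guidance signal the summary describes being added on top of the RL fine-tuning objective.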
Why it matters?
This research is important because it addresses key weaknesses in how we currently train LLMs. By making the training process more stable and allowing the models to better utilize reasoning examples, this new method leads to significant improvements in performance, robustness, and efficiency. This means LLMs can become more reliable and effective tools for a wider range of tasks.
Abstract
Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.