
GCPO: When Contrast Fails, Go Gold

Hao Wu, Wei Liu

2025-10-10

Summary

This paper introduces a new way to train large language models to improve their reasoning skills, specifically focusing on making smaller models better at complex tasks.

What's the problem?

Current methods for improving language model reasoning, such as Group Relative Policy Optimization (GRPO), have a blind spot: they score each of the model's attempts relative to the others in the same group, so when every attempt is wrong (or every attempt is right) the relative rewards are identical and the learning signal vanishes. In other words, the model is guided only by its own attempts; if those are all flawed it has nothing to imitate, and if they are all correct it learns nothing new.
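To make that failure mode concrete, here is a minimal, illustrative sketch of GRPO-style group-relative advantages (not the paper's implementation): each rollout's reward is normalized against the group mean, so a group where every reward is identical produces all-zero advantages and contributes nothing to the update.

```python
# Illustrative sketch of GRPO-style group-relative advantages for one prompt.
# `rewards` holds per-rollout correctness scores for that prompt's group.

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its group (reward minus group
    mean, divided by group standard deviation)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed group: some rollouts right, some wrong -> useful, nonzero signal.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]

# Degenerate groups: all wrong or all right -> every advantage is zero,
# so the sample is effectively wasted.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```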

What's the solution?

The researchers developed a technique called Group Contrastive Policy Optimization (GCPO). This method brings in external, known-correct reference answers as a guide when the model struggles. If none of the model's attempts solve a problem, GCPO injects the reference answer so the update is pushed in an unambiguously correct direction. This way every training sample contributes to learning, even when the model's own attempts are all wrong, and the model learns *how* to solve problems by imitating worked solutions.
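A rough sketch of that idea follows, under the assumption that rewards are binary correctness scores and that the gold answer simply replaces one failed rollout in the group; the names `inject_gold_answer` and `reference_answer` are illustrative placeholders, not the authors' API.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages, as in the GRPO sketch above."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def inject_gold_answer(rollouts, rewards, reference_answer):
    """If no rollout earned any reward, swap one for the known-correct answer."""
    if max(rewards) == 0.0:
        rollouts = rollouts[:-1] + [reference_answer]  # replace one failed sample
        rewards = rewards[:-1] + [1.0]                 # the gold answer scores 1
    return rollouts, rewards

# Four failed rollouts for one problem: under plain GRPO this group is wasted.
rollouts = ["attempt A", "attempt B", "attempt C", "attempt D"]
rewards = [0.0, 0.0, 0.0, 0.0]
reference_answer = "gold step-by-step solution"

rollouts, rewards = inject_gold_answer(rollouts, rewards, reference_answer)
print(group_relative_advantages(rewards))
# The injected gold answer gets a positive advantage and the failed attempts get
# negative ones, so the policy is nudged toward the reference solution instead
# of receiving no learning signal at all.
```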

Why it matters?

This is important because it makes training more efficient: no learning opportunity is wasted. More significantly, it helps smaller language models develop stronger reasoning by letting them learn from a reliable source of correct answers and imitate effective problem-solving strategies, ultimately improving their performance across a range of reasoning tasks.

Abstract

Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model's rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: https://github.com/AchoWu/GCPO.