A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
2025-04-16
Summary
This paper presents a simpler way to train large language models to reason: a method called Reinforce-Rej, a lightweight variant of the policy gradient algorithm that changes which training samples the model learns from.
What's the problem?
The problem is that current reinforcement learning methods for fine-tuning large language models, such as GRPO and PPO, can be sample-inefficient and unstable when teaching models to reason. They often fail to distinguish which sampled responses actually carry a useful learning signal, which slows training and can make the resulting models less reliable.
What's the solution?
The researchers introduced Reinforce-Rej, a minimal extension of the policy gradient method that adds a sample-filtering step. This step discards prompts whose sampled responses are either all incorrect or all correct, since neither case provides a useful learning signal, so the model trains only on prompts with mixed outcomes. This makes training both more efficient and more stable, and it yields better results than the older methods.
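The filtering idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and the batch format (each prompt paired with a list of 0/1 correctness rewards for its sampled responses) are assumptions made for the example.

```python
def filter_prompts(batch):
    """Keep only prompts with mixed outcomes: drop those whose sampled
    responses are all correct (reward 1) or all incorrect (reward 0),
    since neither extreme carries a useful learning signal."""
    kept = []
    for prompt, rewards in batch:
        # A mixed prompt has at least one correct and one incorrect sample.
        if 0 < sum(rewards) < len(rewards):
            kept.append((prompt, rewards))
    return kept

# Illustrative batch: 4 sampled responses per prompt, scored 0/1.
batch = [
    ("p1", [1, 1, 1, 1]),  # all correct  -> rejected
    ("p2", [0, 0, 0, 0]),  # all incorrect -> rejected
    ("p3", [1, 0, 1, 0]),  # mixed        -> kept
]
print(filter_prompts(batch))  # [('p3', [1, 0, 1, 0])]
```

The surviving mixed-outcome samples would then feed a standard policy gradient update, which is what makes the method a "minimal extension" rather than a new algorithm.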
Why it matters?
This matters because it makes reasoning-focused training for large language models more efficient and reliable. By keeping the training process simple and concentrating it on the most informative samples, Reinforce-Rej can help build AI systems that learn faster and make fewer mistakes, which is important for real-world applications.
Abstract
Reinforce-Rej, a minimal extension of policy gradient, outperforms GRPO and PPO in fine-tuning large language models by filtering out prompts whose samples are entirely incorrect or entirely correct, improving efficiency and stability.