
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo

2025-02-21


Summary

This paper introduces Logic-RL, a way to make AI language models better at logical reasoning using a method called rule-based reinforcement learning. It's like teaching a computer to solve puzzles by giving it practice and rewarding good thinking.

What's the problem?

AI language models are really good at many tasks, but they often struggle with complex logical reasoning. It's like having a super-smart friend who can memorize lots of facts but has trouble solving tricky puzzles or math problems that require step-by-step thinking.

What's the solution?

The researchers created Logic-RL, which trains AI models on synthetic logic puzzles. They made the AI focus on showing its work, not just giving answers, and they set up a strict reward system that penalizes shortcuts and only gives full credit for careful, step-by-step reasoning that reaches the right answer. By practicing on these puzzles, the AI learned advanced reasoning skills like double-checking its work and summarizing its thoughts.
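
This page doesn't include code, but here is a minimal sketch of what a rule-based reward like this could look like, assuming a DeepSeek-R1-style <think>/<answer> output format (the tag names, score values, and function below are illustrative assumptions, not the authors' actual implementation):

    import re

    # Illustrative rule-based reward: the exact rules, tags, and scores
    # used by Logic-RL are assumptions here and may differ.
    THINK_ANSWER = re.compile(
        r"^<think>(?P<think>.+?)</think>\s*<answer>(?P<answer>.+?)</answer>\s*$",
        re.DOTALL,
    )

    def rule_based_reward(output: str, gold_answer: str) -> float:
        """Score one model output with simple, automatically checkable rules."""
        match = THINK_ANSWER.match(output.strip())
        if match is None:
            return -1.0  # format rule: penalize skipping the thinking step
        predicted = match.group("answer").strip().lower()
        if predicted == gold_answer.strip().lower():
            return 1.0   # answer rule: full reward only for the verified solution
        return -0.5      # right format, wrong answer

    # Example with a Knights-and-Knaves style puzzle answer:
    sample = ("<think>If A were lying, A's statement would be false, "
              "which contradicts B...</think><answer>A is a knight</answer>")
    print(rule_based_reward(sample, "A is a knight"))  # prints 1.0

Because both rules can be checked automatically, the reward needs no human grading, which is what makes training on thousands of puzzles practical.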

Why it matters?

This matters because it could make AI much better at solving complex problems in fields like math, science, and engineering. The AI learned these skills from just a few thousand practice puzzles and could even solve tough math problems it hadn't seen before. This approach could lead to smarter AI assistants that can help with more difficult tasks and explain their thinking, making them more useful and trustworthy for real-world applications.

Abstract

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.
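
The "system prompt that emphasizes the thinking and answering process" is not quoted on this page; purely as an illustration (the wording below is assumed, in the DeepSeek-R1 style, and may differ from the actual Logic-RL prompt), such a prompt typically looks like this:

    # Illustrative only: the actual Logic-RL system prompt wording may differ.
    SYSTEM_PROMPT = (
        "You are a helpful assistant. First reason about the problem step by step "
        "inside <think> ... </think> tags, then give only the final answer inside "
        "<answer> ... </answer> tags."
    )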