FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning
Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang
2025-10-30
Summary
This paper investigates a problem with how we teach large language models (LLMs) to 'think' using a method called reinforcement learning with verifiable rewards. In short, LLMs improve at reasoning by trying different approaches and receiving a reward when they arrive at the right answer. But a model can sometimes 'cheat' its way to that answer and still collect the reward, which ultimately hurts its ability to truly reason.
What's the problem?
When LLMs learn through trial and error with rewards, they sometimes find shortcuts to the correct answer that aren't actually good reasoning. For example, they might guess the answer outright or jump to a conclusion without showing their work. The problem is that these 'lucky guesses' earn exactly the same reward as a correct, well-reasoned answer, so the model learns to rely on these unreliable patterns, capping its real reasoning ability. While such shortcuts help the model improve quickly at first, they hinder its progress later in training.
What's the solution?
The researchers developed a new technique called Flawed-Aware Policy Optimization (FAPO). The method requires no extra hyperparameters to tune: it automatically identifies 'flawed' reasoning paths, such as lucky guesses, and applies a small reward penalty to them. FAPO still lets the model exploit these shortcuts to learn quickly at the start of training, then gradually steers it toward more reliable reasoning as it improves. The researchers also built a 'generative reward model' (GenRM) that can pinpoint exactly where an error occurs in the reasoning process, making flawed rollouts easier to detect accurately.
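The idea above can be sketched as a simple reward-shaping rule. This is a minimal toy version, not the paper's implementation: the rollout fields, the flaw detector, and the fixed `penalty` constant are illustrative assumptions (the paper's penalty is parameter-free, and its detector is a trained generative reward model).

```python
# Toy sketch of FAPO-style reward shaping. Assumptions: each rollout is a
# dict with a 'correct' flag (final answer matches the reference) and a
# 'flawed' flag (a detector flagged the reasoning, e.g. answer-guessing or
# a jump-in-reasoning). The fixed `penalty` is a stand-in for the paper's
# parameter-free penalty.

def fapo_rewards(rollouts, penalty=0.5):
    """Assign outcome rewards, discounting flawed-positive rollouts."""
    rewards = []
    for r in rollouts:
        if not r["correct"]:
            rewards.append(0.0)            # wrong answer: no reward
        elif r["flawed"]:
            rewards.append(1.0 - penalty)  # flawed-positive: reduced reward
        else:
            rewards.append(1.0)            # correct and sound reasoning
    return rewards
```

Note that a flawed-positive rollout still earns more than a wrong answer, so early in training the policy can keep using these shortcuts for fast gains, while the gap to fully sound rollouts pushes optimization toward reliable reasoning later on.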
Why it matters?
This work is important because it addresses a key limitation in how we're training LLMs to reason. By preventing models from relying on unreliable shortcuts, we can build AI systems that are not only accurate but also capable of truly understanding and solving problems. This leads to more trustworthy and robust AI, improving performance, stability during training, and the overall quality of the reasoning process without requiring more computing power.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.