Reasoning-Aware GRPO using Process Mining
Taekhyun Park, Yongjae Lee, Hyerim Bae
2025-10-30
Summary
This paper focuses on improving how large language models perform complex reasoning tasks, specifically by refining how they are trained *after* their initial training is complete (the post-training stage).
What's the problem?
Currently, when these models are fine-tuned with reinforcement learning, the reward mainly reflects whether the final answer is correct. This doesn't teach the model *how* to reason properly, only to land on the right answer. It's like teaching someone to solve math problems by only saying whether the final answer is right or wrong, without ever checking their work.
What's the solution?
The researchers developed a new method called PM4GRPO. Instead of scoring only the final answer, it also evaluates how the model arrives at that answer. They use a technique called 'process mining' to compare the model's reasoning steps to those of a highly capable, pretrained teacher model. When the model's reasoning process closely matches the teacher's, it receives an additional 'conformance' reward, encouraging it to reason in a similarly effective way.
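The idea of augmenting an outcome reward with a conformance term can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the function names, the simple ordered-step-matching conformance measure (a crude stand-in for real process-mining conformance checking), and the weight `lam` are all assumptions.

```python
# Hypothetical sketch of a reasoning-aware reward in the spirit of PM4GRPO.
# All names and the toy conformance measure are illustrative assumptions.

def answer_reward(response: str, gold_answer: str) -> float:
    """Outcome reward: 1.0 if the response ends with the reference answer, else 0.0."""
    return 1.0 if response.strip().endswith(gold_answer) else 0.0

def conformance_reward(policy_steps: list[str], teacher_steps: list[str]) -> float:
    """Toy stand-in for a process-mining conformance score: the fraction of the
    teacher's reasoning steps that also appear, in order, in the policy's trace."""
    matched, i = 0, 0
    for step in policy_steps:
        if i < len(teacher_steps) and step == teacher_steps[i]:
            matched += 1
            i += 1
    return matched / len(teacher_steps) if teacher_steps else 0.0

def total_reward(response, gold_answer, policy_steps, teacher_steps, lam=0.5):
    """Outcome reward augmented with a scalar conformance term."""
    return answer_reward(response, gold_answer) + lam * conformance_reward(
        policy_steps, teacher_steps
    )
```

Under this toy scheme, two responses with the same correct answer receive different rewards if only one of them follows the teacher's reasoning path, which is the behavioral lever the paper describes.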
Why does it matter?
This is important because it shows that teaching a model *how* to think, not just *what* to think, significantly improves its reasoning abilities. By focusing on the reasoning process itself, they were able to create models that perform much better on complex reasoning tasks, which is a crucial step towards building more intelligent and reliable AI systems.
Abstract
Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
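As the abstract notes, PM4GRPO builds on GRPO, which scores each sampled response relative to its group of samples. A minimal sketch of the group-relative advantage computation (assuming standard mean/std normalization within the group; the small epsilon is an assumed numerical guard):

```python
# Minimal sketch of GRPO's group-relative advantage: each response's reward
# (here, the augmented answer/format/conformance reward) is normalized by the
# mean and standard deviation of its sampling group.

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Return (r - mean) / (std + eps) for each reward in the group."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With this normalization, responses whose (conformance-augmented) reward exceeds the group average receive positive advantages, so the policy is pushed toward traces that both answer correctly and conform to the teacher's reasoning procedure.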