StepWiser: Stepwise Generative Judges for Wiser Reasoning

Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar

2025-08-28

Summary

This paper focuses on how to better evaluate and train AI models that reason through problems in multiple steps, such as solving a complex math problem, by judging whether each intermediate step is actually sound.

What's the problem?

When AI models tackle complicated tasks, they break them down into smaller steps, but it is hard to make sure each step is logically correct. Current methods for checking these steps, known as process reward models, act like simple classifiers: they label each step 'good' or 'bad' without explaining *why*. They also rely on supervised fine-tuning with static, pre-labeled datasets, which limits how well they generalize to new kinds of problems.
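To make that contrast concrete, here is a minimal sketch (not from the paper) of how a classifier-style process reward model is typically used: each step gets a single correctness score and nothing else. The `toy_prm_score` function and its constant output are purely illustrative placeholders for a real fine-tuned classifier.

```python
from typing import List

def toy_prm_score(context: str) -> float:
    """Placeholder for a fine-tuned classifier head that would return
    P(latest step is correct | context). Here it is just a dummy value."""
    return 0.5  # a real PRM would run a model forward pass here

def score_steps(problem: str, steps: List[str]) -> List[float]:
    """Score each intermediate step with a bare number and no rationale."""
    scores = []
    context = problem
    for step in steps:
        context = context + "\n" + step
        scores.append(toy_prm_score(context))
    return scores

# Example: the classifier emits per-step scores but never explains them.
print(score_steps("Solve: 2 + 3 * 4", ["3 * 4 = 12", "2 + 12 = 14"]))
```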

What's the solution?

The researchers propose a new approach called StepWiser. Instead of classifying each step as right or wrong, StepWiser *reasons about* the policy model's reasoning: it generates its own 'thinking tokens' explaining its judgment before delivering a final verdict on the step. This 'generative judge' is trained with reinforcement learning, using the relative outcomes of rollouts (continuations sampled from a given step) as its training signal, so it does not need a large, pre-labeled dataset.
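The sketch below is a simplification, not the authors' code, but it illustrates the two ideas: the judge writes a free-form reasoning trace followed by a parseable verdict, and its reinforcement-learning reward comes from whether that verdict agrees with a label derived from rollout outcomes. Here the label is approximated as whether rollouts continuing from the step succeed at least as often as rollouts sampled from before it; all function names and the exact reward scheme are assumptions for illustration.

```python
import re
from typing import List

def parse_verdict(judge_output: str) -> bool:
    """The generative judge emits 'thinking tokens' first, then a final
    verdict line such as 'Verdict: good' or 'Verdict: bad'."""
    match = re.search(r"Verdict:\s*(good|bad)", judge_output, re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "good"

def rollout_label(success_before: List[bool], success_after: List[bool]) -> bool:
    """Label a step 'good' if rollouts that include it succeed at least as
    often as rollouts sampled from just before it (a relative-outcome signal)."""
    rate = lambda xs: sum(xs) / max(len(xs), 1)
    return rate(success_after) >= rate(success_before)

def judge_reward(judge_output: str,
                 success_before: List[bool],
                 success_after: List[bool]) -> float:
    """RL reward for the judge: 1.0 if its verdict matches the rollout-derived
    label, else 0.0. This scalar would drive a policy-gradient update."""
    label = rollout_label(success_before, success_after)
    return 1.0 if parse_verdict(judge_output) == label else 0.0

# Example: 3/4 rollouts after the step succeed vs. 1/4 before, so a 'good'
# verdict matches the label and earns a reward of 1.0.
output = "The step correctly factors the quadratic... Verdict: good"
print(judge_reward(output, [True, False, False, False],
                   [True, True, True, False]))
```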

Why it matters?

This work matters because it improves the accuracy of judging intermediate steps in complex reasoning tasks. The resulting judge can also be used to improve the policy model *while* it is being trained, and to guide search at inference time so the model is more likely to find a correct solution. Ultimately, this leads to more reliable and capable AI systems that can handle harder tasks.
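One simple way a stepwise judge can help at inference time is sketched below: sample several candidate next steps and prefer one the judge accepts. This is a best-of-N-style toy example under assumed interfaces, not the paper's exact search procedure.

```python
import random
from typing import Callable, List

def guided_step_search(candidates: List[str],
                       judge_accepts: Callable[[str], bool]) -> str:
    """Pick one next step from sampled candidates: prefer a step the judge
    accepts; otherwise fall back to a random candidate (simple sketch)."""
    accepted = [c for c in candidates if judge_accepts(c)]
    return random.choice(accepted) if accepted else random.choice(candidates)

# Example with a toy 'judge' that accepts steps containing an equals sign.
steps = ["guess the answer is 7", "2 + 12 = 14", "restate the question"]
print(guided_step_search(steps, lambda s: "=" in s))
```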

Abstract

As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.