GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

2026-01-09

Summary

This paper investigates a problem with how we train powerful language models to satisfy multiple, sometimes conflicting, objectives at once using reinforcement learning. It proposes a new method, GDPO, that makes this kind of training more stable and better at balancing all of the objectives we care about.

What's the problem?

When training language models with multiple goals (like being helpful *and* concise), a common technique called Group Relative Policy Optimization, or GRPO, doesn't work as well as it should. GRPO adds up the rewards for each response the model generates and then normalizes that single combined score within a group of responses, so responses that succeed on completely different goals can end up with exactly the same training signal. The model can't tell which goal it actually met or how well it did on each one, which leads to ineffective learning and, in some cases, training that fails early on.
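
To make the collapse concrete, here is a minimal numerical sketch with two made-up binary rewards per rollout (the group of rollouts and reward values below are invented for illustration; the paper's exact reward definitions and any weighting are not shown):

```python
import numpy as np

# Each row is one rollout (sampled response) in a group of four;
# the columns are two separate rewards, e.g. (correctness, format adherence).
# These values are made up for illustration.
rewards = np.array([
    [1.0, 0.0],   # correct answer, wrong format
    [0.0, 1.0],   # wrong answer, correct format
    [1.0, 1.0],   # correct answer, correct format
    [1.0, 0.0],   # correct answer, wrong format
])

# GRPO-style: sum the rewards, then normalize the totals within the group.
totals = rewards.sum(axis=1)                        # [1, 1, 2, 1]
advantages = (totals - totals.mean()) / (totals.std() + 1e-8)

print(advantages.round(3))                          # [-0.577 -0.577  1.732 -0.577]
# The first, second, and fourth rollouts receive an identical advantage even
# though they succeed on different objectives, so the training signal cannot
# say which objective each rollout actually satisfied.
```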

What's the solution?

The researchers developed a new method called Group reward-Decoupled Normalization Policy Optimization, or GDPO. Instead of normalizing one lumped-together score, GDPO normalizes each reward separately across the group and only then combines them. This preserves the differences between the individual rewards, so the model can tell which goals a response satisfied and learn to balance them, which also makes training more stable. They tested GDPO against GRPO on tool calling, math reasoning, and coding tasks, measuring both correctness (accuracy, bug ratio) and adherence to constraints (format, length).
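
As a rough illustration of the decoupled-normalization idea, here is a sketch that reuses the invented rewards from the previous snippet. It is an assumed interpretation of the description above, not the authors' implementation, and it omits details such as per-reward weighting:

```python
import numpy as np

# Same made-up group of rollouts as before: columns are two separate rewards.
rewards = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [1.0, 0.0],
])

# Decoupled normalization: normalize each reward column separately across the
# group, then combine the per-reward advantages into one scalar per rollout.
mean = rewards.mean(axis=0)
std = rewards.std(axis=0) + 1e-8
per_reward_adv = (rewards - mean) / std             # shape: (rollouts, rewards)
advantages = per_reward_adv.sum(axis=1)

print(advantages.round(3))                          # [-0.423 -0.732  1.577 -0.423]
# Unlike the GRPO example, the rollout that is wrong but well-formatted now
# gets a different advantage from the rollouts that are correct but badly
# formatted, so the relative differences between the rewards are preserved.
```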

Why it matters?

This research is important because as language models get more advanced, we want them to be able to handle complex instructions and preferences. GDPO provides a more reliable way to train these models to meet multiple objectives simultaneously, which is crucial for building AI systems that are truly helpful and aligned with human values. It improves the performance and stability of training, making it easier to create better AI assistants.

Abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.