
Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, Ilija Bogunovic

2026-02-06


Summary

This paper focuses on improving how large language models perform when asked to do many different reasoning tasks, moving beyond just making them good at one specific task.

What's the problem?

When you train a language model on several reasoning tasks at once with a common technique called GRPO, optimization tends to become imbalanced: the model improves quickly on the easier tasks while the harder ones stagnate. On top of that, some tasks frequently produce prompts where every sampled answer gets the same reward, so those prompts yield no learning signal (zero advantages, and thus zero gradients), which further distorts how much each task actually contributes to training. The result is unreliable overall performance: the model can be great at some tasks and poor at others.
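To see why a task can stop contributing, it helps to look at GRPO's group-relative advantage: rewards within a group of sampled completions are normalized against the group's mean (and standard deviation), so a prompt where every completion earns the same reward produces all-zero advantages. A minimal illustrative sketch in Python (not the paper's code):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as commonly used in GRPO: each completion's
    reward is normalized by the mean and std of its prompt's group."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    std = rewards.std()
    # If every completion in the group gets the same reward (all correct or
    # all wrong), the centered rewards are zero and the policy gradient for
    # this prompt vanishes.
    return centered / (std + 1e-8)

print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # informative prompt: non-zero advantages
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # too hard: all-zero advantages, zero gradient
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # too easy: all-zero advantages, zero gradient
```

Tasks whose prompts are mostly "too easy" or "too hard" for the current model therefore supply far fewer useful gradients than their share of the training data suggests.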

What's the solution?

The researchers developed a new method called Multi-Task GRPO (MT-GRPO). It dynamically adjusts how much weight each task receives during training, explicitly pushing weight toward the *worst*-performing tasks so that progress stays balanced rather than being dominated by the easy ones. It also uses a ratio-preserving sampler, which ensures that the gradient each task actually contributes matches its assigned weight, even when some tasks rarely produce useful feedback. In short, it is a more principled way to balance the training signal across multiple tasks.
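The article does not spell out the exact update rules, but a rough sketch of the two ideas might look like the following, assuming an exponentiated-gradient-style weight update and a sampler that allocates batch slots per task in proportion to the adapted weights (all function names and the specific update rule are hypothetical, for illustration only):

```python
import numpy as np

def update_task_weights(weights, task_accuracies, lr=1.0):
    """Hypothetical sketch: shift weight toward worse-performing tasks via a
    multiplicative (mirror-ascent-style) update; the paper's rule may differ."""
    weights = np.asarray(weights, dtype=float)
    acc = np.asarray(task_accuracies, dtype=float)
    # Lower accuracy -> larger weight, so the worst task gets more attention.
    new_w = weights * np.exp(-lr * acc)
    return new_w / new_w.sum()

def ratio_preserving_sample(task_prompt_pools, weights, batch_size, rng):
    """Hypothetical sketch of a ratio-preserving sampler: fill the batch so each
    task's share of prompts matches its adapted weight, keeping task-wise
    gradient contributions aligned with those weights."""
    counts = np.floor(np.asarray(weights) * batch_size).astype(int)
    counts[np.argmax(weights)] += batch_size - counts.sum()  # assign rounding remainder
    batch = []
    for pool, n in zip(task_prompt_pools, counts):
        batch.extend(rng.choice(pool, size=n, replace=True).tolist())
    return batch

rng = np.random.default_rng(0)
weights = np.array([1 / 3, 1 / 3, 1 / 3])
weights = update_task_weights(weights, task_accuracies=[0.9, 0.6, 0.3])
print(weights)  # the hardest task (accuracy 0.3) now gets the largest share
```

The sampler is what keeps the adapted weights meaningful: without it, tasks whose prompts often yield zero advantages would contribute less gradient than their weight implies.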

Why it matters?

This work is important because it makes large language models more reliable and useful in real-world situations where they need to handle a variety of different problems. MT-GRPO not only improves performance on the hardest tasks but also does so more efficiently, requiring fewer training steps to achieve good results across the board. This means we can get more consistent and dependable performance from these models.

Abstract

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.