Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang

2025-09-16

Summary

This paper tackles a challenge in teaching AI, specifically large language models, to balance multiple goals at once, like being both helpful and harmless.

What's the problem?

When trying to get an AI to do multiple things well, a common technique is to combine all the goals into a single score using fixed weights. However, this method often fails when the relationship between the model's parameters and its performance on each goal is complex and curved rather than simple and straight (in technical terms, when the front of achievable trade-offs is non-convex). This is especially true for large language models, because their behavior changes in unpredictable ways as they learn, making it hard to find the right balance with weights that are set once and never changed.
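
To make this concrete, here is a tiny, self-contained sketch (with made-up numbers, not results from the paper) of why fixed-weight averaging can never select a balanced solution that sits in a concave part of the trade-off front, even though that solution is perfectly good:

```python
import numpy as np

# Three Pareto-optimal outcomes on a non-convex front (illustrative numbers):
# A favors objective 1, C favors objective 2, B is a balanced compromise.
A, B, C = np.array([1.0, 0.0]), np.array([0.4, 0.4]), np.array([0.0, 1.0])

for w1 in np.linspace(0.0, 1.0, 11):
    w = np.array([w1, 1.0 - w1])
    scores = {"A": w @ A, "B": w @ B, "C": w @ C}
    best = max(scores, key=scores.get)
    print(f"weights = ({w1:.1f}, {1 - w1:.1f}) -> best point: {best}")

# B is never selected for any fixed weight vector: linear scalarization
# only reaches points on the convex hull of the front.
```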

What's the solution?

The researchers developed a new approach called 'dynamic reward weighting'. Instead of setting the weights for each goal once at the beginning and leaving them alone, their system *adjusts* those weights during the learning process. It continuously figures out which goals need more attention at the moment, which lets the model explore different trade-offs between goals more effectively. They propose two ways to do this adjustment: one that measures how well the solutions found so far cover the space of possible trade-offs (hypervolume-guided weight adaptation, sketched below) and another that uses the model's own gradient signals during training (gradient-based weight optimization).
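
As a rough illustration of the first idea, the sketch below upweights whichever objective would currently add the most coverage (hypervolume) to the set of trade-offs reached so far. The helper names, the probe step size, and the two-objective restriction are assumptions made for illustration, not the paper's implementation:

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume of a 2-objective maximization front w.r.t. a reference point."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def hypervolume_guided_weights(front, ref, n_objectives=2, eps=1e-8):
    """Upweight the objective whose improvement would add the most hypervolume.

    front: list of (obj1, obj2) scores from recent policy checkpoints.
    ref:   reference point (worst acceptable score per objective).
    """
    base = hypervolume_2d(front, ref)
    best = np.max(np.asarray(front), axis=0)
    contributions = np.zeros(n_objectives)
    for k in range(n_objectives):
        probe = best.copy()
        probe[k] += 0.05  # hypothetical probe step along objective k
        contributions[k] = hypervolume_2d(front + [tuple(probe)], ref) - base
    weights = contributions + eps
    return weights / weights.sum()

# Example: accuracy vs. brevity scores from recent rollouts (illustrative numbers).
front = [(0.62, 0.40), (0.55, 0.48), (0.70, 0.30)]
print(hypervolume_guided_weights(front, ref=(0.0, 0.0)))
```

The returned weights would then scale each objective's reward for the next batch of training, shifting effort toward whichever goal currently expands the reachable trade-off region the most.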

Why it matters?

This work is important because it helps align large language models more closely with our intentions. By letting the model dynamically balance competing objectives, the method reaches better trade-offs with fewer training steps than fixed-weight approaches, leading to AI systems that are more capable and safer to use. It also works with commonly used reinforcement learning algorithms and across different reasoning datasets and model families.

Abstract

Prior works in multi-objective reinforcement learning typically use linear reward scalarization with fixed weights, which provably fail to capture non-convex Pareto fronts and thus yield suboptimal results. This limitation becomes especially critical in online preference alignment for large language models, where stochastic trajectories generated by parameterized policies create highly non-linear and non-convex mappings from parameters to objectives, for which no single static weighting scheme can find optimal trade-offs. We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives during training, facilitating effective exploration of Pareto fronts in objective space. We introduce two approaches of increasing sophistication and generalizability: (1) hypervolume-guided weight adaptation and (2) gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms (including GRPO, REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning datasets, and applicability to different model families, consistently achieving Pareto dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.
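
As a companion to the hypervolume sketch above, here is a minimal, hedged illustration of the second idea, gradient-based weight optimization. This is not the paper's algorithm; it simply shows one plausible way to turn per-objective gradient signals into reward weights (here, upweighting objectives whose gradients suggest slow progress), using a toy policy head and synthetic stand-in losses:

```python
import torch

def gradient_based_weights(per_objective_losses, params, eps=1e-8):
    """Derive objective weights from per-objective gradient magnitudes.

    Illustrative heuristic (an assumption, not the paper's method): objectives
    whose gradients are currently small get larger weights, so the combined
    update does not neglect them.
    """
    grad_norms = []
    for loss in per_objective_losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        grad_norms.append(torch.sqrt(sum((g ** 2).sum() for g in grads)))
    norms = torch.stack(grad_norms)
    # Inverse-magnitude weighting, normalized to sum to one.
    weights = 1.0 / (norms + eps)
    return (weights / weights.sum()).detach()

# Toy usage with a tiny policy head and two synthetic objectives.
policy = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
logits = policy(x)
loss_helpful = logits[:, 0].mean()           # stand-in for objective 1
loss_harmless = (logits[:, 1] ** 2).mean()   # stand-in for objective 2

w = gradient_based_weights([loss_helpful, loss_harmless], list(policy.parameters()))
combined = w[0] * loss_helpful + w[1] * loss_harmless
combined.backward()
```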