Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu

2025-10-10

Summary

This paper explores how to improve the reasoning abilities of large language models, like the ones powering chatbots, by combining two different types of feedback during their training.

What's the problem?

Currently, training these models often relies on simple 'correct or incorrect' feedback from automated checkers. While reliable, this binary signal is too strict: many problems have multiple valid answers or deserve partial credit. Relying solely on all-or-nothing feedback limits how much the model can learn and improve, especially on complex tasks. Reward models, which give more nuanced scores, can help, but they aren't always as trustworthy as the checkers.

What's the solution?

The researchers developed a new training method called HERO, which stands for Hybrid Ensemble Reward Optimization. HERO blends the strict correct/incorrect signals from the checkers with the more detailed scores from reward models. It first bounds the reward model's scores within groups defined by the checker's judgment, so an incorrect answer can never be rewarded more highly than a correct one. It then weights prompts by how much the sampled answers disagree, focusing the reward model's finer distinctions on particularly challenging problems, where detailed feedback is most valuable. Essentially, it uses the best of both worlds.
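The two ideas above can be sketched in code. The following is a minimal, hypothetical illustration of stratified normalization and a variance-based difficulty proxy; the function names, band endpoints, and normalization details are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def hero_reward(verifier_labels, rm_scores,
                correct_band=(0.6, 1.0), wrong_band=(0.0, 0.4)):
    """Hypothetical sketch of HERO-style stratified normalization.

    Reward-model scores are min-max normalized *within* each verifier
    group (correct vs. incorrect), then mapped into non-overlapping
    bands so no incorrect answer can outscore a correct one. The band
    endpoints here are illustrative, not the paper's values.
    """
    v = np.asarray(verifier_labels, dtype=bool)
    s = np.asarray(rm_scores, dtype=float)
    out = np.empty_like(s)
    for mask, (lo, hi) in ((v, correct_band), (~v, wrong_band)):
        if mask.any():
            grp = s[mask]
            span = grp.max() - grp.min()
            # Flat group (no quality signal): place everyone mid-band.
            norm = (grp - grp.min()) / span if span > 0 else np.full_like(grp, 0.5)
            out[mask] = lo + norm * (hi - lo)
    return out

def prompt_difficulty_weight(rewards):
    """Crude proxy for variance-aware weighting: prompts whose sampled
    responses received spread-out rewards get more weight, since dense
    feedback matters most where answers disagree."""
    r = np.asarray(rewards, dtype=float)
    return float(r.std())
```

For example, with verifier labels `[1, 1, 0, 0]` and reward-model scores `[0.2, 0.9, 0.8, 0.1]`, the two correct answers land in the upper band and the two incorrect ones in the lower band, so correctness ordering is preserved while quality distinctions survive within each group.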

Why it matters?

This work is important because it shows a way to make large language models better at reasoning without sacrificing their reliability. By combining the stability of automated checkers with the nuanced feedback of reward models, HERO allows these models to learn more effectively and perform better on a wider range of complex reasoning tasks, including those that are difficult to automatically verify.

Abstract

Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.