The Majority is not always right: RL training for solution aggregation

Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, Ilia Kulikov

2025-09-11

Summary

This paper explores a better way to combine multiple answers generated by large language models (LLMs) to get more accurate results on difficult problems.

What's the problem?

Large language models sometimes struggle with complex reasoning. A common way to improve their performance is to have them generate several candidate answers to the same question and then pick the best one. However, simply choosing the most frequent answer (majority voting), or using a reward model to rank the candidates, isn't always effective: the correct answer may be a less common, but still valid, solution that the majority misses.
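To make the baseline concrete, here is a minimal sketch of majority voting over candidate final answers. The function name and the example answers are illustrative, not from the paper; it simply shows how a minority-but-correct answer loses the vote.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among candidate solutions."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# The correct answer can be in the minority: if "42" were right here,
# majority voting would still return "41".
candidates = ["41", "41", "42", "43", "41"]
print(majority_vote(candidates))  # -> "41"
```

This failure mode, where frequency and correctness diverge, is exactly what a learned aggregator is meant to handle.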

What's the solution?

The researchers developed a new method called AggLM that *learns* how to combine multiple candidate answers. Instead of just picking or ranking, AggLM acts like a reviewer: it examines each candidate solution, identifies disagreements among them, and synthesizes a final answer. They trained this aggregator model with reinforcement learning, rewarding it whenever it produced a verifiably correct answer. A key ingredient of training was balancing easy problems, where the majority answer is correct, with hard problems, where the correct answer is held by only a minority of candidates.
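The two training ingredients described above, a verifiable reward and an easy/hard balance, can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the paper's implementation: the reward here is plain exact-match, and the dictionary keys (`majority`, `gold`) are hypothetical names for each training problem's majority answer and ground-truth answer.

```python
import random

def verifiable_reward(final_answer, gold_answer):
    """Binary reward from a verifiable check (here, simple exact match)."""
    return 1.0 if final_answer == gold_answer else 0.0

def balance_examples(examples, seed=0):
    """Mix 'easy' problems (majority answer is correct) with 'hard' ones
    (the correct answer is a minority among candidates), so the aggregator
    learns both to follow the majority and to overturn it."""
    easy = [ex for ex in examples if ex["majority"] == ex["gold"]]
    hard = [ex for ex in examples if ex["majority"] != ex["gold"]]
    n = min(len(easy), len(hard))  # keep the two groups the same size
    rng = random.Random(seed)
    batch = rng.sample(easy, n) + rng.sample(hard, n)
    rng.shuffle(batch)
    return batch
```

During RL training, the aggregator's synthesized answer would be scored with `verifiable_reward` against the ground truth, and batches drawn via `balance_examples` keep it from collapsing into a pure majority-voter.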

Why it matters?

This work is important because it offers a more sophisticated way to exploit multiple sampled solutions from LLMs. AggLM outperforms existing rule-based and reward-model methods, generalizes to solutions from models other than, and even stronger than, the one used to produce its training data, and is efficient: it requires substantially fewer tokens than majority voting over a larger number of generated answers. In short, it yields more accurate results without a proportional increase in computing power.

Abstract

Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.