ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking
Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha
2026-01-14
Summary
This paper focuses on improving how we train AI agents, specifically large language models (LLMs), to handle complex, open-ended tasks like planning a detailed trip or doing in-depth research. It addresses the difficulty of teaching these agents when there isn't one single 'right' answer.
What's the problem?
When LLM agents are trained with reinforcement learning on tasks without clear-cut answers, we usually rely on 'reward models' that assign a standalone score to each attempt. The problem is that these reward models struggle to tell a good attempt from a slightly better one, so the scores within a group of attempts get squashed into a narrow range. The real quality differences between attempts end up smaller than the reward model's own scoring noise, so the feedback no longer shows which behaviors actually improve performance and training stagnates.
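To make this failure mode concrete, here is a toy, hypothetical simulation (not from the paper) of the usual group-relative advantage computation: when a pointwise reward model compresses eight rollouts of clearly different quality into a narrow score band and adds judging noise of comparable size, the sign of the resulting advantage only weakly tracks true quality.

```python
# Toy, hypothetical illustration: why compressed, noisy pointwise scores
# give unreliable group-relative advantages.
import random
import statistics

random.seed(0)

def group_advantages(scores):
    """Standard group normalization: (score - mean) / std."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores) or 1.0
    return [(s - mu) / sd for s in scores]

agreements = []
for _ in range(1000):
    # True qualities of 8 rollouts in a group, clearly separated on [0, 1].
    true_quality = [i / 7 for i in range(8)]
    # A pointwise reward model compresses them into a ~0.78-0.82 band and
    # adds judging noise of comparable magnitude.
    rm_scores = [0.78 + 0.04 * q + random.gauss(0, 0.03) for q in true_quality]
    adv = group_advantages(rm_scores)
    # Does the advantage's sign agree with whether the rollout is above average?
    correct = sum((a > 0) == (q > 0.5) for a, q in zip(adv, true_quality))
    agreements.append(correct / 8)

print(f"sign agreement with true quality: {statistics.mean(agreements):.2f}")
# Agreement lands well below 1.0 (roughly 0.6-0.7 with these numbers),
# i.e. much of the advantage signal is reward-model noise, not quality.
```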
What's the solution?
The researchers introduce a training method called ArenaRL. Instead of giving each attempt a single score, ArenaRL *ranks* different attempts against each other: the agent's responses compete in a tournament, and a pairwise judge guided by detailed, multi-level rubrics decides which of two responses is better. The agent then learns from which responses are preferred over others rather than from noisy absolute scores. To keep this efficient, the ranking uses a seeded single-elimination tournament, which needs only about N-1 pairwise comparisons for a group of N responses instead of comparing every pair (a code sketch follows below). Finally, they build two new challenging benchmarks, Open-Travel and Open-DeepResearch, to properly test these kinds of agents.
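The summary above does not spell out the bracket construction, so the following is only a minimal sketch of the idea under stated assumptions: `pairwise_judge` stands in for the rubric-guided LLM judge, and the seeding and bye rules are illustrative choices rather than the paper's exact scheme.

```python
from typing import Callable, List, Sequence

def tournament_rank(
    trajectories: Sequence[str],
    seeds: Sequence[float],
    pairwise_judge: Callable[[str, str], int],  # returns 0 or 1: index of the preferred trajectory
) -> List[int]:
    """Seeded single-elimination ranking sketch.

    Each trajectory is eliminated by exactly one lost comparison, so ranking a
    group of N rollouts costs N - 1 judge calls (O(N)) instead of the
    N * (N - 1) / 2 calls of a full round-robin (O(N^2)).
    """
    n = len(trajectories)
    alive = sorted(range(n), key=lambda i: seeds[i])  # best seed first
    eliminated_in_round = {}
    round_no = 0

    while len(alive) > 1:
        round_no += 1
        pool, winners = alive[:], []
        if len(pool) % 2 == 1:            # odd bracket: top remaining seed gets a bye
            winners.append(pool.pop(0))
        half = len(pool) // 2
        # Pair best remaining seed with worst, second-best with second-worst, ...
        for a, b in zip(pool[:half], reversed(pool[half:])):
            winner = a if pairwise_judge(trajectories[a], trajectories[b]) == 0 else b
            loser = b if winner == a else a
            eliminated_in_round[loser] = round_no
            winners.append(winner)
        alive = sorted(winners, key=lambda i: seeds[i])  # keep the bracket seeded

    eliminated_in_round[alive[0]] = round_no + 1          # champion survives longest
    # Later elimination -> better (smaller) rank; rank 0 is the tournament winner.
    by_survival = sorted(range(n), key=lambda i: -eliminated_in_round[i])
    ranks = [0] * n
    for r, idx in enumerate(by_survival):
        ranks[idx] = r
    return ranks
```

The resulting ranks, rather than raw reward-model scores, could then be turned into advantage signals for policy optimization, for example by centering and normalizing them within the group.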
Why it matters?
This work is important because it helps LLM agents become much better at tackling complex, real-world problems that don't have simple solutions. By improving the way we provide feedback during training, ArenaRL allows agents to learn more effectively and generate more reliable and useful results for tasks like planning trips or conducting research, ultimately making these AI systems more helpful and capable.
Abstract
Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves advantage estimation accuracy nearly equivalent to full O(N^2) pairwise comparison while requiring only O(N) comparisons, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
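As a back-of-the-envelope check on the complexity claim, and to show one simple (assumed, not paper-specified) way tournament ranks could be mapped to advantages:

```python
# Judge-call budget for a group of N rollouts, plus a centered-rank advantage
# (an assumed stand-in for whatever rank-to-advantage mapping ArenaRL uses).
N = 16
full_pairwise = N * (N - 1) // 2      # round-robin arena: 120 judge calls, O(N^2)
single_elim = N - 1                   # seeded single elimination: 15 judge calls, O(N)
print(full_pairwise, single_elim)

ranks = list(range(N))                # 0 = tournament winner, N-1 = first eliminated
mid = (N - 1) / 2
advantages = [(mid - r) / mid for r in ranks]   # spread over [-1, 1], winner gets +1
print([round(a, 2) for a in advantages])
```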