MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism

Shulin Liu, Dong Du, Tao Yang, Yang Li, Boyu Qiu

2025-11-17

Summary

This paper introduces a new way to improve how large language models (LLMs) solve complex problems by having multiple 'agents' work together, like a team. It focuses on making this team approach work well with open-source LLMs, which are freely available and customizable.

What's the problem?

Large language models are getting better at reasoning, thanks to techniques like reinforcement learning with verifiable rewards and spending more compute at inference time (test-time scaling). However, they still struggle with problems that require many thinking steps, because a single inference can only produce a limited amount of text. Using multiple agents – one to solve, one to check, and one to correct – works well in powerful closed-source models like Gemini 2.5 Pro, but it doesn't translate easily to open-source models, because those models aren't as good at critically evaluating and fixing their own work.
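To make the solve/check/correct division of labor concrete, here is a minimal Python sketch of that iterative refinement loop. The agent functions below are hypothetical toy stand-ins for LLM calls (in the real system each would be a model generating text), but the control flow mirrors the Solver → Verifier → Corrector cycle described above:

```python
def iterative_refine(problem, solver, verifier, corrector, max_rounds=3):
    """Solver proposes a solution; Verifier critiques it; Corrector revises.
    Loop until the Verifier accepts or the round budget runs out."""
    solution = solver(problem)
    for _ in range(max_rounds):
        ok, critique = verifier(problem, solution)
        if ok:
            return solution
        solution = corrector(problem, solution, critique)
    return solution

# Toy agents: the "problem" is a list of numbers; the task is their sum.
def toy_solver(nums):
    return sum(nums[:-1])            # deliberately drops the last number

def toy_verifier(nums, answer):
    correct = sum(nums)
    return answer == correct, f"expected {correct}, got {answer}"

def toy_corrector(nums, answer, critique):
    return sum(nums)                 # a real Corrector would use the critique

print(iterative_refine([1, 2, 3], toy_solver, toy_verifier, toy_corrector))
# -> 6
```

The point of the loop is that the final answer can be better than anything the Solver produces alone, but only if the Verifier and Corrector are genuinely capable, which is exactly where open-source models fall short without further training.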

What's the solution?

The researchers developed a framework called MarsRL. It trains all the agents in the team *together* with reinforcement learning, giving each agent its own reward signal so that noisy, confusing feedback from one agent doesn't corrupt the learning of another. It also uses a training scheme inspired by factory assembly lines: stages of different reasoning trajectories are processed in a pipeline, which makes long, multi-round reasoning chains much more efficient to train on. They applied this to the open-source model Qwen3-30B-A3B-Thinking-2507.
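The assembly-line idea can be illustrated with a small scheduling sketch. This is not the paper's actual implementation, just a rough Python illustration of why pipelining helps: instead of finishing one trajectory's Solver → Verifier → Corrector stages before starting the next, stages of different trajectories overlap in time:

```python
STAGES = ["Solver", "Verifier", "Corrector"]

def pipeline_schedule(n_trajectories):
    """Return, for each time step, which (trajectory, stage) pairs run
    concurrently under a simple pipelined schedule."""
    ticks = []
    for t in range(n_trajectories + len(STAGES) - 1):
        active = []
        for traj in range(n_trajectories):
            stage = t - traj          # trajectory `traj` enters at tick `traj`
            if 0 <= stage < len(STAGES):
                active.append((traj, STAGES[stage]))
        ticks.append(active)
    return ticks

for t, active in enumerate(pipeline_schedule(3)):
    print(t, active)
# Sequential execution of 3 trajectories would take 3 * 3 = 9 stage-steps;
# the pipeline finishes in 3 + 2 = 5 ticks, keeping all stages busy.
```

At the steady-state tick, one trajectory is in its Corrector stage while another is being verified and a third is being solved, so no agent sits idle waiting for long trajectories to finish end-to-end.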

Why it matters?

MarsRL significantly improved the Qwen3 model's accuracy on challenging reasoning benchmarks – from 86.5% to 93.3% on AIME2025 and from 64.9% to 73.8% on BeyondAIME – even letting the 30B model outperform the much larger Qwen3-235B version. This shows that effective multi-agent reasoning systems can be built on open-source LLMs, making advanced reasoning capabilities more accessible and customizable for a wider range of applications.

Abstract

Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi-agent reasoning systems offer a promising alternative by employing multiple agents including Solver, Verifier, and Corrector, to iteratively refine solutions. While effective in closed-source models like Gemini 2.5 Pro, they struggle to generalize to open-source models due to insufficient critic and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.