Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
2025-09-10

Summary
This paper introduces a new way to improve large language models (like those used for chatbots) after their initial training, using a technique called reinforcement learning. It focuses on making this post-training process more efficient and scalable, especially when many computers are working at once.
What's the problem?
Normally, making language models better with reinforcement learning requires a lot of computing power and coordination between different computers. This creates problems like slow response times, high memory usage, and potential failures, and costs grow quickly as more computers are added to speed things up. Existing methods struggle to handle the complexity of such large, distributed systems.
What's the solution?
The researchers developed a new algorithm called Swarm sAmpling Policy Optimization, or SAPO. It's designed to work on a network of computers without needing them to be perfectly synchronized or even have the same hardware. Each computer works independently, but they share information about their learning experiences (their "rollouts") with each other. This sharing allows good ideas to spread quickly, helping all the models improve faster and avoiding the bottlenecks that arise when everything has to pass through a central point.
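The paper is summarized here only at a high level, but the share-and-sample loop it describes can be sketched roughly as follows. This is a minimal illustration under assumptions of our own, not the authors' implementation: the `Rollout` and `SwarmNode` classes, the `local_ratio` mixing knob, and the placeholder reward and update logic are hypothetical stand-ins.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    """A decoded completion plus its scalar reward and the node that produced it."""
    prompt: str
    completion: str
    reward: float
    origin: str


@dataclass
class SwarmNode:
    """One compute node in the swarm: it owns its own policy and shares rollouts."""
    node_id: str
    local_ratio: float = 0.5  # hypothetical knob: fraction of each batch drawn locally

    def generate_rollouts(self, prompts, n_per_prompt=4):
        # Placeholder for decoding with the node's own policy and scoring rewards;
        # a real node would run its LM here and apply a task-specific reward function.
        return [
            Rollout(p, f"<completion from {self.node_id}>", random.random(), self.node_id)
            for p in prompts
            for _ in range(n_per_prompt)
        ]

    def build_batch(self, local, shared_pool, batch_size=32):
        # Mix locally generated rollouts with rollouts sampled from other nodes.
        n_local = min(int(batch_size * self.local_ratio), len(local))
        foreign = [r for r in shared_pool if r.origin != self.node_id]
        batch = random.sample(local, n_local)
        batch += random.sample(foreign, min(batch_size - n_local, len(foreign)))
        return batch

    def update_policy(self, batch):
        # Placeholder for the node's local RL step (e.g. a policy-gradient update)
        # computed with its current policy over the mixed batch.
        return sum(r.reward for r in batch) / max(len(batch), 1)


# One round in the spirit of the description above: every node publishes rollouts
# to a shared pool, then each node trains on a mix of its own and others' experiences.
nodes = [SwarmNode(f"node-{i}") for i in range(3)]
prompts = ["Solve 2 + 2.", "Reverse the string 'swarm'."]
pool = [r for node in nodes for r in node.generate_rollouts(prompts)]
for node in nodes:
    local = [r for r in pool if r.origin == node.node_id]
    print(node.node_id, node.update_policy(node.build_batch(local, pool)))
```

A consequence of this design, as the abstract notes, is that nodes exchange only rollouts rather than model weights or gradients, so machines with different models and hardware can participate and no central coordinator or synchronization barrier is needed.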
Why it matters?
This work is important because it makes it much more practical to improve large language models using reinforcement learning. By reducing the technical challenges and costs of scaling up the process, it opens the door to creating even more powerful and capable AI systems. The decentralized approach also allows for contributions from many different sources, potentially accelerating innovation in the field.
Abstract
Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g. latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required and nodes can operate in silos if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.