GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An

2025-12-03

Summary

This paper introduces a new approach to reinforcement learning called GoRL, which aims to create more capable and stable AI agents for complex tasks involving continuous actions, like controlling a robot's movements.

What's the problem?

Reinforcement learning struggles with finding the right balance between stability and expressiveness. Simple policies that are easy to optimize often can't handle the variety of actions needed for difficult tasks. More complex policies, like those using diffusion models, *can* represent a wider range of actions, but they're hard to train because the process of generating actions is unstable and creates noisy signals for the AI to learn from.

What's the solution?

GoRL solves this by separating how the AI learns its strategy from how it actually chooses actions. It uses a simpler, easily optimized 'latent policy' to make high-level decisions, and then a separate 'generative decoder' translates those decisions into specific actions. The latent policy is updated frequently for stability, while the decoder is updated less often to gradually improve its ability to generate diverse and effective actions. This avoids the instability issues of directly optimizing the action generation process.
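This decoupled, two-timescale structure can be sketched in a few lines of code. The sketch below is purely illustrative and assumes toy stand-ins for every component: the latent policy is a plain Gaussian, the "decoder" is just a linear map (the paper uses a conditional generative model such as diffusion or flow matching), and the update steps are placeholders rather than real RL objectives. All names and dimensions here are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and schedule (illustrative, not from the paper)
LATENT_DIM, ACTION_DIM = 4, 2
DECODER_UPDATE_EVERY = 10  # two timescales: the decoder updates less often

# Tractable latent policy: a simple Gaussian over a latent variable z
latent_mean = np.zeros(LATENT_DIM)
latent_log_std = np.zeros(LATENT_DIM)

# Generative-decoder stand-in: a linear map z -> action
# (GoRL would use a conditional generative model here instead)
decoder_W = rng.standard_normal((ACTION_DIM, LATENT_DIM)) * 0.1

def sample_action():
    """Sample z from the latent policy, then decode it into an action."""
    z = latent_mean + np.exp(latent_log_std) * rng.standard_normal(LATENT_DIM)
    return decoder_W @ z

latent_updates = decoder_updates = 0
for step in range(100):
    action = sample_action()
    # Fast timescale: the latent policy is updated every step
    # (placeholder step; a real agent would apply an RL gradient here)
    latent_mean *= 0.99
    latent_updates += 1
    # Slow timescale: the decoder is refreshed only occasionally,
    # gradually increasing expressiveness without destabilizing learning
    if step % DECODER_UPDATE_EVERY == 0:
        decoder_W *= 0.999
        decoder_updates += 1

print(latent_updates, decoder_updates)  # prints "100 10"
```

The point of the sketch is the control flow, not the arithmetic: gradients only ever touch the tractable latent policy, so optimization never backpropagates through the generative sampling chain, which is the instability the paper identifies in prior generative-policy methods.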

Why it matters?

This research is important because it provides a practical way to build AI agents that are both reliable and capable of performing complex tasks. By decoupling optimization from generation, GoRL outperforms existing methods on challenging continuous-control benchmarks; on the HopperStand task, for example, it reaches a normalized return above 870, more than three times that of the strongest baseline. This suggests a promising path toward more advanced and versatile reinforcement learning systems.

Abstract

Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.