RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang

2026-04-17

Summary

This paper introduces RAD-2, a system that improves how self-driving cars plan their trajectories, especially in tricky situations where the behavior of other road users is uncertain.

What's the problem?

Current self-driving systems struggle to predict what other cars and pedestrians will do, which can lead to jerky or unsafe maneuvers. Existing planners built on 'diffusion' models are good at proposing many possible routes, but they can be unstable, and when trained purely by imitating human drivers they never receive corrective feedback for bad choices. They also have trouble correcting course while driving.

What's the solution?

RAD-2 uses two main parts working together. First, a 'generator' proposes a set of candidate routes. Then a 'discriminator,' trained with reinforcement learning, scores those candidates by how well the car would actually fare driving them over the long run, and the best one is chosen. This sidesteps the hard problem of rewarding the generator directly for every tiny step of every route. The authors also developed techniques to learn more efficiently from experience, exploiting how consecutive actions relate over time and feeding clear optimization signals back into the generator. Finally, they built a fast simulation environment, BEV-Warp, so the system can be trained and tested at large scale.
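The generate-then-rerank loop described above can be sketched in a few lines. This is a toy illustration only: the function names, trajectory shapes, and the stand-in scoring rule are assumptions for the sketch, not the paper's actual implementation (the real generator is a diffusion model and the real discriminator is RL-trained).

```python
import numpy as np

def generate_candidates(rng, num_candidates=8, horizon=6):
    """Stand-in for the diffusion generator: sample diverse
    candidate trajectories, each a sequence of (x, y) waypoints."""
    return rng.normal(size=(num_candidates, horizon, 2))

def discriminator_score(trajectory):
    """Stand-in for the RL-trained discriminator: return a scalar
    estimate of long-term driving quality. Here we use a toy proxy
    that penalizes jerky motion (large waypoint-to-waypoint jumps)."""
    steps = np.diff(trajectory, axis=0)
    return -float(np.sum(steps ** 2))

def plan(rng):
    """Generate candidates, score each one, execute the best."""
    candidates = generate_candidates(rng)
    scores = [discriminator_score(t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best]

best_trajectory = plan(np.random.default_rng(0))
print(best_trajectory.shape)  # a single (horizon, 2) trajectory
```

The key design point is the decoupling: the generator only has to cover plausible behaviors, while the single scalar quality judgment is pushed into the discriminator, which is much easier to optimize with sparse rewards than the full trajectory space.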

Why it matters?

In simulation, RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world tests show that cars running RAD-2 feel safer and drive more smoothly in busy city traffic, an important step toward making fully self-driving cars a reality.

Abstract

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
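The abstract's Temporally Consistent Group Relative Policy Optimization builds on the GRPO idea of scoring each sampled trajectory relative to its sibling samples rather than by absolute reward. A minimal sketch of the base group-relative advantage computation is below; the paper's temporal-consistency extension is not reproduced here, and the function is a generic illustration, not the authors' code.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard
    deviation, so candidates sampled from the same state are ranked
    relative to one another (GRPO-style) instead of by raw reward.
    eps guards against division by zero for constant-reward groups."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Trajectories with higher reward get positive advantage,
# lower-reward siblings get negative advantage.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because advantages are centered within each group, roughly half the samples push the policy up and half push it down, which supplies the corrective negative feedback that pure imitation learning lacks.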