Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang

2026-03-24

Summary

This paper focuses on improving how we create videos using artificial intelligence, specifically a technique called Group Relative Policy Optimization (GRPO). Currently, GRPO works really well for things like generating text and images, but it struggles with video because videos are much more complex.
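The "group relative" part of GRPO is simple to state: instead of learning a separate value function, each sampled output is scored against the other outputs in its own group. A minimal sketch of that idea (the function name and the exact normalization constant are illustrative, not taken from the paper):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: compare each rollout's reward to the
    mean and standard deviation of its group, so rollouts that beat their
    peers get positive advantages and no learned critic is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

A group of rewards like `[1.0, 2.0, 3.0]` yields advantages centered on zero, with the best rollout pushed up and the worst pushed down.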

What's the problem?

Generating realistic and consistent videos is hard for GRPO because the process of exploring different possibilities introduces a lot of random noise. This noise messes up the quality of the videos being created and makes it difficult for the AI to learn what makes a 'good' video, ultimately leading to unstable results and videos that don't quite look right.

What's the solution?

The researchers came up with a new method called SAGE-GRPO, which stands for Stable Alignment via Exploration. The core idea is to keep the AI's exploration focused on creating videos that are similar to real videos it has already learned from. They do this in two ways: first, by carefully controlling the amount of randomness introduced during video creation, and second, by making sure the AI doesn't drift too far away from the 'realistic video' patterns it already knows. They use mathematical techniques to refine the exploration process and keep it stable over longer video sequences.
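The two levels of constraint can be caricatured in a few lines of code. This is a minimal sketch of the general pattern, not the paper's implementation: the function names, the simple Euclidean projection, and the fixed noise scale are all assumptions made for illustration.

```python
import numpy as np

def sde_rollout_step(x, velocity, dt, noise_scale, rng):
    """Micro level (illustrative): follow the model's learned velocity field,
    then inject a small, controlled amount of exploration noise. Keeping
    noise_scale small keeps samples near the learned video manifold."""
    drift = x + velocity * dt
    noise = rng.standard_normal(x.shape) * noise_scale * np.sqrt(dt)
    return drift + noise

def trust_region_update(params, anchor, grad, lr, max_drift):
    """Macro level (illustrative): take a gradient step, then project the
    parameters back into a ball around a periodically refreshed anchor,
    which limits how far the policy can drift over many updates."""
    new = params - lr * grad
    delta = new - anchor
    norm = np.linalg.norm(delta)
    if norm > max_drift:
        new = anchor + delta * (max_drift / norm)
    return new
```

With `noise_scale = 0` the rollout step reduces to plain deterministic sampling, which is why dialing the noise up or down is the lever for trading exploration against staying on the manifold.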

Why it matters?

This research is important because it brings video generation AI closer to the quality of text and image generation AI. Better video generation has a lot of potential applications, from creating special effects in movies to generating personalized content and even helping with scientific visualization. By making the process more reliable and producing higher-quality videos, SAGE-GRPO is a step towards unlocking those possibilities.

Abstract

Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
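The "gradient norm equalizer" mentioned in the abstract suggests rescaling per-timestep gradients to a common magnitude so that no single timestep dominates the update. A hedged sketch of that generic idea (the function and its parameters are assumptions for illustration, not the paper's formulation):

```python
import numpy as np

def equalize_gradient_norm(grad, target_norm=1.0, eps=1e-8):
    """Rescale one timestep's gradient to a shared target norm, so updates
    are balanced across timesteps regardless of their raw gradient scale."""
    norm = np.linalg.norm(grad)
    return grad * (target_norm / (norm + eps))
```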