VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla
2026-04-01
Summary
This paper tackles the problem of generating realistic videos with AI, focusing on keeping the geometry and motion within those videos correct and consistent.
What's the problem?
Current AI models that generate videos are really good at making things *look* visually appealing, but they often struggle to make objects and cameras move in a physically plausible way. Previous attempts to fix this either changed the core structure of the AI, which could make it less versatile, or relied on slow extra computation (repeatedly decoding the video back into pixels just to score it) that didn't generalize to fast-moving, dynamic scenes.
What's the solution?
The researchers developed a new method called VGGRPO that improves video consistency without altering the original AI model. It works by connecting the AI's internal representation of the video to a separate model that understands 3D geometry, which lets the system read accurate 3D scene structure directly from the video's latent data, even when things are moving quickly. They then applied a reinforcement learning technique called Group Relative Policy Optimization (GRPO), rewarding the AI for smooth camera movements and for geometry that stays consistent across different viewpoints.
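To make the two reward signals concrete, here is a minimal NumPy sketch of what "smooth camera motion" and "cross-view geometric consistency" rewards could look like. The function names, the exponential reward shaping, and the exact formulas are illustrative assumptions, not the paper's implementation (which operates on diffusion latents via its Latent Geometry Model):

```python
import numpy as np

def camera_smoothness_reward(positions: np.ndarray) -> float:
    """Penalize jittery trajectories via second differences of camera
    positions (shape [T, 3]). Higher reward = smoother motion.
    (Illustrative formula, not the paper's.)"""
    accel = np.diff(positions, n=2, axis=0)        # discrete acceleration
    jitter = np.linalg.norm(accel, axis=1).mean()  # mean acceleration magnitude
    return float(np.exp(-jitter))                  # map to (0, 1]

def reprojection_consistency_reward(depth_a: np.ndarray, K: np.ndarray,
                                    T_ab: np.ndarray,
                                    depth_b: np.ndarray) -> float:
    """Project pixels from view A into view B using A's depth map and the
    relative pose T_ab (4x4), then compare the predicted depth against
    B's depth map. (Illustrative formula, not the paper's.)"""
    H, W = depth_a.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project pixels to 3D in camera A, then transform into camera B.
    pts_a = (np.linalg.inv(K) @ pix.T) * depth_a.reshape(1, -1)
    pts_b = T_ab[:3, :3] @ pts_a + T_ab[:3, 3:4]
    proj = K @ pts_b
    z = proj[2]
    uv = np.round(proj[:2] / z).astype(int)
    # Keep points that land inside view B with positive depth.
    valid = (z > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    if not valid.any():
        return 0.0
    err = np.abs(z[valid] - depth_b[uv[1][valid], uv[0][valid]])
    return float(np.exp(-err.mean()))
```

A perfectly smooth (constant-velocity) trajectory yields a smoothness reward of 1, and two views of a static scene that agree exactly under reprojection yield a consistency reward of 1; any jitter or depth disagreement pushes the rewards toward 0.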
Why it matters?
This research is important because it allows for the creation of more realistic and immersive videos using AI. By making the videos geometrically sound and stable, it opens up possibilities for applications like virtual reality, special effects, and even robotics where understanding the 3D world is crucial. Plus, it does so efficiently, avoiding the slow and computationally expensive methods used before.
Abstract
Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
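The "Group Relative Policy Optimization" step in the abstract scores a group of sampled videos per prompt and standardizes their rewards within the group, so each sample is judged relative to its siblings rather than against an absolute baseline. Below is a minimal sketch of that group-relative advantage together with a combined reward; the 0.5/0.5 weighting and the helper names are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within one sampled group (GRPO-style):
    advantage_i = (r_i - mean) / (std + eps)."""
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + 1e-8)

def combined_reward(smooth_r: float, geo_r: float, w: float = 0.5) -> float:
    # Weighted sum of the camera-smoothness and geometry-consistency
    # rewards; the weight w = 0.5 is an illustrative assumption.
    return w * smooth_r + (1.0 - w) * geo_r
```

Because the advantages are zero-mean within each group, above-average samples in the group get positive learning signal and below-average ones get negative signal, which is what makes per-group sampling sufficient without a learned value function.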