World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang
2026-04-28
Summary
This paper introduces a new method, called World-R1, to make videos generated by artificial intelligence look more realistic by ensuring the objects and scenes within them follow the rules of 3D space.
What's the problem?
Current AI models that generate video are very good at making things *look* appealing, but they often struggle to get the geometry right: objects can appear distorted or seem to float, and perspectives can shift inconsistently, which makes the videos feel unnatural. Previous attempts to fix this changed the core architecture of the model to inject 3D knowledge, which is computationally expensive and limits how well the approach scales to more complex videos.
What's the solution?
The researchers used reinforcement learning to train the video-generating AI to better follow 3D rules. To support this training, they built a dedicated text-only dataset of prompts focused on how things behave in the physical world. Importantly, they did not change the AI's underlying architecture; instead, they used feedback from other pre-trained models that *are* good at 3D understanding and vision-language reasoning to score the generated videos and guide the generator. They also adopted a training schedule that periodically alternates between emphasizing strict 3D accuracy and allowing more natural, fluid motion.
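The paper itself does not include code here, but the basic idea of scoring sampled videos with external judges and ranking them against each other (as in GRPO-style policy optimization) can be illustrated with a minimal sketch. The functions `geometry_reward` and `vlm_reward` below are hypothetical placeholders for the pre-trained 3D foundation model and vision-language model the authors use as reward sources; the weights and shapes are assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-ins for the pre-trained judges described in the paper:
# a 3D foundation model scoring geometric consistency and a vision-language
# model scoring prompt alignment. Real versions would run those models on
# decoded video frames rather than return random scores.
def geometry_reward(video: np.ndarray) -> float:
    return float(np.random.uniform(0, 1))  # placeholder score in [0, 1]

def vlm_reward(video: np.ndarray, prompt: str) -> float:
    return float(np.random.uniform(0, 1))  # placeholder score in [0, 1]

def grpo_advantages(prompt: str, videos: list, w_geo: float, w_vlm: float):
    """Combine the two reward signals and normalize within the group of
    samples drawn for the same prompt (group-relative advantages)."""
    rewards = np.array([
        w_geo * geometry_reward(v) + w_vlm * vlm_reward(v, prompt)
        for v in videos
    ])
    # Each sample is scored relative to its siblings from the same prompt.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 candidate videos (frames x H x W x C) sampled for one prompt.
videos = [np.zeros((16, 64, 64, 3)) for _ in range(4)]
adv = grpo_advantages("a ball rolling down a ramp", videos, w_geo=0.7, w_vlm=0.3)
print(adv)  # advantages that would weight the policy update of the generator
```

In the actual method these advantages would reweight a Flow-GRPO update of the video model; the sketch only shows how multiple reward models can be combined without touching the generator's architecture.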
Why it matters?
This work is important because it allows for the creation of more believable and immersive videos without sacrificing the quality or efficiency of the AI model. It’s a step towards being able to generate complex, realistic virtual worlds that are consistent and don't have those jarring geometric errors, bridging the gap between simply making pretty pictures and actually simulating a world.
Abstract
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
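The abstract mentions a periodic decoupled training strategy that balances rigid geometric consistency with dynamic scene fluidity. The paper does not specify the schedule here, but one simple way to realize such alternation is to periodically switch the relative weight of the two objectives; the period and weight values below are assumptions, not values from the paper.

```python
def decoupled_weights(step: int, period: int = 1000):
    """Toy schedule for a periodic decoupled strategy: alternate between
    phases emphasizing rigid 3D consistency and phases emphasizing dynamic
    scene fluidity. Period and weights are illustrative assumptions."""
    if (step // period) % 2 == 0:
        return {"w_geo": 1.0, "w_dyn": 0.1}  # geometry-focused phase
    return {"w_geo": 0.1, "w_dyn": 1.0}      # motion/fluidity-focused phase

for step in (0, 500, 1000, 1500, 2000):
    print(step, decoupled_weights(step))
```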