Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian
2025-07-11
Summary
This paper introduces Geometry Forcing, a new method that helps video diffusion models understand 3D structure and generate videos that stay consistent with the real 3D world.
What's the problem?
When video diffusion models are trained only on raw videos, they tend to learn 2D appearance without ever capturing the 3D shapes and spaces behind what we see. As a result, their videos can look strange or physically impossible when the camera moves.
What's the solution?
The researchers address this by aligning the model's intermediate representations with 3D-aware features from a pretrained model that already understands geometry. The alignment works on two levels: it guides the model to match both the direction and the magnitude of those 3D features as it generates the video, so its internal representation lines up with how 3D scenes actually behave.
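The direction-and-magnitude alignment described above can be sketched as two simple losses: a cosine-based term for direction and a norm-based term for scale. This is a minimal illustration, not the paper's implementation; the function names, the loss weighting, and the random NumPy features standing in for diffusion and geometric representations are all assumptions for the sketch.

```python
import numpy as np

def angular_alignment_loss(z, g):
    """Directional term: cosine distance between the model's intermediate
    features z and pretrained geometric features g (illustrative sketch)."""
    z_n = z / np.linalg.norm(z, axis=-1, keepdims=True)
    g_n = g / np.linalg.norm(g, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z_n * g_n, axis=-1)))

def scale_alignment_loss(z, g):
    """Magnitude term: penalizes mismatch between the feature norms,
    so the model also preserves the *size* of the 3D features."""
    return float(np.mean((np.linalg.norm(z, axis=-1)
                          - np.linalg.norm(g, axis=-1)) ** 2))

# Toy example: 8 tokens with 16-dimensional features.
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16))   # stand-in for diffusion model features
geo = rng.normal(size=(8, 16))    # stand-in for pretrained 3D features

# A combined alignment objective (equal weighting is an assumption).
total = angular_alignment_loss(feat, geo) + scale_alignment_loss(feat, geo)
```

Identical features yield zero loss for both terms, and features pointing in opposite directions maximize the angular term, which is the behavior an alignment objective needs.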
Why does it matter?
This matters because future video models can produce far more realistic and stable videos, such as walking through a room or rotating an object, since the model now understands not just images but the 3D structure behind them.
Abstract
Geometry Forcing enhances video diffusion models by aligning their intermediate representations with geometric features from a pretrained model, improving visual quality and 3D consistency.