Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano

2026-01-16

Summary

This paper addresses the problem that AI-generated videos often look unrealistic because they don't follow the basic rules of physics, like how objects should fall or move.

What's the problem?

Current AI models that generate videos are really good at making things *look* visually appealing, but they frequently create scenes that are physically impossible or don't make sense in the real world. This isn't just because the AI hasn't 'learned' enough about physics from the data it was trained on; it's also because of *how* the AI creates the videos in the first place – the process it uses to build each frame isn't optimized for realistic physics.

What's the solution?

The researchers introduced a system called WMReward. Think of it as giving the AI a 'physics grade' as it's making the video. They used a pre-existing AI model that *does* understand physics well (called VJEPA-2) to evaluate candidate versions of the video as it's being created. The generator tries out multiple denoising paths and keeps the ones that get the best 'physics grade' from VJEPA-2. This trades extra computing power at generation time for a more physically realistic video, without retraining the model.
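The core idea can be sketched as a best-of-N search: generate several candidates, score each with a plausibility reward, and keep the winner. This is a minimal toy illustration, not the paper's actual implementation; `wm_reward` below is a hypothetical stand-in for the VJEPA-2-based reward (here it just penalizes abrupt frame-to-frame jumps), and `sample_video` stands in for the diffusion model's denoising trajectories.

```python
import numpy as np

def wm_reward(candidate):
    # Toy proxy for a latent world-model reward (VJEPA-2 in the paper):
    # penalize large frame-to-frame changes as "implausible" motion.
    diffs = np.diff(candidate, axis=0)
    return -float(np.mean(diffs ** 2))

def best_of_n_search(sample_fn, n_candidates, reward_fn):
    """Draw N candidate videos and keep the one with the highest reward."""
    candidates = [sample_fn() for _ in range(n_candidates)]
    scores = [reward_fn(c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

rng = np.random.default_rng(0)

def sample_video():
    # Toy "generator": a random walk of 8 frames of 4x4 grayscale pixels.
    steps = rng.normal(size=(8, 4, 4))
    return np.cumsum(steps, axis=0)

video, score = best_of_n_search(sample_video, n_candidates=16,
                                reward_fn=wm_reward)
print(video.shape, round(score, 3))
```

Spending more compute simply means raising `n_candidates` (or, as in the paper, steering intermediate denoising steps rather than only ranking finished videos).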

Why it matters?

This work is important because it shows a practical way to significantly improve the realism of AI-generated videos. The authors won the ICCV 2025 Perception Test PhysicsIQ Challenge, beating the previous state of the art by 7.42%, which demonstrates that the method outperforms existing approaches on that benchmark. It also suggests that we can improve existing video generation models without retraining them, just by changing *how* they generate videos at inference time, making it a more practical solution.

Abstract

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling of test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.