Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou

2026-04-10

Summary

This paper introduces a new method for creating realistic videos that also follow the rules of physics, unlike many current video generation techniques.

What's the problem?

Recent video generation models can create videos that *look* real, but they often show unrealistic movement and behavior because they don't understand how things actually work in the real world, such as gravity or how objects bounce. Simply making the models bigger and feeding them more data doesn't automatically give them this understanding of physics.

What's the solution?

The researchers developed a model called Phantom that tries to solve this by figuring out the underlying physical properties of what's happening in a video *while* it's being generated. It creates a kind of abstract representation of the physics involved, allowing it to predict how things should move and behave realistically alongside generating the visual content of the video. Essentially, it's building physics knowledge directly into the video creation process.
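The paper's code isn't reproduced here, so the following is only a minimal PyTorch sketch of the idea under stated assumptions: an encoder infers an abstract "physics" latent from observed frames, a recurrent module rolls that latent forward in time, and each future frame is decoded from the visual features plus the predicted physics state. All module names, shapes, and the GRU-based dynamics are illustrative assumptions, not Phantom's actual architecture.

```python
# Minimal sketch of joint visual + latent-physics prediction.
# Everything here (PhantomSketch, frame_dim, phys_dim, GRU dynamics)
# is an illustrative assumption, not the paper's implementation.
import torch
import torch.nn as nn

class PhantomSketch(nn.Module):
    """Toy joint model: predict latent physics alongside future frame features."""

    def __init__(self, frame_dim: int = 256, phys_dim: int = 64):
        super().__init__()
        # Infers an abstract physical state from a frame's visual features.
        self.physics_encoder = nn.Linear(frame_dim, phys_dim)
        # Advances the latent physical state one step, driven by the current visuals.
        self.latent_dynamics = nn.GRUCell(input_size=frame_dim, hidden_size=phys_dim)
        # Decodes the next frame's features from visual context + predicted physics.
        self.frame_decoder = nn.Sequential(
            nn.Linear(frame_dim + phys_dim, frame_dim), nn.Tanh()
        )

    def forward(self, observed: torch.Tensor, horizon: int = 8):
        # observed: (batch, time, frame_dim) features of the conditioning frames.
        visual = observed[:, -1]              # last observed frame's features
        phys = self.physics_encoder(visual)   # inferred initial physical state
        frames, states = [], []
        for _ in range(horizon):
            phys = self.latent_dynamics(visual, phys)  # roll physics forward
            visual = self.frame_decoder(torch.cat([visual, phys], dim=-1))
            frames.append(visual)
            states.append(phys)
        # Both streams are predicted jointly, so each can supervise the other.
        return torch.stack(frames, dim=1), torch.stack(states, dim=1)

if __name__ == "__main__":
    model = PhantomSketch()
    observed = torch.randn(2, 4, 256)   # 2 clips, 4 conditioning frames
    future, physics = model(observed)
    print(future.shape, physics.shape)  # (2, 8, 256) (2, 8, 64)
```

The key design choice this sketch tries to capture is that the physics latent is inferred *during* generation rather than specified up front, so the model never needs an explicit list of physical properties.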

Why it matters?

This is important because it moves video generation beyond just making things *look* good to making them *act* realistically. This could be crucial for creating simulations, training AI agents, or even just making more believable special effects in movies and games. Phantom shows that explicitly considering physics leads to videos that are both visually appealing and physically accurate, outperforming other methods on physics-aware benchmarks while remaining competitive in visual quality.

Abstract

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
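The abstract's claim that Phantom jointly predicts latent physical dynamics alongside video content suggests a two-term training objective. The sketch below shows one plausible form, assuming plain MSE on both streams and a weighting coefficient `lam`; the paper's actual losses (for instance, a diffusion objective for the frames) may well differ.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_frames: torch.Tensor, target_frames: torch.Tensor,
               pred_phys: torch.Tensor, phys_targets: torch.Tensor,
               lam: float = 0.5) -> torch.Tensor:
    # Visual term: how well the generated frames match ground truth.
    visual_term = F.mse_loss(pred_frames, target_frames)
    # Physics term: consistency of the predicted latent dynamics.
    # `phys_targets` is an assumption here, e.g. physics latents inferred
    # from the true future frames by the same encoder (a self-supervised target).
    physics_term = F.mse_loss(pred_phys, phys_targets)
    # `lam` balancing the two streams is a hypothetical hyperparameter.
    return visual_term + lam * physics_term
```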