
Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal

2025-11-24


Summary

This paper introduces a new way to plan out the movements in videos generated by AI, making those movements look more realistic and follow instructions better.

What's the problem?

Currently, AI video generation struggles to make things move naturally and consistently over time. Existing methods either plan simple movements upfront or repeatedly ask the video generator to refine its output. Simple upfront plans can't capture complex motion, and repeated refinement is slow and computationally expensive.

What's the solution?

The researchers developed a system called SketchVerify that improves motion planning *before* the full video is created. It works by generating several possible movement plans, then quickly checking how well each plan matches the instructions and looks physically realistic. It does this by creating quick, simple 'sketches' of the movement instead of fully rendering each option, saving a lot of time. The best plan is then used to create the final video.
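The sample-and-verify loop described above can be sketched in code. This is a minimal illustrative stand-in, not the paper's implementation: the planner, the sketch renderer, and the verifier here are toy placeholders (real SketchVerify uses a trajectory planner, composites objects over a static background, and scores with a vision-language model), and all function names and the smoothness heuristic are assumptions for illustration.

```python
import random

def propose_trajectories(prompt, n):
    # Toy planner stand-in: each "trajectory" is a list of (t, y) waypoints.
    return [[(t, random.random()) for t in range(5)] for _ in range(n)]

def render_sketch(trajectory):
    # Stand-in for the lightweight sketch (object composited over a static
    # background); here the "sketch" is just the trajectory itself.
    return trajectory

def verifier_score(sketch, prompt):
    # Stand-in for the vision-language verifier: penalize large
    # frame-to-frame jumps as a crude proxy for physical plausibility.
    jumps = [abs(b[1] - a[1]) for a, b in zip(sketch, sketch[1:])]
    return -sum(jumps)

def plan_with_verification(prompt, n_candidates=8, threshold=-0.5, max_rounds=3):
    # Sample candidate plans, score their cheap sketches, and keep the best;
    # repeat until a satisfactory plan is found or the round budget runs out.
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for plan in propose_trajectories(prompt, n_candidates):
            score = verifier_score(render_sketch(plan), prompt)
            if score > best_score:
                best_plan, best_score = plan, score
        if best_score >= threshold:
            break  # satisfactory plan found; hand it to the video generator
    return best_plan, best_score

random.seed(0)
plan, score = plan_with_verification("a ball rolls off a table")
print(len(plan), score <= 0)
```

The key efficiency idea is that `verifier_score` runs on the cheap sketch rather than on a fully generated video, so many candidates can be ranked per round before the expensive trajectory-conditioned generator is called once.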

Why it matters?

This research is important because it allows for the creation of more believable and complex videos with AI, while also being much more efficient than previous methods. Better motion planning means videos can show more dynamic and realistic actions, and the speed improvements make it more practical to generate these videos.

Abstract

Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement which requires multiple calls to the video generator, incurring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.