Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun

2026-01-12

Summary

This paper introduces a new way to tell robots what to do, using the idea of 'forces' instead of just words or pictures. The aim is to give robots a working understanding of how the physical world behaves so they can plan actions more effectively.

What's the problem?

Currently, it's hard to give robots clear instructions. Telling them what you want with words is often too vague for physical tasks, and showing them a final picture doesn't help them figure out *how* to get there, especially if things are moving. Existing methods struggle with the complexities of real-world physics and dynamic situations.

What's the solution?

The researchers developed a system called 'Goal Force'. Instead of telling a robot the end goal directly, you specify the forces that should be applied and how things should move along the way. They trained a video generation model using simple physics simulations – things like collisions and falling objects – to learn how forces affect objects over time. Surprisingly, this model then worked well on much more complicated, real-world tasks like using tools and setting up chain reactions.
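To make the idea of a force-based goal concrete, here is a minimal sketch of one plausible way to encode a force vector applied at a point as a spatial conditioning map that a video model could consume alongside its input frames. The function name, Gaussian footprint, and two-channel layout are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def force_conditioning_map(height, width, point, force_vec, sigma=5.0):
    """Hypothetical encoding: spread a force vector (fx, fy) applied at a
    pixel location into a 2-channel map using a Gaussian footprint.
    A video model could take this as an extra conditioning input."""
    ys, xs = np.mgrid[0:height, 0:width]
    py, px = point
    # Gaussian weight centered on the application point
    gauss = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
    fx, fy = force_vec
    # Channel 0 carries the x-component, channel 1 the y-component
    return np.stack([gauss * fx, gauss * fy], axis=0)  # shape (2, H, W)

# Example: a rightward-and-upward push at pixel (32, 16)
cond = force_conditioning_map(64, 64, point=(32, 16), force_vec=(1.0, -0.5))
```

Each channel peaks at the application point with the corresponding force component as its value, so the map tells the model both where and in which direction the push should happen.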

Why it matters?

This is important because it allows robots to plan actions based on a better understanding of physics. Instead of needing a separate physics engine to figure out if a plan will work, the robot's 'brain' (the video generation model) learns to simulate physics itself. This means robots can plan more accurately and reliably, and it opens the door to more complex and adaptable robotic systems.

Abstract

Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominos, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.