
Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

2026-01-02

Summary

This paper introduces a new method called Dream2Flow that helps robots manipulate objects by following videos produced by a video generation model. Given a picture of the scene and a task instruction, the model 'dreams' up a video of the task being performed, and the robot turns that imagined motion into real-world actions, even for tasks it was never specifically programmed to do.

What's the problem?

Currently, it's difficult for robots to translate the movements shown in generated videos into the precise commands their own motors and joints need. Video models can *show* plausible object movements, but a generated video may feature a human hand or no robot at all, and it says nothing about low-level robot actions. This mismatch between what the video depicts and the robot's own body is called the 'embodiment gap'.

What's the solution?

Dream2Flow solves this with a middle step: it reconstructs the 3D motion of the objects *from* the generated video. Instead of trying to copy the video's actions directly, the system figures out *where* the objects need to go in 3D space. Then it computes robot actions that move the objects along that 3D trajectory, using either trajectory optimization or reinforcement learning. This separates *what* needs to happen (the object's motion) from *how* the robot makes it happen (motor commands), as sketched in the example below.
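
To make the pipeline concrete, here is a rough sketch (not the authors' code) of the three stages described above. All function names, array shapes, and dummy outputs are hypothetical stand-ins: a real system would plug in a pre-trained video generation model, a 3D reconstruction and point-tracking module, and a trajectory optimizer or reinforcement-learning policy.

```python
import numpy as np

# Hypothetical stand-ins for the three stages of the pipeline.
# None of these names come from the paper; they only mirror the split
# between "what should move" (object flow) and "how the robot moves it".

def generate_video(initial_image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for a pre-trained image+text-conditioned video model.
    Here it just repeats the initial frame; shape (frames, H, W, 3)."""
    return np.repeat(initial_image[None], 16, axis=0)

def reconstruct_3d_object_flow(video: np.ndarray) -> np.ndarray:
    """Stand-in for 3D reconstruction / point tracking on the generated
    video: per-frame 3D positions of tracked object points,
    shape (frames, n_points, 3)."""
    return np.zeros((video.shape[0], 32, 3))

def flow_to_robot_commands(object_flow: np.ndarray) -> np.ndarray:
    """Stand-in for the control stage (trajectory optimization or RL)
    that turns the desired object motion into low-level commands,
    e.g. one 7-DoF joint target per frame."""
    return np.zeros((object_flow.shape[0], 7))

if __name__ == "__main__":
    scene_image = np.zeros((64, 64, 3), dtype=np.uint8)
    video = generate_video(scene_image, "put the mug on the shelf")
    flow = reconstruct_3d_object_flow(video)
    commands = flow_to_robot_commands(flow)
    print(video.shape, flow.shape, commands.shape)
```

The key design point is that only the last function depends on the robot's body; the video model and the 3D object flow describe the task purely in terms of how the objects should move.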

Why it matters?

This is important because it allows robots to acquire new manipulation skills from generated videos alone, without anyone collecting task-specific demonstrations. It works with many types of objects – rigid, articulated, deformable, and granular – making it a versatile approach to open-world robotics, where robots encounter a wide variety of situations and objects.

Abstract

Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories, including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at https://dream2flow.github.io/.
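
To illustrate the "object trajectory tracking" formulation, here is a minimal numerical sketch. The toy rigid-object dynamics and the random-shooting optimizer below are assumptions standing in for the paper's trajectory optimization or reinforcement learning; only the idea of minimizing the distance between achieved object points and the reconstructed 3D object flow is taken from the text.

```python
import numpy as np

def flow_tracking_cost(target_flow: np.ndarray, achieved: np.ndarray) -> float:
    """Mean squared 3D distance between where the tracked object points
    ended up and where the reconstructed flow says they should be."""
    assert target_flow.shape == achieved.shape  # (T, N, 3)
    return float(np.mean(np.sum((achieved - target_flow) ** 2, axis=-1)))

def rollout(initial_points: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Toy dynamics: each action is a 3D translation applied to every point
    (a rigid object being pushed). Stands in for a real robot or simulator."""
    points, trajectory = initial_points, []
    for a in actions:
        points = points + a
        trajectory.append(points)
    return np.stack(trajectory)  # (T, N, 3)

def optimize_actions(initial_points, target_flow, n_samples=256, seed=0):
    """Random-shooting trajectory optimization: sample action sequences and
    keep the one whose rollout best tracks the reconstructed flow."""
    rng = np.random.default_rng(seed)
    horizon = target_flow.shape[0]
    best_cost, best_actions = np.inf, None
    for _ in range(n_samples):
        actions = rng.normal(scale=0.02, size=(horizon, 3))
        cost = flow_tracking_cost(target_flow, rollout(initial_points, actions))
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost

if __name__ == "__main__":
    # Pretend the reconstructed flow says: 10 tracked points should drift
    # 10 cm along +x over 10 steps.
    rng = np.random.default_rng(1)
    initial = rng.uniform(-0.05, 0.05, size=(10, 3))
    target = np.stack([initial + [0.01 * (t + 1), 0.0, 0.0] for t in range(10)])
    actions, cost = optimize_actions(initial, target)
    print(f"best tracking cost: {cost:.5f}")
```

A real controller would replace the toy rollout with the robot's actual dynamics and optimize joint-space actions, but the objective stays the same: follow the 3D object flow, regardless of which embodiment appears in the generated video.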