DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty
2026-02-09
Summary
This paper introduces DreamDojo, a generalist robot 'world model' that helps robots learn to interact with the world more effectively by 'imagining' the outcomes of their actions before taking them. It is a step towards robots that can handle a wide variety of tasks without needing to be specifically programmed for each one.
What's the problem?
Teaching robots to do complex things, especially tasks that require fine motor skills like manipulating objects, is really hard. This is because robots need a lot of data to learn, and getting that data – especially information about *what* actions lead to *what* results – is expensive and time-consuming. Existing methods struggle to generalize to new situations because they haven't seen enough examples of different environments and actions.
What's the solution?
The researchers built DreamDojo by pretraining a 'world model' on a massive dataset of 44,000 hours of egocentric videos showing people interacting with everyday objects. Instead of relying on labeled actions (like 'pick up cup'), they used 'continuous latent actions': compact action representations inferred directly from unlabeled video, which let the model learn how actions relate to outcomes without needing action labels. They then post-trained the model on a small amount of robot-specific data, and finally applied a distillation pipeline that both accelerates it to real-time speed (10.81 frames per second) and improves its consistency over long contexts.
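To make the 'continuous latent actions' idea concrete, here is a minimal sketch of the underlying training signal, under generic assumptions: an inverse-dynamics-style encoder compresses what changed between two consecutive frames into a small continuous vector, and a forward model must predict the next frame from the current frame plus that vector, so the vector ends up acting like an action label even though none was given. The names (`LatentActionEncoder`, `ForwardDynamics`), dimensions, and MLP architecture below are illustrative assumptions; the actual DreamDojo model is a large video generator, not a pair of MLPs.

```python
# Minimal sketch (not the authors' code) of learning continuous latent actions
# from unlabeled video: an encoder infers a small continuous vector z_t from
# consecutive frames, and a forward model must reconstruct the next frame from
# the current frame plus z_t. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Infers a continuous latent action z_t from (frame_t, frame_{t+1}) features."""
    def __init__(self, frame_dim=512, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.GELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, frame_t, frame_next):
        return self.net(torch.cat([frame_t, frame_next], dim=-1))

class ForwardDynamics(nn.Module):
    """Predicts frame_{t+1} features from frame_t features and latent action z_t."""
    def __init__(self, frame_dim=512, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 512), nn.GELU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frame_t, z_t):
        return self.net(torch.cat([frame_t, z_t], dim=-1))

encoder, dynamics = LatentActionEncoder(), ForwardDynamics()
opt = torch.optim.AdamW(list(encoder.parameters()) + list(dynamics.parameters()), lr=1e-4)

# Pretraining step on unlabeled video: consecutive frame features (random stand-ins here).
frame_t = torch.randn(32, 512)
frame_next = torch.randn(32, 512)

z_t = encoder(frame_t, frame_next)       # continuous latent action, no labels needed
pred_next = dynamics(frame_t, z_t)       # world-model step conditioned on z_t
loss = nn.functional.mse_loss(pred_next, frame_next)
loss.backward()
opt.step()
```

At post-training time, the same forward model can then be conditioned on the target robot's real actions mapped into this latent space, which is roughly how interaction knowledge learned from unlabeled human video could transfer to the robot.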
Why it matters?
This work is important because it opens the door to more versatile and adaptable robots. By simulating the outcomes of their actions before executing them, robots can learn more efficiently, plan their behavior, be evaluated safely, and even be teleoperated live. DreamDojo shows promising results on challenging out-of-distribution benchmarks, suggesting it could be a key component in building robots that can handle the complexities of the real world and perform a wide range of tasks.
Abstract
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
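As a rough illustration of the model-based planning application mentioned in the abstract (this is not DreamDojo's actual interface), a simple random-shooting planner rolls candidate action sequences through a learned world model in imagination and executes the first action of the best-scoring sequence. `world_model`, `score_fn`, and the toy stand-ins below are placeholder assumptions that only serve to make the sketch runnable.

```python
# Hypothetical sketch of planning with a learned world model: sample candidate
# action sequences, roll each out in imagination, score the imagined outcomes,
# and return the first action of the best plan. Not DreamDojo's actual API.
import torch

def plan_with_world_model(world_model, score_fn, state, horizon=8, n_candidates=64, action_dim=16):
    """Random-shooting planner over imagined rollouts."""
    # Candidate action sequences: (n_candidates, horizon, action_dim)
    actions = torch.randn(n_candidates, horizon, action_dim)
    states = state.expand(n_candidates, -1)           # replicate the current state
    total_score = torch.zeros(n_candidates)
    for t in range(horizon):
        states = world_model(states, actions[:, t])   # imagined next state
        total_score += score_fn(states)               # e.g. similarity to a goal image
    best = total_score.argmax()
    return actions[best, 0]                           # first action of the best plan

# Toy stand-ins so the sketch runs end to end.
toy_world_model = lambda s, a: s + 0.1 * a.mean(dim=-1, keepdim=True)
toy_score = lambda s: -s.abs().sum(dim=-1)            # pretend the goal is the origin
first_action = plan_with_world_model(toy_world_model, toy_score, torch.randn(1, 16))
print(first_action.shape)  # torch.Size([16])
```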