DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan
2025-12-05
Summary
This paper introduces DynamicVerse, a framework for building detailed 4D (3D space plus time) models of the real world from ordinary videos found online. It combines visual information with geometric and language understanding to help computers better grasp how things move and exist in the physical world.
What's the problem?
Currently, it's hard for computers to truly understand real-world videos because the datasets they learn from are either created in unrealistic simulations or lack detailed descriptions. Existing methods for reconstructing 3D from video typically recover geometry only up to an unknown scale, so real-world sizes and motion can't be measured, and they provide little descriptive information about what's happening in the scene. This limits how well AI can interact with the real world the way humans do.
What's the solution?
The researchers developed DynamicVerse, a system that takes long videos and turns them into a comprehensive 4D model. They use large vision, geometric, and multimodal models to recover metric-scale shape, real-world motion, instance-level object masks, and detailed scene descriptions. A key technique combines window-based Bundle Adjustment with a global optimization step, stitching estimates from different parts of the video into one globally consistent model even over long sequences (sketched below). The result is a large-scale dataset of over 100,000 videos with more than 800,000 annotated masks and over 10 million frames.
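To make the window-stitching idea concrete, here is a minimal sketch of the underlying alignment primitive. It assumes each window has already been reconstructed independently (e.g., by per-window Bundle Adjustment) into camera-to-world poses, and chains consecutive windows by fitting a similarity transform (scale, rotation, translation) over their shared frames with the standard Umeyama method. The function names and the pairwise chaining are illustrative assumptions, not the paper's actual implementation, which performs a joint global optimization over all windows.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (here, camera
    centers of the frames shared by two windows); needs N >= 3.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)          # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def chain_windows(window_poses, overlap):
    """Stitch per-window camera-to-world poses into one trajectory.

    window_poses: list of (N, 4, 4) pose arrays, where consecutive
    windows share `overlap` frames. Each window is mapped into the
    coordinate frame of the chain built so far, fixing the per-window
    scale ambiguity of monocular reconstruction.
    """
    chained = window_poses[0].copy()
    for curr in window_poses[1:]:
        s, R, t = umeyama(curr[:overlap, :3, 3], chained[-overlap:, :3, 3])
        aligned = curr.copy()
        aligned[:, :3, :3] = R @ curr[:, :3, :3]            # rotate orientations
        aligned[:, :3, 3] = s * curr[:, :3, 3] @ R.T + t    # map camera centers
        chained = np.concatenate([chained, aligned[overlap:]], axis=0)
    return chained
```

This pairwise chaining accumulates drift over very long sequences, which is exactly why a final global optimization over all windows, as the paper describes, is needed on top of the per-window alignment.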
Why it matters?
This work is important because it provides a far more realistic and richly annotated dataset for training AI. By improving the ability of AI to understand real-world videos, it can lead to robots, self-driving cars, and virtual reality experiences that interact with the world in a more natural and intelligent way. The improved accuracy in estimating depth, camera pose, and camera intrinsics is a significant step forward.
Abstract
Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
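For context on what "physical-scale" means in the depth benchmark, below is a minimal sketch of the standard absolute-relative depth error. The `align_scale` switch marks the usual distinction: up-to-scale reconstructions are median-rescaled before scoring, while metric (physical-scale) predictions are compared directly. This is the conventional metric, not necessarily the paper's exact evaluation protocol.

```python
import numpy as np

def abs_rel(pred, gt, align_scale=False):
    """Absolute-relative depth error, a standard video-depth metric.

    pred, gt: depth maps of equal shape. With align_scale=True the
    prediction is first rescaled by the median ratio, as is customary
    for up-to-scale methods; with False, raw values are compared,
    which is only meaningful for metric-scale depth.
    """
    mask = gt > 0                      # score valid ground-truth pixels only
    pred, gt = pred[mask], gt[mask]
    if align_scale:
        pred = pred * np.median(gt) / np.median(pred)
    return float(np.mean(np.abs(pred - gt) / gt))
```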