UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia
2025-12-09
Summary
This paper introduces UnityVideo, a system designed to create more realistic and consistent videos by building a richer understanding of the world depicted in them than previous methods.
What's the problem?
Current video generation models are limited because they usually condition on only one type of information at a time, such as the raw video frames. They don't fully grasp what's happening in a scene because they ignore cues like object shapes, how people move, and the depth of the scene. Without this fuller picture of the world, the generated videos can look unnatural or inconsistent.
What's the solution?
The researchers developed UnityVideo, which combines multiple types of information at once: object outlines (segmentation masks), the positions of joints in a person's body (skeletons and DensePose), how things move (optical flow), and depth maps. Two components make this possible: a "dynamic noising" scheme that lets different training paradigms share one model, and a "modality switcher" with an in-context learner that routes each data type through its own modular parameters. The researchers also built a large unified dataset of 1.3 million examples to train the system. This lets UnityVideo learn how all these signals relate to each other and generate videos that are more realistic and better follow the rules of the physical world.
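To make the "modality switcher" idea concrete, here is a minimal sketch of one plausible design: each conditioning modality (depth, optical flow, skeleton, and so on) gets its own small set of parameters, and the switcher routes an input through the parameters for its modality before a shared backbone would see it. The class name, shapes, and projection design here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class ModalitySwitcher:
    """Routes each modality through its own (modular) projection parameters.

    Hypothetical sketch: one projection matrix per modality, shared output
    space so a single backbone can consume any modality's tokens.
    """

    def __init__(self, modalities, cond_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Modular parameters: one projection matrix per modality.
        self.proj = {m: rng.normal(0.0, 0.02, (cond_dim, hidden_dim))
                     for m in modalities}

    def __call__(self, modality, cond):
        # cond: (tokens, cond_dim) conditioning signal for one modality.
        if modality not in self.proj:
            raise KeyError(f"unknown modality: {modality}")
        return cond @ self.proj[modality]  # (tokens, hidden_dim)

switcher = ModalitySwitcher(
    ["segmentation", "skeleton", "densepose", "optical_flow", "depth"],
    cond_dim=16, hidden_dim=32)

depth_tokens = np.ones((4, 16))      # 4 dummy depth-map tokens
out = switcher("depth", depth_tokens)
print(out.shape)  # (4, 32)
```

The design choice worth noting is that all modalities project into the same hidden dimension, which is what lets heterogeneous conditions be processed uniformly downstream.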
Why it matters?
This research is important because it moves us closer to being able to automatically generate high-quality videos that are truly believable. This has potential applications in areas like creating special effects for movies, developing realistic training simulations, or even generating personalized video content.
Abstract
Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
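One way to read the abstract's "dynamic noising to unify heterogeneous training paradigms" is that, per training sample, the model chooses which modalities are noised (and thus generated) and which stay clean (and thus act as conditions), so generation, condition-to-video, and video-to-condition training all share one objective. The sketch below illustrates that reading under a simple linear-interpolation noising; the paradigm names and noising rule are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def dynamic_noising(sample, paradigm, t, rng):
    """Noise a subset of modalities depending on the training paradigm.

    sample:   dict modality -> array (e.g. {"video": ..., "depth": ...})
    paradigm: "joint" (noise all), "x2video" (noise only video),
              or "video2x" (noise everything except video) -- illustrative.
    t:        noise level in [0, 1]; t=0 keeps data clean, t=1 is pure noise.
    """
    noised = {}
    for name, x in sample.items():
        if paradigm == "joint":            # joint generation of all modalities
            noise_this = True
        elif paradigm == "x2video":        # conditions clean, video noised
            noise_this = (name == "video")
        else:                              # "video2x": video clean, rest noised
            noise_this = (name != "video")
        if noise_this:
            eps = rng.normal(size=x.shape)
            noised[name] = (1 - t) * x + t * eps  # interpolate toward noise
        else:
            noised[name] = x                      # left clean as a condition
    return noised

rng = np.random.default_rng(0)
sample = {"video": np.zeros((2, 3)), "depth": np.zeros((2, 3))}
out = dynamic_noising(sample, "x2video", t=0.5, rng=rng)
print(np.allclose(out["depth"], 0.0))  # depth kept clean as a condition
```

Because the same forward pass handles every paradigm, which role a modality plays is decided by the noising pattern alone, which is what would let one set of weights serve all the training setups.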