
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan

2026-03-13

Summary

This paper introduces DreamVideo-Omni, a system for generating videos with fine-grained control over what happens in them, with a particular focus on managing multiple subjects and their movements at once.

What's the problem?

AI video generation has improved rapidly, but precisely controlling what happens in a video is still hard. Existing methods struggle to control multiple people at once, to make their movements look natural at different levels of detail (from whole-body motion down to small gestures), and to keep each person looking like themselves throughout the video. The result is often vague or ambiguous motion, or people drifting away from their intended appearance.

What's the solution?

DreamVideo-Omni tackles this in two stages. First, it learns to follow several kinds of instructions at the same time: how each person looks, large-scale motion, fine-grained dynamics, and camera movement. A special positional coding scheme helps these different instruction types work together without interfering, and a dedicated injection strategy strengthens the guidance for the main movements. To keep track of who should do what, it assigns unique 'tags' that tie each motion instruction to a specific person (a toy sketch of this tagging idea follows below). Second, it refines the model with a reward system that checks whether people still look like themselves, rewarding the AI for keeping identities consistent and natural even while they move.
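To make the 'tagging' idea concrete, here is a minimal, hypothetical sketch in PyTorch. The class name SubjectTagger, the shapes, and the two roles (appearance vs. motion) are illustrative assumptions rather than the paper's actual group and role embeddings; the sketch only shows how learned per-subject and per-role embeddings could be added to condition tokens so attention can match motion signals to identities.

```python
import torch
import torch.nn as nn

class SubjectTagger(nn.Module):
    """Toy sketch (not the paper's code): add a learned 'group' embedding per
    subject and a 'role' embedding per condition type to each token, so that
    downstream attention can tell which motion signal belongs to which person."""

    def __init__(self, dim, max_subjects=4, num_roles=2):
        super().__init__()
        self.group_embed = nn.Embedding(max_subjects, dim)  # who: subject index
        self.role_embed = nn.Embedding(num_roles, dim)      # what: 0=appearance, 1=motion

    def forward(self, tokens, subject_ids, role_ids):
        # tokens: (batch, seq, dim); subject_ids, role_ids: (batch, seq) integer tags
        return tokens + self.group_embed(subject_ids) + self.role_embed(role_ids)

# Usage sketch: six condition tokens, three per subject.
tagger = SubjectTagger(dim=64)
tokens = torch.randn(1, 6, 64)
subject_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])  # which person each token describes
role_ids = torch.tensor([[0, 1, 1, 0, 1, 1]])     # appearance vs. motion token
tagged = tagger(tokens, subject_ids, role_ids)    # same shape, now identity-anchored
```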

Why it matters?

This research is important because it brings us closer to easily creating realistic, customized videos with AI. Imagine directing a video with specific actions for each person and having the AI execute them while keeping everyone's appearance intact. This has potential applications in filmmaking, special effects, and personalized content creation.

Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
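For intuition, here is a minimal, hypothetical sketch of the latent identity reward signal described above. The placeholder reward_model, the pooled latent shapes, and the cosine-similarity scoring are assumptions made for illustration; the paper trains its actual reward model on a pretrained video diffusion backbone and aligns it with human preferences.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder for the latent identity reward model; the paper trains one on a
# pretrained video diffusion backbone, which this linear layer merely stands in for.
reward_model = nn.Linear(128, 32)

def latent_identity_reward(gen_latents, ref_latents):
    """Score identity preservation directly in latent space: embed generated and
    reference latents and compare them with cosine similarity (illustrative only)."""
    gen_feat = reward_model(gen_latents)  # (batch, feat_dim) identity features
    ref_feat = reward_model(ref_latents)
    return F.cosine_similarity(gen_feat, ref_feat, dim=-1)  # per sample, in [-1, 1]

# Feedback sketch: turn the reward into a loss so fine-tuning prefers outputs
# that keep each subject's identity while motion control stays in effect.
gen = torch.randn(4, 128)  # pooled latents of generated frames (assumed shape)
ref = torch.randn(4, 128)  # pooled latents of the reference subject
loss = (1.0 - latent_identity_reward(gen, ref)).mean()
```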