DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Songtao Zhao, Qian He, Xiangwang Hou
2026-02-26
Summary
This paper introduces DreamID-Omni, a new system for generating realistic audio and video together, with a focus on videos featuring people. It aims to give users fine-grained control over the result: who appears in the video, what they say, and even what their voice sounds like.
What's the problem?
Current AI systems that generate audio and video usually treat related tasks as separate problems: creating a video from a description, editing an existing video, or animating a person to match audio. This makes it hard to build a single system that does all of them well. Another major challenge is keeping people straight when several appear in one video: the AI should not mix up their identities or voices, and users should be able to control each person's characteristics independently.
What's the solution?
DreamID-Omni tackles these issues with three key ideas. First, it uses a neural network design called a 'Symmetric Conditional Diffusion Transformer' to combine different types of instructions (text, audio, and video) in a unified way. Second, it employs a 'Dual-Level Disentanglement' strategy that keeps identities and voices separate at two levels: at the low (signal) level, it controls how the model's attention binds audio to the matching person on screen, and at the high (semantic) level, it spells out which attributes belong to which person. Finally, it uses a training method called 'Multi-Task Progressive Training' that helps the AI learn all these different tasks without getting confused or overspecialized.
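The low-level binding idea (called Synchronized RoPE in the paper) can be illustrated with a toy sketch. The paper does not publish implementation details, so everything below — the dimensions, the position values, and the pairing of one audio token with one video token per speaker — is an illustrative assumption. The core point is that rotary position embeddings rotate query and key vectors by an angle that depends on their position index; if a speaker's audio tokens and video tokens share the same index, the rotations cancel in the attention dot product, so matched audio-video pairs keep their full similarity while mismatched pairs are attenuated.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding: rotate vector halves by
    position-dependent angles (standard RoPE formulation)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)     # per-dimension frequencies
    angles = positions[:, None] * freqs[None, :]  # (n_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Hypothetical setup: two speakers, each contributing one video-token query
# and one audio-token key. Using identical content for query and key makes
# the binding effect easy to see.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8))   # video-token queries, one per speaker
k = q.copy()                      # audio-token keys

# Synchronized positions: each speaker's audio and video tokens share ONE
# position index, so same-speaker pairs have zero relative rotation.
speaker_pos = np.array([0.0, 100.0])
scores = rope(q, speaker_pos) @ rope(k, speaker_pos).T

# Diagonal (same speaker): dot product preserved, since rotation by the
# same angle is orthogonal. Off-diagonal (cross speaker): attenuated by
# the relative rotation between positions 0 and 100.
print(np.round(scores, 2))
```

The rigid part of the binding comes from this being a property of the attention geometry itself, not a learned preference: no matter how training goes, same-speaker audio-video pairs always see zero relative rotation.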
Why it matters?
This research is important because it represents a significant step forward in creating AI that can generate high-quality, controllable audio and video. DreamID-Omni outperforms existing systems, even some commercial ones, and the researchers are making the code publicly available, which will help accelerate further development in this field and potentially lead to more realistic and versatile video creation tools.
Abstract
Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
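The semantic-level half of the disentanglement, Structured Captions, amounts to writing prompts that explicitly map attributes to subjects instead of leaving the binding implicit. The paper does not specify its caption template, so the tag format below ([person1 | ...]) and the helper function are purely illustrative assumptions; the sketch only shows the general idea of an explicit attribute-subject mapping.

```python
def structured_caption(subjects):
    """Render an explicit subject-to-attribute mapping as a caption string.

    `subjects` is a list of attribute dicts, one per person. The bracketed
    tag format used here is a hypothetical example, not the paper's actual
    template.
    """
    parts = []
    for i, attrs in enumerate(subjects, start=1):
        body = ", ".join(f"{key}: {value}" for key, value in attrs.items())
        parts.append(f"[person{i} | {body}]")
    return " ".join(parts)

caption = structured_caption([
    {"appearance": "woman in a red coat", "voice": "bright, higher-pitched"},
    {"appearance": "older man with glasses", "voice": "low, gravelly"},
])
print(caption)
```

Compared with a free-form caption like "a woman and an older man talking, one voice bright and one gravelly", this structure removes the ambiguity of which voice belongs to which person — which is exactly the speaker-confusion failure the strategy targets.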