
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He

2026-01-06


Summary

This paper introduces a new method, DreamID-V, for creating realistic video face swaps. It aims to take the quality of still image face swaps and apply it to videos, which is much harder because videos have motion and changing conditions.

What's the problem?

Existing video face swapping techniques often struggle to create swaps that look truly convincing. They have trouble maintaining a consistent identity throughout the video, accurately copying the way the face moves and expresses emotions, and blending the swapped face seamlessly with the original video's lighting and background. Basically, the swaps often look unnatural or glitchy, and it's hard to make them look like they really belong in the video.

What's the solution?

The researchers developed DreamID-V, a system built on a few key ideas. First, they created a better way to prepare training data, called SyncID-Pipe, which builds paired examples so the system can learn identities with direct supervision. Then, they built the framework on a Diffusion Transformer, a network architecture that is good at combining different kinds of information, with a "Modality-Aware Conditioning" module that feeds in facial identity and video motion through separate pathways. They also added two techniques to make the swaps look more realistic and consistent over time: a training curriculum that starts on synthetic paired data and gradually shifts to real video, and a reinforcement-learning step that rewards the model for keeping the identity stable across frames. Finally, they created a new benchmark dataset, IDBench-V, to better test and compare face swapping methods.
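To make the "Modality-Aware Conditioning" idea more concrete, here is a minimal, hypothetical sketch of how one block of a Diffusion Transformer could inject an identity embedding through its own pathway, separate from the timestep modulation and the video tokens. The paper does not publish this code; every class name, dimension, and layer choice below is an assumption made for illustration only.

```python
# Hypothetical sketch (not the authors' code) of a DiT block with
# modality-aware conditioning: identity and timestep conditions enter
# through separate adapters. All names and dimensions are assumed.
import torch
import torch.nn as nn

class ModalityAwareDiTBlock(nn.Module):
    def __init__(self, dim: int = 1024, n_heads: int = 16, id_dim: int = 512):
        super().__init__()
        self.id_proj = nn.Linear(id_dim, dim)   # maps a face-ID embedding into token space
        self.t_proj = nn.Linear(dim, 2 * dim)   # scale/shift from the diffusion timestep embedding
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, id_embed, t_embed):
        # video_tokens: (B, N, dim) spatio-temporal latent tokens of the target clip
        # id_embed:     (B, id_dim) source-identity embedding (e.g. from a face recognizer)
        # t_embed:      (B, dim)    diffusion timestep embedding
        scale, shift = self.t_proj(t_embed).chunk(2, dim=-1)
        x = self.norm1(video_tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        x = video_tokens + self.self_attn(x, x, x, need_weights=False)[0]

        # Identity gets its own cross-attention pathway, so the block can treat
        # "who the face is" separately from "how the video moves".
        id_tokens = self.id_proj(id_embed).unsqueeze(1)  # (B, 1, dim)
        x = x + self.cross_attn(self.norm2(x), id_tokens, id_tokens, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))
```

The point of the separate pathways is that the model can learn to pull identity from one condition and motion, lighting, and background from another, rather than mixing them into a single conditioning signal.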

Why it matters?

This research is important because it significantly improves the quality of video face swaps. Better face swapping technology has many potential applications, from creating special effects in movies and video games to allowing for more realistic virtual avatars and potentially even helping people protect their privacy online. The new dataset also provides a valuable resource for other researchers working in this field, allowing them to develop and evaluate even better face swapping techniques.

Abstract

Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-modal conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, which can be seamlessly adapted to various swap-related tasks.
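As a rough illustration of the Identity-Coherence Reinforcement Learning idea mentioned above, the sketch below shows one plausible form such a reward could take: score the swapped frames by how close their face embeddings are to the source identity, and penalize identity drift between consecutive frames. This is not the paper's formulation; the tensor shapes, the use of a frozen face-recognition encoder, and the exact reward formula are assumptions.

```python
# Hypothetical sketch (not the paper's implementation) of an
# identity-coherence reward. Frame embeddings are assumed to come from
# a frozen face-recognition encoder applied to each generated frame.
import torch
import torch.nn.functional as F

def identity_coherence_reward(frame_embeds: torch.Tensor,
                              source_embed: torch.Tensor) -> torch.Tensor:
    """frame_embeds: (T, D) ID embeddings of the T swapped frames.
    source_embed:  (D,)   ID embedding of the source face."""
    # Reward frames whose identity matches the source...
    id_sim = F.cosine_similarity(frame_embeds, source_embed.unsqueeze(0), dim=-1)   # (T,)
    # ...and penalize identity drift between consecutive frames.
    drift = 1.0 - F.cosine_similarity(frame_embeds[1:], frame_embeds[:-1], dim=-1)  # (T-1,)
    return id_sim.mean() - drift.mean()
```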