
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers

Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia

2025-01-08


Summary

This paper introduces Magic Mirror, a new AI system that generates videos of specific people from text descriptions while keeping each person's identity consistent and their movements natural-looking.

What's the problem?

Current AI systems that generate videos from text can produce impressive motion, but they struggle to keep a person's identity consistent throughout a video while still making the movements look natural. Some existing methods also need to be fine-tuned separately for each new person, which takes a lot of time and computing effort.

What's the solution?

The researchers created Magic Mirror, which uses three main parts to solve this problem: 1) a facial feature extractor with two branches, one capturing who the person is (their identity) and one capturing how their face is structured, 2) a lightweight adapter that efficiently injects this face information into the video-generation model, and 3) a two-stage training method that first uses synthetic identity pairs and then real video data to teach the AI. A rough sketch of how the face information could steer the video model is shown below.
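To make the idea of "combining face information with the video-making process" more concrete, here is a minimal, hypothetical sketch of a conditioned adaptive normalization layer, in which a pooled face embedding rescales and shifts the video tokens inside a diffusion transformer block. All names, dimensions, and the module structure are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ConditionedAdaptiveNorm(nn.Module):
    """Illustrative sketch: modulate transformer activations with a face
    embedding via learned per-channel scale and shift, in the spirit of
    adaptive LayerNorm. Names and dimensions are assumptions, not the
    paper's exact module."""

    def __init__(self, hidden_dim: int, id_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Small MLP maps the identity embedding to per-channel scale and shift.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(id_dim, 2 * hidden_dim),
        )

    def forward(self, x: torch.Tensor, id_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, hidden_dim) video tokens inside a transformer block
        # id_emb: (batch, id_dim) pooled facial identity embedding
        scale, shift = self.to_scale_shift(id_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


if __name__ == "__main__":
    block = ConditionedAdaptiveNorm(hidden_dim=1024, id_dim=512)
    tokens = torch.randn(2, 4096, 1024)  # stand-in video tokens
    face = torch.randn(2, 512)           # stand-in identity embedding
    print(block(tokens, face).shape)     # torch.Size([2, 4096, 1024])
```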

Why it matters?

This matters because it could make creating personalized videos much easier and more realistic. It could be used in movies, video games, or social media to create videos of specific people doing things they've never actually done, while still looking natural. This technology could change how we create and consume video content, opening up new possibilities for entertainment, education, and communication.

Abstract

We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while requiring minimal parameters added. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/
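For readers who want a more concrete picture of components (1) and (2) in the abstract, the sketch below shows one plausible way a dual-branch face extractor and a lightweight cross-modal adapter could be wired together in PyTorch. It is a toy illustration under assumed shapes and module names; the real system would use pretrained face and vision encoders and the paper's own adapter design rather than the stand-in layers shown here.

```python
import torch
import torch.nn as nn

class DualBranchFaceFeatures(nn.Module):
    """Illustrative only: one branch yields a compact identity vector (who
    the person is), the other a sequence of structural tokens (facial
    layout). Real systems would plug in pretrained face/vision encoders;
    here both branches are stand-in linear projections."""

    def __init__(self, in_dim: int, id_dim: int, struct_dim: int, n_tokens: int):
        super().__init__()
        self.id_branch = nn.Linear(in_dim, id_dim)
        self.struct_branch = nn.Linear(in_dim, struct_dim * n_tokens)
        self.n_tokens = n_tokens
        self.struct_dim = struct_dim

    def forward(self, face_feat: torch.Tensor):
        id_emb = self.id_branch(face_feat)                         # (B, id_dim)
        struct = self.struct_branch(face_feat)                     # (B, n_tokens * struct_dim)
        struct = struct.view(-1, self.n_tokens, self.struct_dim)   # (B, n_tokens, struct_dim)
        return id_emb, struct


class CrossModalAdapter(nn.Module):
    """Lightweight adapter: video tokens attend to facial structure tokens,
    added as a residual so the backbone is only lightly perturbed."""

    def __init__(self, hidden_dim: int, struct_dim: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(struct_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, struct_tokens: torch.Tensor):
        kv = self.proj(struct_tokens)
        out, _ = self.attn(video_tokens, kv, kv)
        return video_tokens + out  # residual injection of face structure


if __name__ == "__main__":
    extractor = DualBranchFaceFeatures(in_dim=768, id_dim=512, struct_dim=256, n_tokens=16)
    adapter = CrossModalAdapter(hidden_dim=1024, struct_dim=256)
    face_feat = torch.randn(2, 768)            # stand-in pooled face features
    video_tokens = torch.randn(2, 4096, 1024)  # stand-in video tokens
    id_emb, struct_tokens = extractor(face_feat)
    # id_emb would feed an identity-conditioned normalization like the one sketched above.
    print(adapter(video_tokens, struct_tokens).shape)  # torch.Size([2, 4096, 1024])
```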