
CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers

D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu

2025-02-11


Summary

This paper introduces CustomVideoX, a new AI system that creates personalized videos from a single reference image. It's designed to produce videos that are more consistent and higher quality than what current methods can achieve.

What's the problem?

While AI has gotten really good at creating custom images, making personalized videos is still tough. The main issues are keeping the video consistent throughout and maintaining high quality. Current methods often struggle with these aspects, especially when trying to create a video based on just one reference image.

What's the solution?

The researchers created CustomVideoX, which uses several clever techniques to improve video generation. It uses something called 3D Reference Attention to let the AI relate the reference image directly to every frame of the video, across both space and time. They also added features like Time-Aware Reference Attention Bias and Entity Region-Aware Enhancement to make sure the generated video stays true to the reference image without being dominated by it. To test their system, they created a new benchmark called VideoBench, with over 50 objects and 100 prompts.
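To make the core idea concrete, here is a minimal NumPy sketch of what "reference attention with a time-aware bias" could look like: reference-image tokens are concatenated with the video tokens so every frame attends to the reference directly, and a timestep-dependent bias scales how strongly the reference is weighted. This is an illustration under stated assumptions, not the paper's implementation; the function names, shapes, and the logarithmic bias schedule are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(video_q, video_kv, ref_kv, t, T, d=64):
    # video_q:  (F*N, d) queries from all video frames (flattened over time)
    # video_kv: (F*N, d) video tokens; ref_kv: (M, d) reference-image tokens
    # Concatenating reference tokens with video tokens lets every frame
    # attend to the reference in one pass (the "3D" = spatial + temporal idea).
    kv = np.concatenate([video_kv, ref_kv], axis=0)
    logits = video_q @ kv.T / np.sqrt(d)
    # Hypothetical time-aware bias: the added term is strongly negative for
    # small t (weak reference influence) and approaches 0 as t -> T.
    bias = np.zeros(kv.shape[0])
    bias[video_kv.shape[0]:] = np.log(1e-6 + t / T)
    return softmax(logits + bias, axis=-1) @ kv
```

In a real diffusion transformer the bias would be tuned per denoising step; here it simply interpolates the reference weighting over a toy schedule.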

Why it matters?

This matters because it could make creating personalized videos much easier and more accessible. Imagine being able to turn a single photo into a high-quality video that matches your vision. This could be huge for fields like advertising, entertainment, and education, where custom video content is valuable but often expensive and time-consuming to produce. It's a big step towards making AI-generated videos more practical and useful in everyday applications.

Abstract

Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.