
Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov

2025-01-13


Summary

This paper introduces Video Alchemist, a new AI system that can create personalized videos featuring multiple specific people, objects, or backgrounds without needing time-consuming fine-tuning for each new subject.

What's the problem?

Current video personalization methods are limited. They often work only for narrow domains (such as faces), can handle only one subject at a time, or require time-consuming optimization for each new subject. On top of that, it is hard to collect good paired training data and to evaluate how well these systems actually work.

What's the solution?

The researchers created Video Alchemist, which can personalize multiple subjects and open-set objects or backgrounds at once, with no per-subject fine-tuning. It is built on a Diffusion Transformer module that fuses each reference image with its corresponding subject-level text description using cross-attention layers. Because paired datasets of reference images and videos are extremely hard to collect, they also developed an automatic data-construction pipeline: they sample frames from existing videos as reference images and apply extensive image augmentations, so the model learns subject identity instead of just copying pixels from the references. Finally, they created a new benchmark to measure how faithfully the generated videos preserve each subject.
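To make the fusion idea concrete, here is a minimal sketch of cross-attention conditioning: video latent tokens attend over a pool of per-subject tokens formed by concatenating that subject's reference-image embeddings with its subject-level text embeddings. All names, shapes, and dimensions below are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # queries:     (Nq, d) video latent tokens
    # keys_values: (Nk, d) fused subject tokens (image + text embeddings)
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # scaled dot-product scores
    return softmax(scores, axis=-1) @ keys_values   # attention-weighted mix

rng = np.random.default_rng(0)
d = 16                                       # illustrative embedding size
video_tokens = rng.normal(size=(8, d))       # stand-in for video latents
subjects = []
for _ in range(2):                           # two personalized subjects
    img_tokens = rng.normal(size=(4, d))     # reference-image embeddings
    txt_tokens = rng.normal(size=(3, d))     # subject-level text embeddings
    subjects.append(np.concatenate([img_tokens, txt_tokens], axis=0))
cond = np.concatenate(subjects, axis=0)      # all subjects condition jointly
out = cross_attention(video_tokens, cond)
print(out.shape)  # (8, 16)
```

In the real model this would be one of many learned attention layers with projection weights inside a Diffusion Transformer block; the sketch only shows how multiple subjects' image and text tokens can be pooled into a single conditioning set.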

Why it matters?

Video Alchemist matters because it makes creating personalized videos much easier and more flexible. It can include multiple specific people or objects in new settings without needing lots of extra work. This could be really useful for things like making custom videos for social media, education, or entertainment. The new testing method they created will also help improve future AI video systems. Overall, this technology could make personalized video content more accessible and diverse.

Abstract

Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist - a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
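The data-construction idea described above (sampling training frames as references and augmenting them so the model cannot simply copy pixels) can be sketched with a few standard image transforms. The specific augmentations and parameters below are hypothetical examples, not the paper's actual pipeline.

```python
import numpy as np

def augment_reference(frame, rng):
    # frame: (H, W, 3) uint8 reference image sampled from a training video.
    h, w, _ = frame.shape
    # Random crop to 80% of each spatial dimension.
    ch, cw = int(h * 0.8), int(w * 0.8)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    out = frame[y:y + ch, x:x + cw]
    # Random horizontal flip.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # Brightness jitter: scale pixel values by a random factor.
    out = np.clip(out.astype(np.float32) * rng.uniform(0.8, 1.2), 0, 255)
    return out.astype(np.uint8)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
aug = augment_reference(frame, rng)
print(aug.shape)  # (51, 51, 3)
```

The point of such augmentations is that the reference image no longer matches the target video frame pixel-for-pixel, so denoising the training clip forces the model to rely on subject identity rather than copy-pasting, which is what lets it generalize to new contexts.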