Phantom: Subject-consistent video generation via cross-modal alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu

2025-02-19

Phantom: Subject-consistent video generation via cross-modal alignment

Summary

This paper introduces Phantom, a new AI system that creates videos featuring specific people or objects from both text descriptions and reference images. It's like a smart video editor that can take a photo of someone and generate a video of them doing whatever you describe in words.

What's the problem?

Current AI video generation systems are good at creating videos from text descriptions or at animating a single image, but they struggle to combine both inputs effectively. This makes it hard to create videos that feature specific people or objects while also following detailed instructions about what should happen in the video.

What's the solution?

The researchers created Phantom, which uses a special method to understand text and images together. They redesigned how the AI processes these inputs and trained it on matched sets of text, images, and videos (text-image-video triplets). Phantom can then generate videos that preserve the appearance of the people or objects in the reference images while following the actions described in the text. It works for both a single subject and multiple subjects in one video.
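The paper does not publish implementation details here, but the idea of "balancing dual-modal prompts" can be sketched in a hedged way: one plausible reading of joint text-image injection is that text tokens and reference-image tokens are concatenated into a single conditioning sequence that the video generator cross-attends to. The function names and shapes below are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

def joint_condition(text_tokens: np.ndarray, image_tokens: np.ndarray) -> np.ndarray:
    """Hypothetical joint injection: concatenate the dual-modal prompts
    (text and reference-image tokens) along the sequence axis."""
    return np.concatenate([text_tokens, image_tokens], axis=0)

def cross_attention(video_tokens: np.ndarray, cond_tokens: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product cross-attention (simplified sketch):
    each video token attends over the joint text+image conditioning."""
    d = video_tokens.shape[-1]
    scores = video_tokens @ cond_tokens.T / np.sqrt(d)
    # numerically stable softmax over the conditioning axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ cond_tokens

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 64))    # 8 text-prompt tokens (assumed sizes)
ref = rng.normal(size=(16, 64))    # 16 reference-image tokens
video = rng.normal(size=(32, 64))  # 32 video latent tokens

cond = joint_condition(text, ref)   # (24, 64) joint dual-modal prompt
out = cross_attention(video, cond)  # (32, 64) video tokens informed by both modalities
print(out.shape)
```

The point of the sketch is only that both modalities condition the video tokens simultaneously through one attention pathway, which is one way a model could be "driven to learn cross-modal alignment" from text-image-video triplet data.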

Why it matters?

This matters because it opens up new possibilities for video creation. People could use this technology to make personalized videos featuring themselves or specific objects without needing advanced video editing skills. It could be used in entertainment, education, or even for creating custom visual content for businesses. The ability to generate consistent, customized videos based on simple inputs could make video production more accessible and creative for everyone.

Abstract

The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent video through textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is at https://phantom-video.github.io/Phantom/.