ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation
Ziyang Mai, Yu-Wing Tai
2025-12-17
Summary
This paper introduces a new method called ContextAnyone for creating videos from text descriptions and a single reference image, focusing on keeping the character in the video looking consistent throughout.
What's the problem?
Turning text into video while personalizing it to a specific person is difficult because existing methods focus mainly on the face and often lose other important details such as hair, clothing, and body type. This leads to videos where the character appears to change from scene to scene, breaking the illusion of a single, consistent person.
What's the solution?
ContextAnyone tackles this by looking at the entire reference image, not just the face, and jointly reconstructing that image while generating the new video frames, so the model is forced to use all of the reference details. An 'Emphasize-Attention module' reinforces the important features from the reference image and prevents the character's appearance from drifting across frames. The method also organizes positional information so that reference and video content stay separated, which helps the video remain consistent over time, and it trains with a loss function that rewards both realistic frames and faithfulness to the reference image.
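The training signal described above combines two objectives: the usual diffusion denoising loss on the generated frames, plus a term that asks the model to reconstruct the reference image. A minimal sketch of such a dual-guidance loss is shown below; the function name, tensor shapes, and the weighting factor `lambda_ref` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_guidance_loss(pred_video_noise: torch.Tensor,
                       true_video_noise: torch.Tensor,
                       recon_ref: torch.Tensor,
                       ref_image: torch.Tensor,
                       lambda_ref: float = 0.5) -> torch.Tensor:
    # Standard diffusion objective: predict the noise added to the video tokens.
    diffusion_loss = F.mse_loss(pred_video_noise, true_video_noise)
    # Reference-reconstruction objective: reproducing the reference image
    # forces the model to actually encode its appearance cues
    # (hairstyle, outfit, body shape), not just facial identity.
    ref_loss = F.mse_loss(recon_ref, ref_image)
    # lambda_ref balances the two terms (hypothetical weighting).
    return diffusion_loss + lambda_ref * ref_loss
```

In this sketch the two terms are simple mean-squared errors; the paper may use a different balance or parameterization, but the structure (denoising loss plus reference-reconstruction loss) follows the description above.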
Why it matters?
This research is important because it makes it possible to create more believable and engaging videos from text. By maintaining consistent character identities, the videos feel more natural and less jarring, which is crucial for applications like storytelling, personalized content creation, and even virtual reality.
Abstract
Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: https://github.com/ziyang1106/ContextAnyone
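The Gap-RoPE idea mentioned in the abstract, separating reference tokens from video tokens in the rotary positional embedding, can be illustrated with a small sketch. The code below is a minimal, assumed implementation of standard 1-D RoPE with an index gap inserted between the two token streams; the function names, the `gap` size, and the use of 1-D (rather than multi-axis) positions are illustrative assumptions, not the paper's exact design.

```python
import torch

def build_positions(num_ref_tokens: int, num_video_tokens: int,
                    gap: int = 64) -> torch.Tensor:
    # Reference tokens occupy positions [0, num_ref_tokens); video tokens
    # start after an extra index gap, so the rotary phases of the two
    # streams never overlap (hypothetical gap size).
    ref_pos = torch.arange(num_ref_tokens)
    vid_pos = torch.arange(num_video_tokens) + num_ref_tokens + gap
    return torch.cat([ref_pos, vid_pos])

def rope(x: torch.Tensor, pos: torch.Tensor,
         base: float = 10000.0) -> torch.Tensor:
    # Standard 1-D rotary embedding: rotate each (x1, x2) feature pair
    # by an angle proportional to the token's position index.
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = pos[:, None].float() * freqs[None, :]      # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

The gap keeps the video tokens' relative positions intact (temporal modeling is undisturbed) while making the reference tokens positionally distinct, which matches the abstract's claim that separating the two stabilizes temporal modeling.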