LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu

2026-03-23

Summary

This paper introduces LumosX, a new system for creating personalized videos from text descriptions, focusing on making sure faces and attributes stay consistent when multiple people are in the video.

What's the problem?

Current text-to-video technology is getting good at generating videos from text, but it struggles to keep facial features and attributes consistent when multiple people appear in a scene. For example, if you ask for 'Alice smiling and Bob frowning,' existing systems might subtly alter Alice's face from frame to frame, or fail to keep Bob's frown on Bob. They lack an explicit way to link each attribute to a specific person throughout the video.

What's the solution?

The researchers tackled this problem in two main ways. First, they built a special dataset by combining captions and visual information from existing videos, then used multimodal large language models to infer the relationships between the people in those videos. This gives the system a better understanding of who's who and which attributes belong to whom. Second, they designed a new model with a mechanism called 'Relational Attention,' which helps the system attend to the correct person when applying a specific attribute. By tying facial features and attributes to specific individuals, this mechanism keeps each person's identity and expression consistent across the video.
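To make the idea concrete, here is a minimal sketch of how attribute tokens could be restricted to attend only to their own subject's tokens. The function name, the group layout, and the boolean-mask convention are illustrative assumptions for exposition, not the paper's actual implementation:

```python
# Hypothetical sketch of the relational-attention masking idea:
# an attribute token may only attend to tokens of the subject it
# belongs to, while unassigned (e.g. background) tokens attend freely.

def relational_attention_mask(groups, seq_len):
    """Build a seq_len x seq_len boolean mask where mask[i][j] == True
    means 'token i may attend to token j'.

    groups: list of lists; each inner list holds the token indices
            (identity + attribute tokens) of one subject.
    """
    # Map each grouped token to its subject id.
    owner = {}
    for gid, token_ids in enumerate(groups):
        for t in token_ids:
            owner[t] = gid

    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            gi, gj = owner.get(i), owner.get(j)
            # Allow attention when either token is unassigned, or when
            # both tokens belong to the same subject group.
            mask[i][j] = gi is None or gj is None or gi == gj
    return mask

# Two subjects: tokens {0, 1} belong to subject A, tokens {2, 3} to
# subject B; token 4 is a shared background token.
mask = relational_attention_mask([[0, 1], [2, 3]], seq_len=5)
```

In a real diffusion transformer, such a mask would be combined with the attention logits so that, for instance, a 'frowning' attribute token can only influence the subject it was assigned to, which is the intra-group cohesion and inter-group separation the abstract describes.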

Why it matters?

This work is important because it significantly improves the quality and realism of personalized video generation, especially when multiple people are involved. It allows for more precise control over the video content and makes it possible to create videos where individuals maintain their unique identities and expressions throughout, opening up possibilities for more creative and customized video content.

Abstract

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.