AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang, Jiasun Feng, Lupeng Liu, Xiangyi Wang, Yanjia Li, Yuqing She, Ying Qin, Huan Li, Shuiyang Mao, Wei Liu, Wenhan Luo
2025-12-01
Summary
This paper introduces AnyTalker, a system for generating realistic videos of multiple people talking, driven entirely by audio.
What's the problem?
Making videos of multiple people interacting is hard: collecting diverse footage of groups is expensive and time-consuming, and existing methods struggle both to create believable interactions between people and to accurately control each person's identity when generating video from audio.
What's the solution?
AnyTalker addresses this with an extensible multi-stream Diffusion Transformer: its attention blocks process each identity-audio pair in turn, so each person's identity and audio are handled separately while still interacting within the shared video, and the number of drivable identities can grow arbitrarily. The model is trained almost entirely on videos of *single* people talking, then fine-tuned with only a small amount of real multi-person footage to make the interactions look natural. The authors also created a new metric and dataset for measuring how natural and interactive the generated multi-person videos look. A rough sketch of the attention idea follows.
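The paper's own code is not part of this summary, but the core idea is easy to illustrate. Below is a minimal PyTorch-style sketch of an identity-aware cross-attention block that loops over identity-audio pairs and writes each identity's audio-conditioned update onto only that identity's video tokens; every name (`IdentityAwareAttention`, `id_masks`) and every shape here is a hypothetical assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityAwareAttention(nn.Module):
    """Hypothetical sketch: cross-attend video tokens to each identity's
    audio stream, one identity-audio pair at a time, so the number of
    drivable identities is not baked into the architecture."""

    def __init__(self, dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(audio_dim, dim)
        self.to_v = nn.Linear(audio_dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, audio_feats, id_masks):
        # x:           (B, N, dim)  video latent tokens
        # audio_feats: list of (B, M, audio_dim), one stream per identity
        # id_masks:    list of (B, N) booleans, True on the tokens that
        #              belong to that identity (e.g. its face region)
        q = self.to_q(x)
        out = torch.zeros_like(x)
        for audio, mask in zip(audio_feats, id_masks):
            k, v = self.to_k(audio), self.to_v(audio)
            # standard multi-head cross-attention for this single pair
            attn = F.scaled_dot_product_attention(
                self._split(q), self._split(k), self._split(v))
            # write the update only onto this identity's tokens
            out = out + self._merge(attn) * mask.unsqueeze(-1).to(x.dtype)
        return x + self.proj(out)

    def _split(self, t):  # (B, L, D) -> (B, H, L, D/H)
        b, n, d = t.shape
        return t.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

    def _merge(self, t):  # (B, H, L, D/H) -> (B, L, D)
        b, h, n, d = t.shape
        return t.transpose(1, 2).reshape(b, n, h * d)

# Any number of speakers works without changing the block's parameters:
block = IdentityAwareAttention(dim=64, audio_dim=32)
x = torch.randn(1, 16, 64)                           # 16 video tokens
audios = [torch.randn(1, 10, 32) for _ in range(3)]  # 3 audio streams
masks = [torch.rand(1, 16) > 0.5 for _ in range(3)]  # 3 token masks
y = block(x, audios, masks)                          # (1, 16, 64)
```

Because the block iterates over pairs rather than fixing a speaker count in its weights, the same parameters can drive two, three, or more identities, which is what makes the identity count extensible.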
Why it matters?
This research matters because it makes generating realistic multi-person videos much easier and cheaper, which could be useful for virtual meetings, film effects, or personalized video content without filming real people in those scenarios. It also strikes a good balance between the amount of data required and the number of different identities the model can drive.
Abstract
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of collecting diverse multi-person data and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Moreover, while training multi-person generative models typically demands massive amounts of multi-person data, our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
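The two-stage training recipe in the abstract (large-scale pretraining on single-person clips, then brief interactivity refinement on a few real multi-person clips) can be written as a toy loop. Everything below (the stand-in linear model, the synthetic data, the loss, the batch sizes, and the learning rates) is an illustrative assumption rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the denoiser; the real model is a Diffusion
# Transformer with identity-aware attention.
model = nn.Linear(64, 64)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def diffusion_loss(video, audio):
    # placeholder for the real denoising objective
    return (model(video) - audio).pow(2).mean()

# Stage 1: learn speaking patterns from abundant single-person clips only.
single_person = TensorDataset(torch.randn(100, 64), torch.randn(100, 64))
for video, audio in DataLoader(single_person, batch_size=8, shuffle=True):
    loss = diffusion_loss(video, audio)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: refine interactivity on only a few real multi-person clips,
# at a lower (assumed) learning rate so stage-1 knowledge is preserved.
for group in opt.param_groups:
    group["lr"] = 1e-5
multi_person = TensorDataset(torch.randn(8, 64), torch.randn(8, 64))
for video, audio in DataLoader(multi_person, batch_size=2, shuffle=True):
    loss = diffusion_loss(video, audio)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the split is the data budget: the expensive large-scale stage needs only single-person footage, while the scarce multi-person clips are used solely to refine interaction quality.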