SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
2025-07-15
Summary
This paper introduces SpeakerVid-5M, a large-scale, high-quality dataset built for generating virtual humans that can hold two-person (dyadic) conversations through both audio and video.
What's the problem?
Building realistic virtual humans that can engage in natural back-and-forth audio-visual conversation is hard, largely because existing datasets are too small, too homogeneous, or too low-quality to train AI models to understand and generate such interactions.
What's the solution?
The researchers collected and curated SpeakerVid-5M, which contains millions of diverse clips of people interacting through video and sound, organized by interaction type and data quality. They also released a baseline autoregressive model that generates interactive video chat, together with benchmark metrics for measuring how well such models perform.
Why does it matter?
SpeakerVid-5M can help AI systems that create virtual humans become more natural and realistic in conversation, with applications in entertainment, education, customer service, and beyond.
Abstract
SpeakerVid-5M is a large-scale, high-quality dataset for audio-visual dyadic interactive virtual human generation, featuring diverse interaction types and data quality levels, along with an autoregressive video chat baseline and benchmark metrics.