TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang

2025-09-01

Summary

This paper focuses on creating realistic talking head videos from audio, but points out that current technology doesn't work equally well for everyone. It introduces a new dataset called TalkVid designed to fix this issue.

What's the problem?

Existing 'talking head' technology, while visually impressive, struggles to create realistic videos for people of different ethnicities and ages, and for speakers of different languages. This happens because the data used to train these systems isn't diverse enough: it lacks examples from a wide range of people and isn't always high quality. Essentially, the AI learns to make faces that look real for *some* people, but not for everyone.

What's the solution?

The researchers created TalkVid, a massive new dataset of over 1200 hours of video featuring nearly 8000 different people. They didn't just collect videos randomly; they used a careful process to ensure the videos were stable, looked good, and showed clear facial details. They also built a special test set, TalkVid-Bench, that's specifically designed to measure how well the technology works across different groups of people. They then showed that AI models trained on TalkVid perform better and are more reliable than those trained on older datasets.
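The multi-stage filtering described above can be pictured as a gated pipeline: each clip must clear every quality check to be kept. The sketch below is a hypothetical illustration of this idea; the score names, thresholds, and helper functions are assumptions for demonstration, not the paper's actual criteria or values.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """A candidate video clip with precomputed quality scores (illustrative)."""
    clip_id: str
    motion_stability: float   # higher = steadier head/camera motion
    aesthetic_quality: float  # higher = better overall visual quality
    facial_detail: float      # higher = sharper facial region

def passes_all_stages(clip: Clip,
                      min_stability: float = 0.7,
                      min_aesthetics: float = 0.5,
                      min_detail: float = 0.6) -> bool:
    """A clip survives curation only if it clears every filter stage.
    Thresholds here are made up for illustration."""
    return (clip.motion_stability >= min_stability
            and clip.aesthetic_quality >= min_aesthetics
            and clip.facial_detail >= min_detail)

def curate(clips: list[Clip]) -> list[Clip]:
    """Keep only the clips that pass all quality gates."""
    return [c for c in clips if passes_all_stages(c)]

clips = [
    Clip("a", 0.9, 0.8, 0.7),  # clears every stage
    Clip("b", 0.4, 0.9, 0.9),  # rejected: unstable motion
    Clip("c", 0.8, 0.6, 0.3),  # rejected: low facial detail
]
print([c.clip_id for c in curate(clips)])  # prints ['a']
```

Chaining independent filters like this keeps each criterion simple to validate in isolation, which matches the paper's point that the pipeline was checked stage by stage against human judgments.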

Why it matters?

This work is important because it highlights the bias that can creep into AI systems when the data they're trained on isn't representative of the real world. By creating a more diverse dataset and a better way to test these systems, the researchers are helping to ensure that 'talking head' technology can be used fairly and accurately for everyone, not just a select few. It also provides a valuable resource for future research in this area.

Abstract

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid
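The stratified evaluation set described in the abstract can be sketched as equal-size sampling per subgroup along an axis, so no group dominates aggregate metrics. The snippet below is a minimal sketch of that idea; the axis name (`age_band`), group labels, and per-group count are illustrative assumptions, not TalkVid-Bench's actual stratification.

```python
import random
from collections import defaultdict

def stratified_sample(clips, key, per_group, seed=0):
    """Group clips by `key` and draw `per_group` clips from each subgroup,
    so every subgroup is equally represented in the benchmark."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    groups = defaultdict(list)
    for clip in clips:
        groups[key(clip)].append(clip)
    sample = []
    for _, members in sorted(groups.items()):
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Illustrative pool: 30 clips evenly spread over three age bands.
clips = [{"id": i, "age_band": band}
         for i, band in enumerate(["18-30", "31-50", "51+"] * 10)]
bench = stratified_sample(clips, key=lambda c: c["age_band"], per_group=5)
# Each age band contributes exactly 5 clips, 15 clips total.
```

Reporting metrics per subgroup of such a balanced set, rather than one aggregate number, is what lets the authors surface the performance disparities the abstract mentions.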