SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
2025-07-15
Summary
This paper introduces SpeakerVid-5M, a large-scale, high-quality dataset built for generating virtual humans that can hold two-person (dyadic) conversations through both audio and video.
What's the problem?
Building realistic virtual humans that can engage in natural back-and-forth audio-visual conversation is hard, largely because existing datasets are too small, too homogeneous, or too low-quality to train AI models to understand and generate such interactions.
What's the solution?
The researchers collected and curated SpeakerVid-5M, which contains millions of diverse clips of people interacting through video and sound, organized by interaction type and data quality. They also released a baseline autoregressive model that generates interactive video chat, together with benchmark metrics for measuring how well such models perform.
Why does it matter?
SpeakerVid-5M can help AI systems that create virtual humans become more natural and realistic in conversation, with applications in entertainment, education, customer service, and beyond.
Abstract
SpeakerVid-5M is a large-scale, high-quality dataset for audio-visual dyadic interactive virtual human generation, featuring diverse interaction types and data quality levels, along with an autoregressive video chat baseline and benchmark metrics.