Multi-human Interactive Talking Dataset
Zeyu Zhu, Weijia Wu, Mike Zheng Shou
2025-08-06
Summary
This paper introduces the Multi-human Interactive Talking Dataset (MIT), a large-scale collection of videos showing multiple people conversing, with fine-grained annotations to help train AI models that generate realistic multi-person talking videos.
What's the problem?
Existing datasets focus mostly on single-person talking videos, so models trained on them struggle to generate natural conversations in which multiple people interact at the same time.
What's the solution?
The paper introduces MIT with detailed annotations such as body poses and audio features, and uses it to develop CovOG, a baseline model that combines a Multi-Human Pose Encoder (to represent multiple people's poses) with an Interactive Audio Driver (to inject per-speaker audio cues) for generating realistic videos of people talking interactively.
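To make the two-component design concrete, here is a minimal sketch of how per-person pose features and speaker-weighted audio cues could be fused into a single conditioning vector for a video generator. The component names come from the paper, but all internals (shapes, pooling, weighting) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def multi_human_pose_encoder(poses, W):
    # poses: (num_people, num_joints * 2) flattened 2D keypoints per person.
    # Encode each person's pose, then mean-pool into one scene-level feature.
    # (Hypothetical: the real encoder is likely a learned network.)
    per_person = np.tanh(poses @ W)          # (num_people, d)
    return per_person.mean(axis=0)           # (d,)

def interactive_audio_driver(audio_feats, speaking):
    # audio_feats: (num_people, d) per-person audio embeddings.
    # speaking: (num_people,) soft speaker-activity weights, so the active
    # speaker dominates the audio cue. (Assumed mechanism.)
    w = speaking / (speaking.sum() + 1e-8)
    return (w[:, None] * audio_feats).sum(axis=0)  # (d,)

rng = np.random.default_rng(0)
num_people, joints, d = 3, 17, 8
W = rng.normal(size=(joints * 2, d))
poses = rng.normal(size=(num_people, joints * 2))
audio = rng.normal(size=(num_people, d))
speaking = np.array([0.9, 0.1, 0.0])  # person 0 is currently speaking

scene = multi_human_pose_encoder(poses, W)
cue = interactive_audio_driver(audio, speaking)
# Concatenated vector that a video generator could condition each frame on.
frame_condition = np.concatenate([scene, cue])
print(frame_condition.shape)  # → (16,)
```

The key design point this illustrates is that pose information is pooled across all people while audio is weighted by speaker activity, so the generator receives both "who is where" and "who is talking" signals.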
Why does it matter?
This matters because it helps improve AI's ability to create believable multi-person video conversations, which can be useful for virtual meetings, entertainment, and other interactive applications.
Abstract
MIT is a large-scale dataset for multi-human talking video generation with fine-grained annotations; it is used to demonstrate CovOG, a baseline model integrating a Multi-Human Pose Encoder and an Interactive Audio Driver.