Understanding Co-speech Gestures in-the-wild
Sindhu B Hegde, K R Prajwal, Taein Kwon, Andrew Zisserman
2025-04-01
Summary
This paper is about teaching computers to understand the gestures people make while talking in natural, real-life videos.
What's the problem?
It's difficult for computers to understand the connection between gestures, words, and speech in natural, unscripted videos.
What's the solution?
The researchers developed a new AI system that learns to connect gestures with speech and text, allowing it to understand the meaning of gestures in videos.
Why it matters?
This work matters because it can improve how computers understand human communication, leading to better AI assistants and video analysis tools.
Abstract
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video representation of gestures to solve these tasks. By leveraging a combination of a global phrase contrastive loss and a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
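The abstract names two training signals, a global phrase contrastive loss and a local gesture-word coupling loss, but does not spell out their formulations. As a rough illustration of the first, a symmetric InfoNCE-style objective between gesture-video embeddings and phrase embeddings might look like the sketch below. This is a minimal sketch under the assumption that the global loss belongs to the standard contrastive family; the function and parameter names (`global_contrastive_loss`, `temperature`) are hypothetical and this is not the paper's implementation.

```python
# Minimal sketch of a global phrase-level contrastive loss between
# gesture-video embeddings and text (or speech) phrase embeddings.
# Assumption: an InfoNCE-style symmetric objective, as commonly used for
# multi-modal representation learning; the paper's exact loss may differ.
import torch
import torch.nn.functional as F


def global_contrastive_loss(video_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (gesture clip, phrase) embeddings."""
    # L2-normalise so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Example: a batch of 8 clips with 256-dim embeddings from each encoder.
    video_emb = torch.randn(8, 256)
    text_emb = torch.randn(8, 256)
    print(global_contrastive_loss(video_emb, text_emb).item())
```

The local gesture-word coupling loss described in the abstract would additionally align individual words with temporally local gesture segments, which this global, clip-level sketch does not cover.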