Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani

2025-08-29

Summary

This paper introduces a new artificial intelligence model called Social-MAE, designed to better understand human interactions by analyzing both video and audio at the same time.

What's the problem?

Understanding how people interact requires considering more than just what they say; things like facial expressions, body language, and tone of voice are crucial. Existing AI models often struggle to effectively combine these different types of information, especially when learning from unlabeled data where the AI has to figure things out on its own.

What's the solution?

The researchers built Social-MAE, an extended version of an earlier model called CAV-MAE. They trained it on a large collection of video and audio recordings of people interacting (the VoxCeleb2 dataset), but instead of telling it what was happening, they had it learn by predicting missing parts of the video and audio. This 'fill-in-the-blanks' approach helps the model understand the relationships between what people do and say. They then fine-tuned and tested it on tasks like recognizing emotions, detecting laughter, and estimating someone's apparent personality.
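The core of this 'fill-in-the-blanks' training is random masking: most input tokens (audio spectrogram patches and video frame patches) are hidden, and the model must reconstruct them. The sketch below is not the authors' code, just a minimal illustration of how a masked autoencoder splits tokens into visible and masked sets; the function name and the 75% ratio are illustrative assumptions (75% is a common choice in MAE-style models).

```python
import random

def random_mask(num_tokens, mask_ratio, seed=0):
    """Split token indices into visible and masked sets.

    A masked autoencoder encodes only the visible tokens and asks
    a lightweight decoder to reconstruct the masked ones, so the
    model must learn the structure of the data to fill the gaps.
    """
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])     # to be reconstructed
    visible = sorted(indices[num_masked:])    # fed to the encoder
    return visible, masked

# Example: 16 patch tokens with 75% masked.
visible, masked = random_mask(16, 0.75)
print(len(visible), len(masked))  # 4 12
```

During pre-training, the reconstruction loss is computed only on the masked positions, which is what forces the model to infer missing audio from visible video patches and vice versa.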

Why it matters?

This work is important because it pushes the boundaries of AI's ability to understand social cues. A more accurate understanding of human behavior could lead to improvements in areas like virtual assistants, mental health support, and even creating more realistic characters in video games.

Abstract

Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by finetuning and evaluating it on different social and affective downstream tasks, namely, emotion recognition, laughter detection and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.