EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
2025-06-20
Summary
This paper introduces EmoNet-Voice, a new resource that improves how AI systems detect emotions in speech by providing a large, fine-grained dataset with expert-verified labels built from synthetic, privacy-preserving audio.
What's the problem?
Current speech emotion recognition datasets often lack detailed emotional categories, raise privacy concerns, or rely on acted speech that may not reflect genuine emotions, which limits how accurately AI models can learn to recognize emotions and how well they can be evaluated.
What's the solution?
The researchers created EmoNet-Voice, which combines a large synthetic pre-training dataset, covering thousands of hours of speech across multiple voices, emotions, and languages, with a carefully annotated benchmark dataset verified by psychology experts. The synthetic audio simulates actors portraying specific emotions, enabling fine-grained emotion detection and intensity measurement while protecting speaker privacy; a minimal sketch of what such a labeled clip could look like is shown below.
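To make the "fine-grained emotion plus intensity" idea concrete, here is a minimal Python sketch of how one labeled clip in such a benchmark might be represented. The field names, the three-level intensity scale, and the example values are illustrative assumptions for this summary, not the dataset's actual published schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record layout for one benchmark clip; the fields and the
# low/medium/high intensity scale are assumptions, not the paper's schema.
@dataclass
class EmotionClip:
    audio_path: str        # synthetic, privacy-safe recording
    voice_id: str          # which synthetic voice "acted" the clip
    language: str          # e.g. "en", "de"
    emotion: str           # fine-grained category, e.g. "relief"
    intensity: Literal["low", "medium", "high"]  # perceived emotional intensity
    expert_verified: bool  # True if psychology experts confirmed the label

def high_intensity_clips(clips: list[EmotionClip], emotion: str) -> list[EmotionClip]:
    """Return expert-verified clips of one emotion at high perceived intensity."""
    return [c for c in clips
            if c.expert_verified and c.emotion == emotion and c.intensity == "high"]

if __name__ == "__main__":
    demo = [
        EmotionClip("clip_0001.wav", "voice_03", "en", "relief", "high", True),
        EmotionClip("clip_0002.wav", "voice_07", "de", "relief", "low", True),
    ]
    print(len(high_intensity_clips(demo, "relief")))  # -> 1
```

A schema along these lines is what lets a benchmark test not just which emotion a model hears, but how strongly it is expressed, which is the fine-grained evaluation the summary describes.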
Why it matters?
This matters because better emotion recognition in speech can make AI systems more empathetic and effective in applications like virtual assistants, mental health monitoring, and human-computer interaction, all while respecting user privacy.
Abstract
EmoNet-Voice, a new resource with large pre-training and benchmark datasets, advances speech emotion recognition by offering fine-grained emotion evaluation with synthetic, privacy-preserving audio.