Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions
Huang-Cheng Chou, Chi-Chun Lee
2025-10-08
Summary
This research focuses on how computers can better understand emotions in speech, a field called speech emotion recognition. It challenges the common practice of simplifying emotion analysis by ignoring disagreements between people when labeling emotional speech.
What's the problem?
Currently, when training computers to recognize emotions, researchers rely on people to label recordings with emotions like 'happy' or 'sad'. However, people often disagree on which emotion is being expressed. Traditional methods simply pick the most common label, throwing away valuable information and assuming there is only one 'right' answer. This doesn't reflect how humans actually experience and perceive emotions, which are often complex and overlapping.
What's the solution?
The researchers propose a new approach that embraces this disagreement. Instead of forcing a single label, they keep all the different opinions and represent them as probabilities. They also developed a way to evaluate the system that allows for multiple emotions to be present at the same time, like someone feeling both sad and angry. Finally, they created a method to gently discourage the system from predicting emotion combinations that people rarely agree on. They tested these ideas on four English speech emotion databases and found that these changes improved recognition performance over conventional single-label training.
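The core idea of keeping all opinions as probabilities can be sketched in a few lines. This is a minimal illustration of a soft-label distribution, not the dissertation's exact pipeline; the emotion set and the votes below are hypothetical.

```python
from collections import Counter

# Hypothetical label set and annotator votes for one utterance
# (illustrative only, not taken from the databases in the paper).
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(votes, emotions=EMOTIONS):
    """Turn raw annotator votes into a probability distribution
    instead of collapsing them to a single majority label."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[e] / total for e in emotions]

votes = ["sad", "sad", "angry", "neutral"]
print(soft_label(votes))  # [0.25, 0.0, 0.25, 0.5]
```

Note that the minority votes for 'angry' and 'neutral' survive in the target distribution, whereas majority-vote aggregation would reduce this example to 'sad' alone.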
Why it matters?
This work is important because it makes speech emotion recognition systems more realistic and aligned with how humans actually perceive emotions. By acknowledging the subjectivity of emotion and allowing for multiple, even conflicting, interpretations, the systems become more robust and better at understanding the nuances of human speech. This could lead to more effective applications in areas like mental health support, customer service, and human-computer interaction.
Abstract
Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters who select emotions from predefined categories. However, disagreements among raters are common. Conventional methods treat these disagreements as noise, aggregating labels into a single consensus target. While this simplifies SER as a single-label task, it ignores the inherent subjectivity of human emotion perception. This dissertation challenges such assumptions and asks: (1) Should minority emotional ratings be discarded? (2) Should SER systems learn from only a few individuals' perceptions? (3) Should SER systems predict only one emotion per sample? Psychological studies show that emotion perception is subjective and ambiguous, with overlapping emotional boundaries. We propose new modeling and evaluation perspectives: (1) Retain all emotional ratings and represent them with soft-label distributions. Models trained on individual annotator ratings and jointly optimized with standard SER systems improve performance on consensus-labeled tests. (2) Redefine SER evaluation by including all emotional data and allowing co-occurring emotions (e.g., sad and angry). We propose an "all-inclusive rule" that aggregates all ratings to maximize diversity in label representation. Experiments on four English emotion databases show superior performance over majority and plurality labeling. (3) Construct a penalization matrix to discourage unlikely emotion combinations during training. Integrating it into loss functions further improves performance. Overall, embracing minority ratings, multiple annotators, and multi-emotion predictions yields more robust and human-aligned SER systems.
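One way to picture point (3), the penalization matrix, is as a pairwise penalty added to a multi-label loss. The sketch below is an assumed formulation for illustration: the penalty weights, the `lam` coefficient, and the use of binary cross-entropy are hypothetical choices, not the dissertation's exact loss.

```python
import numpy as np

# Illustrative 4-emotion setup; the weights in P are made up, not the
# matrix constructed in the dissertation.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

# P[i, j] is large when emotions i and j rarely co-occur in human
# ratings (e.g., 'happy' with 'sad'); the diagonal is zero.
P = np.array([
    [0.0, 0.9, 0.3, 0.1],
    [0.9, 0.0, 0.2, 0.9],
    [0.3, 0.2, 0.0, 0.3],
    [0.1, 0.9, 0.3, 0.0],
])

def co_occurrence_penalty(probs, P):
    """Grows when the model assigns high probability to emotion pairs
    that annotators rarely report together."""
    probs = np.asarray(probs, dtype=float)
    return float(probs @ P @ probs / 2.0)  # /2 because P is symmetric

def penalized_bce(probs, targets, P, lam=0.5):
    """Multi-label binary cross-entropy plus the pairwise
    co-occurrence penalty, weighted by lam."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    targets = np.asarray(targets, dtype=float)
    bce = -np.mean(targets * np.log(probs)
                   + (1 - targets) * np.log(1 - probs))
    return bce + lam * co_occurrence_penalty(probs, P)

# Predicting 'happy' and 'sad' together costs more than the more
# plausible 'angry' and 'sad' pair, even at equal cross-entropy.
conflicting = penalized_bce([0.1, 0.9, 0.1, 0.9], [0, 1, 0, 1], P)
plausible = penalized_bce([0.9, 0.1, 0.1, 0.9], [1, 0, 0, 1], P)
```

The design intent is that the penalty only nudges ("gently discourages") rather than forbids: a rare combination can still be predicted if the evidence in the speech is strong enough to outweigh `lam` times the pairwise cost.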