
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

Jiaxing Zhao, Xihan Wei, Liefeng Bo

2025-03-10


Summary

This paper introduces R1-Omni, a new AI model that uses a technique called Reinforcement Learning with Verifiable Reward (RLVR) to improve how computers understand emotions from both video and audio.

What's the problem?

Current AI models struggle to accurately recognize emotions from combined visual and audio information, and they often can't explain how they reached their conclusions. This makes it hard to trust or improve these systems.

What's the solution?

The researchers created R1-Omni by applying RLVR to a type of AI called an Omni-multimodal large language model. This approach helped the AI improve at three important things: reasoning about emotions, accurately recognizing emotions, and handling new types of data it hadn't seen before. The model can now explain how it uses both visual and audio clues to figure out emotions.
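To make the "verifiable reward" idea concrete, here is a minimal, hypothetical sketch of what such a reward could look like for emotion recognition: the model's answer is checked exactly against the dataset's emotion label, optionally combined with a format check on the response. The tag convention, function name, and weighting below are illustrative assumptions, not details taken from the paper.

```python
import re

def verifiable_reward(model_output: str, gold_label: str) -> float:
    """Score one model response against the ground-truth emotion label.

    Hypothetical RLVR-style reward: the response is expected to contain
    free-form reasoning plus a final emotion label that can be checked
    exactly against the dataset annotation (hence "verifiable").
    """
    # Format reward: the final answer should be exposed in a
    # machine-checkable way (the <answer> tag convention is assumed here).
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    format_reward = 1.0 if match else 0.0

    # Accuracy reward: exact match between predicted and gold emotion label.
    predicted = match.group(1).strip().lower() if match else ""
    accuracy_reward = 1.0 if predicted == gold_label.strip().lower() else 0.0

    # The combined scalar is what the RL update would maximize;
    # the 50/50 weighting is purely illustrative.
    return 0.5 * format_reward + 0.5 * accuracy_reward


# Usage with a made-up response:
output = "<think>Her voice trembles and she is crying.</think><answer>sad</answer>"
print(verifiable_reward(output, "sad"))  # -> 1.0
```

Because the reward is computed by a simple rule rather than a learned judge, it gives an unambiguous training signal, which is the core appeal of RLVR in this setting.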

Why it matters?

This matters because it could lead to AI systems that are much better at understanding human emotions in real-world situations. Such systems could be used in fields like mental health, customer service, or education to better respond to people's emotional needs. It also helps researchers understand how to build AI systems that can explain their decisions, which is important for earning trust in AI technology.

Abstract

In this work, we present the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model in the context of emotion recognition, a task where both visual and audio modalities play crucial roles. We leverage RLVR to optimize the Omni model, significantly enhancing its performance in three key aspects: reasoning capability, emotion recognition accuracy, and generalization ability. The introduction of RLVR not only improves the model's overall performance on in-distribution data but also demonstrates superior robustness when evaluated on out-of-distribution datasets. More importantly, the improved reasoning capability enables clear analysis of the contributions of different modalities, particularly visual and audio information, in the emotion recognition process. This provides valuable insights into the optimization of multimodal large language models.