VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang

2025-11-05

Summary

This paper focuses on improving how computers understand and predict emotions shown in videos, building on recent advances in video large language models.

What's the problem?

Understanding emotions in videos is tricky because emotions are complex, change over time, and depend on many different cues. Current methods struggle to track these evolving emotional states accurately or to explain *why* they think a certain emotion is being displayed. It's hard for computers to reason about emotions the way humans do.

What's the solution?

The researchers created a new system that breaks emotion understanding into stages: first recognizing basic visual attributes, then analyzing facial expressions, and finally combining these cues to understand the overall emotion (a rough sketch of this staged pipeline is shown below). They developed a family of 'VidEmo' models trained in two phases: first, curriculum emotion learning to inject emotion knowledge into the models, and then 'affective-tree reinforcement learning' to teach them to reason about emotions and follow instructions. They also built a large dataset called 'Emo-CFG' with over 2 million instruction-based examples, including emotion questions and answers, fine-grained captions, and explanations for the answers, to train and test these models.
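The sketch below is not the authors' code; it only illustrates, under assumptions, how a stage-wise, cue-guided reasoning flow like the one described could be wired around any VideoLLM-style interface. The `VideoLLM` protocol, `generate` method, and prompts are hypothetical placeholders.

```python
from typing import Protocol


class VideoLLM(Protocol):
    """Hypothetical interface: answers a text prompt about a video clip."""
    def generate(self, video, prompt: str) -> str: ...


def affective_cue_reasoning(video_llm: VideoLLM, video) -> dict:
    # Stage 1: low-level attribute perception (people, facial attributes, scene).
    attributes = video_llm.generate(
        video,
        prompt="Describe the people, their facial attributes, and the scene.",
    )

    # Stage 2: expression analysis conditioned on the perceived attributes.
    expressions = video_llm.generate(
        video,
        prompt=f"Given these attributes: {attributes}\n"
               "Describe how each person's facial expression changes over time.",
    )

    # Stage 3: high-level emotion understanding with an explicit rationale,
    # grounded in the earlier cues rather than predicted in one shot.
    emotion = video_llm.generate(
        video,
        prompt=f"Attributes: {attributes}\nExpressions: {expressions}\n"
               "Infer the dominant emotion and explain the reasoning step by step.",
    )
    return {"attributes": attributes, "expressions": expressions, "emotion": emotion}
```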

Why it matters?

This work is important because it pushes the boundaries of how well computers can understand human emotions from videos. The new models and dataset achieve better results than previous methods on a range of facial expression recognition and emotion analysis tasks, paving the way for more sophisticated and empathetic AI systems.

Abstract

Understanding and predicting emotion from videos has garnered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
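To make the dataset description concrete, here is a hypothetical sketch of what a single Emo-CFG-style instruction sample could contain, based only on the components the abstract mentions (explainable question-answering, fine-grained captions, and rationales); the actual schema, field names, and content are assumptions, not the released format.

```python
# Hypothetical Emo-CFG-style sample; field names and values are illustrative
# assumptions based on the abstract, not the dataset's real schema.
sample = {
    "video": "clip_000123.mp4",  # path to the source clip
    "caption": "A woman pauses mid-sentence, lowers her gaze, and her smile fades.",
    "question": "How does the woman's emotional state change over the clip?",
    "answer": "She shifts from polite cheerfulness to suppressed sadness.",
    "rationale": (
        "Early frames show a social smile with raised cheeks; later frames show "
        "a downcast gaze, flattened mouth, and slowed head movement, cues that "
        "together indicate the onset of sadness rather than neutral calm."
    ),
}
```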