EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang

2025-11-17

Summary

This research focuses on making videos that not only *look* good, but also *feel* a certain way emotionally. Current video creation technology is really good at details like color and sharpness, but doesn't pay much attention to how the video makes viewers feel.

What's the problem?

Existing systems struggle to create videos with specific emotions because there's a lack of resources, like labeled examples, that connect how a video looks to the emotions it evokes. It's especially hard with videos that aren't trying to be realistic, like cartoons or animated clips, because the usual cues for what makes something look 'happy' or 'sad' don't necessarily apply.

What's the solution?

The researchers created a new dataset called EmoVid: a collection of videos (cartoon animations, movie clips, and animated stickers) tagged with emotion labels, descriptions of their visual qualities (like brightness, colorfulness, and hue), and text captions. By analyzing this dataset, they identified how visual features relate to emotional perception. They then used this knowledge to fine-tune a video generation model, Wan2.1, so it could create videos that better match a desired emotion, whether starting from text or an image.
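To make the visual-attribute annotations concrete, here is a minimal sketch of how per-frame brightness, colorfulness, and hue might be computed for a video frame. The paper does not specify its exact formulas, so the choices below (BT.601 luma for brightness, the Hasler-Süsstrunk opponent-space metric for colorfulness, and an opponent-space angle standing in for hue) are illustrative assumptions, not the authors' method.

```python
import numpy as np

def visual_attributes(frame: np.ndarray) -> dict:
    """Compute simple visual attributes for an (H, W, 3) RGB frame
    with values in [0, 255]. Formula choices are illustrative, not
    necessarily those used by EmoVid."""
    r = frame[..., 0].astype(float)
    g = frame[..., 1].astype(float)
    b = frame[..., 2].astype(float)

    # Brightness: mean luma with ITU-R BT.601 weights.
    brightness = (0.299 * r + 0.587 * g + 0.114 * b).mean()

    # Colorfulness: Hasler & Suesstrunk (2003) metric in
    # red-green / yellow-blue opponent space.
    rg = r - g
    yb = 0.5 * (r + g) - b
    colorfulness = np.hypot(rg.std(), yb.std()) + 0.3 * np.hypot(rg.mean(), yb.mean())

    # Dominant hue: angle of the mean opponent-space vector, in degrees
    # (a crude stand-in for a full HSV hue histogram).
    hue = float(np.degrees(np.arctan2(yb.mean(), rg.mean())) % 360.0)

    return {"brightness": float(brightness),
            "colorfulness": float(colorfulness),
            "hue": hue}

# Example: a uniform warm, bright synthetic frame.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
frame[..., 0] = 220  # red
frame[..., 1] = 140  # green
frame[..., 2] = 60   # blue
attrs = visual_attributes(frame)
print(attrs)
```

Averaging these attributes over a clip's frames gives clip-level statistics that can then be correlated with the clip's emotion label, which is the kind of spatial/temporal analysis the paper describes.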

Why it matters?

This work matters because it provides both a benchmark for measuring and a method for creating videos that effectively convey emotions. It's a step toward more expressive video generation, giving creators more control over the emotional impact of the content, which is useful for animation, marketing, and simply making more engaging videos.

Abstract

Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.