VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

2024-11-25

Summary

This paper presents VideoEspresso, a new large-scale dataset designed to improve video reasoning by pairing carefully selected video frames with high-quality question-answer pairs and step-by-step reasoning annotations, helping models understand and analyze videos in fine-grained detail.

What's the problem?

Video reasoning tasks require models to answer questions about videos, but existing video question-answering (VideoQA) datasets are often either too simplistic to support complex reasoning or rely on expensive manual labeling, which makes them hard to scale. As a result, models trained on them struggle to learn how to reason accurately about video content.

What's the solution?

To solve these issues, the authors created VideoEspresso, which is built from carefully selected video frames that preserve important spatial details and coherence over time. They developed a semantic-aware method to remove redundant frames, then generated question-answer pairs using GPT-4o, a powerful multimodal model. They also added Chain-of-Thought (CoT) annotations that guide models through the logical relationships within the video content. On top of the dataset, they propose a Hybrid LVLMs Collaboration framework in which a Frame Selector picks out the core frames and a two-stage fine-tuned reasoning model answers questions using CoT reasoning over multimodal evidence.
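To make the redundancy-reduction idea concrete, here is a minimal sketch of semantic frame filtering, assuming a CLIP-style image encoder from Hugging Face transformers. The model choice, similarity threshold, and keep-first-frame policy are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of semantic-aware frame filtering (not the paper's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_core_frames(frames: list[Image.Image], sim_threshold: float = 0.9):
    """Keep a frame only if it differs enough from the last kept frame."""
    with torch.no_grad():
        inputs = processor(images=frames, return_tensors="pt")
        embs = model.get_image_features(**inputs)
        embs = embs / embs.norm(dim=-1, keepdim=True)  # unit-normalize embeddings

    kept = [0]  # always keep the first frame (assumed policy)
    for i in range(1, len(frames)):
        sim = (embs[i] @ embs[kept[-1]]).item()  # cosine similarity to last kept
        if sim < sim_threshold:  # semantically new content -> keep this frame
            kept.append(i)
    return [frames[i] for i in kept]
```

A lower threshold keeps fewer, more distinct frames; the paper's actual semantic-aware criterion may weigh temporal coherence differently.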

Why it matters?

This research is significant because it provides a robust framework for evaluating and improving how models reason about videos. By offering a high-quality dataset with detailed annotations, VideoEspresso can help advance the capabilities of AI in understanding complex video content, which is important for applications in entertainment, education, and beyond. The release of this dataset will also encourage further research in the field of video reasoning.

Abstract

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
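The Hybrid LVLMs Collaboration framework from the abstract can be pictured as a simple two-stage pipeline: a lightweight selector first picks core frames, then a reasoning LVLM answers with chain-of-thought over those frames. The skeleton below is a hedged sketch of that flow, not the authors' implementation; ReasoningLVLM, COT_PROMPT, and the lambda selector are hypothetical stand-ins for the paper's Frame Selector and two-stage instruction fine-tuned reasoning LVLM.

```python
# Hypothetical skeleton of the two-stage Hybrid LVLMs Collaboration flow.
from dataclasses import dataclass

COT_PROMPT = (
    "Question: {question}\n"
    "Think step by step, citing the evidence visible in each frame, "
    "then give the final answer."
)

@dataclass
class ReasoningLVLM:
    """Placeholder for a two-stage instruction fine-tuned reasoning LVLM."""
    name: str

    def generate(self, frames, prompt: str) -> str:
        # A real system would run multimodal inference here.
        return f"[{self.name}] CoT answer over {len(frames)} core frames."

def answer_video_question(video_frames, question: str, selector, lvlm) -> str:
    core_frames = selector(video_frames)           # stage 1: adaptive frame selection
    prompt = COT_PROMPT.format(question=question)  # stage 2: CoT reasoning prompt
    return lvlm.generate(core_frames, prompt)

if __name__ == "__main__":
    dummy_frames = ["frame0", "frame1", "frame2"]  # stand-ins for decoded frames
    lvlm = ReasoningLVLM(name="reasoning-lvlm")
    print(answer_video_question(dummy_frames, "What causes the spill?",
                                selector=lambda f: f[::2], lvlm=lvlm))
```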