
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang

2024-10-15

Summary

This paper introduces TemporalBench, a new benchmark designed to test how well multimodal video models understand the fine-grained timing and ordering of events in videos.

What's the problem?

Understanding the timing and order of actions in videos is important for AI models that analyze video content. However, because they lack fine-grained temporal annotations, most existing video benchmarks end up resembling static image benchmarks and do not effectively evaluate how well models grasp the fine details of timing in videos. This gap makes it hard to know how good these models really are at understanding video dynamics.

What's the solution?

TemporalBench addresses this issue by providing roughly 10,000 video question-answer pairs derived from about 2,000 detailed human annotations of video clips. The benchmark allows a thorough assessment of several aspects of temporal understanding, such as how often actions occur, how large movements are, and the order of events. It also supports different tasks, including video question answering and captioning, and covers both short and long videos, making it versatile for evaluating many types of video models.
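To make the format concrete, here is a minimal sketch of what a TemporalBench-style question-answer pair and its binary scoring could look like. The field names, captions, and schema below are illustrative assumptions, not the benchmark's actual data format.

```python
# Hypothetical sketch of a TemporalBench-style QA pair and its scoring.
# Field names and structure are illustrative assumptions, not the real schema.
from dataclasses import dataclass
from typing import List


@dataclass
class TemporalQAPair:
    video_id: str
    positive_caption: str          # human-annotated, temporally detailed description
    negative_captions: List[str]   # captions with subtle temporal changes
                                   # (e.g., swapped event order, wrong action count)


def score_binary_qa(model_choice: str, pair: TemporalQAPair) -> bool:
    """A binary QA for a clip is correct only if the model picks the positive caption."""
    return model_choice == pair.positive_caption


# Example pair: the negatives differ only in fine-grained temporal detail.
pair = TemporalQAPair(
    video_id="clip_0001",
    positive_caption="The person knocks on the door twice, then opens it slowly.",
    negative_captions=[
        "The person knocks on the door three times, then opens it slowly.",
        "The person opens the door slowly, then knocks on it twice.",
    ],
)
print(score_binary_qa(pair.positive_caption, pair))  # True
```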

Why it matters?

This research is significant because it sets a higher standard for evaluating AI's ability to understand complex video content. By focusing on fine-grained temporal dynamics, TemporalBench can help improve the development of AI systems that need to analyze and generate video content accurately, which is valuable in fields like entertainment, education, and security.

Abstract

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA, where LLMs can detect the subtle changes in negative captions and use a centralized description as a cue for their predictions; to correct this bias, we propose Multiple Binary Accuracy (MBA). We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and evaluation code will be made available.
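The Multiple Binary Accuracy (MBA) metric mentioned in the abstract replaces a single multi-choice question with several binary (positive vs. one negative) comparisons, removing the cue that a "centralized" positive caption gives away. Below is a minimal sketch of how such a metric could be computed, assuming a clip counts as correct only when the model prefers the positive caption in every binary comparison; this reading is inferred from the abstract, and the function and helper names are hypothetical.

```python
# Minimal sketch of Multiple Binary Accuracy (MBA) as described in the abstract:
# a clip is scored as correct only if the model prefers the positive caption over
# EVERY negative caption in separate binary comparisons. The exact scoring rule is
# an assumption based on the abstract, not the paper's reference implementation.
from typing import Callable, List, Tuple


def multiple_binary_accuracy(
    clips: List[Tuple[str, List[str]]],            # (positive_caption, negative_captions)
    prefers_positive: Callable[[str, str], bool],  # does the model pick positive over this negative?
) -> float:
    correct = 0
    for positive, negatives in clips:
        # All binary questions for this clip must be answered correctly.
        if all(prefers_positive(positive, negative) for negative in negatives):
            correct += 1
    return correct / len(clips) if clips else 0.0


# Toy usage with a stand-in "model" that always prefers the longer caption.
clips = [
    (
        "The ball bounces twice before it is caught.",
        [
            "The ball bounces once before it is caught.",
            "The ball is caught, then bounces twice.",
        ],
    ),
]
toy_model = lambda pos, neg: len(pos) >= len(neg)
print(f"MBA: {multiple_binary_accuracy(clips, toy_model):.2f}")
```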