Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Jianrui Zhang, Mu Cai, Yong Jae Lee
2024-10-04

Summary
This paper introduces Vinoground, a new evaluation benchmark designed to test how well large multimodal models (LMMs) understand short videos, particularly their ability to reason about the temporal order of actions and how objects transform over time.
What's the problem?
While many believe that modern LMMs have largely solved short-video understanding, this paper argues that these models still struggle with basic temporal reasoning. Specifically, they have difficulty distinguishing the order in which actions occur and tracking how objects change over time. This limitation can lead to inaccurate interpretations of video content, which matters for applications like video analysis and content creation.
What's the solution?
To investigate this issue, the authors created Vinoground, a benchmark of 1,000 short, natural video-caption pairs organized as temporal counterfactuals. They used it to evaluate existing LMMs and found that even the best model, GPT-4o, only reached about 50% on the benchmark's text and video scores, far below the human baseline of around 90%. Open-source multimodal models and CLIP-based models performed much worse, mostly at random-chance levels, suggesting they guess rather than reason about temporal order.
Why it matters?
This research is important because it highlights the ongoing challenges in developing AI that can accurately understand and analyze video content. By identifying these gaps in reasoning capabilities, Vinoground serves as a tool for future improvements in LMMs, ultimately leading to better performance in tasks that require understanding complex video information.
Abstract
There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at https://vinoground.github.io.