
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin

2025-04-23


Summary

This paper introduces IV-Bench, a new benchmark designed to measure how well AI models can understand and reason about videos when they also have to use information from related images.

What's the problem?

The problem is that while AI models are getting better at understanding images and videos separately, they still struggle when they have to combine information from both at the same time, especially for more complicated reasoning tasks. There hasn’t been a good way to measure how well these models actually do at this kind of challenge.

What's the solution?

The researchers created IV-Bench, a benchmark that tests AI models on tasks where they must combine an image with a video to answer questions or solve problems. When they evaluated current models on it, they found that even the strongest ones performed well below expectations, especially on the tougher tasks.
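To make the setup concrete, here is a minimal sketch of what a single image-grounded video question and a simple accuracy score could look like. All field names, file paths, and the multiple-choice format are hypothetical illustrations, not the benchmark's actual data format.

```python
from dataclasses import dataclass

@dataclass
class IVBenchItem:
    """Hypothetical image-grounded video QA item: the question can only be
    answered by combining the video with the accompanying context image."""
    video_path: str    # the video the model must watch
    image_path: str    # related image the question is grounded in
    question: str      # question that references both modalities
    choices: list[str] # multiple-choice options (format assumed for illustration)
    answer: str        # ground-truth choice

def score(predictions: dict[str, str], items: list[IVBenchItem]) -> float:
    """Simple accuracy: fraction of items where the predicted choice is correct."""
    correct = sum(1 for item in items
                  if predictions.get(item.video_path) == item.answer)
    return correct / len(items) if items else 0.0

# Illustrative example (made-up content, not taken from IV-Bench).
example = IVBenchItem(
    video_path="videos/cooking_demo.mp4",
    image_path="images/target_dish.jpg",
    question="At what point in the video does the dish shown in the image appear?",
    choices=["A. 0:15", "B. 1:02", "C. 2:40", "D. It never appears"],
    answer="B. 1:02",
)
print(score({"videos/cooking_demo.mp4": "B. 1:02"}, [example]))  # 1.0
```

The key point the sketch illustrates is that the question cannot be answered from the video alone or the image alone; the model has to ground its video reasoning in the provided image.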

Why it matters?

This matters because it shows there's still a lot of work to do before AI can truly understand and reason about the real world the way humans do. A good test like IV-Bench helps researchers see where to focus their efforts to build smarter, more capable AI.

Abstract

IV-Bench evaluates Image-Grounded Video Perception and Reasoning in MLLMs, revealing substantial underperformance across multiple tasks and factors influencing model accuracy.