
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Simon Wang, Ping Huang, Meng Cao

2025-05-30


Summary

This paper introduces VBenchComp, a new system that sorts video benchmark questions into different types so researchers can see exactly what AI models are good at and where they struggle when understanding videos.

What's the problem?

The problem is that most tests for video AI models just give an overall score, which doesn't show whether the model is struggling with understanding timing, space, or general knowledge. This makes it hard to know what needs to be improved.

What's the solution?

The researchers created VBenchComp, which automatically groups video-related questions into categories like temporal reasoning, spatial perception, and factual knowledge. This lets them pinpoint the specific areas where an AI model is weak or strong, instead of just looking at one big score.
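To make the idea concrete, here is a minimal sketch of what sorting questions into domains could look like. This is an illustrative assumption only: the cue lists, the categorize_question helper, and the keyword heuristic are invented for this example, and the actual VBenchComp pipeline is an automated system whose details are not covered in this summary.

# Hypothetical sketch (not the paper's actual pipeline): sort video benchmark
# questions into rough capability domains and report per-domain results
# instead of a single overall score.

TEMPORAL_CUES = ("before", "after", "first", "then", "order", "how long", "while")
SPATIAL_CUES = ("left", "right", "behind", "in front of", "where", "next to")

def categorize_question(question: str) -> str:
    """Assign a video benchmark question to a rough capability domain."""
    q = question.lower()
    if any(cue in q for cue in TEMPORAL_CUES):
        return "temporal reasoning"   # depends on event order or duration
    if any(cue in q for cue in SPATIAL_CUES):
        return "spatial perception"   # answerable from layout in a single frame
    return "factual knowledge"        # falls back to general/world knowledge

# Example: tag a few questions so accuracy can be broken down by domain.
questions = [
    "What happens right after the chef cracks the egg?",
    "Where is the red car parked relative to the bus?",
    "What sport is being played in the video?",
]
for q in questions:
    print(categorize_question(q), "->", q)

Once every question carries a domain label like this, a model's accuracy can be computed separately per domain, which is the kind of breakdown the paper argues is more informative than one aggregate number.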

Why it matters?

This is important because it helps developers make AI models that are better at understanding videos in a deeper way, leading to smarter video assistants, better content analysis, and more reliable technology for things like security cameras or educational tools.

Abstract

VBenchComp, an automated pipeline, categorizes video LLM questions into different domains to evaluate temporal reasoning and isolate model weaknesses beyond overall scores.