HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
2025-12-22
Summary
This paper introduces a new challenge for video understanding AI, called HERBench, designed to test whether these AIs can truly *reason* about videos over time, rather than just spot a single obvious clue.
What's the problem?
Current tests for video question answering are too easy because they often let an AI answer questions by focusing on just one short part of the video. This doesn't check whether the AI can combine information from different moments to understand what's happening. In short, an AI could get the right answer from a single scene without truly understanding the whole video's story.
What's the solution?
The researchers created HERBench, a dataset of 26,000 video questions that *require* the AI to look at multiple, separate parts of a video to get the answer right. They also introduced a way to measure how much of the video an AI needs to look at: the 'Minimum Required Frame-Set' (MRFS), the smallest number of frames a model must combine to answer correctly. They then tested 13 of the best current video AI models on HERBench and found they performed poorly, scoring 31-42% accuracy, only slightly above the 20% random-guess baseline. They pinpointed two main reasons for this failure: the AI struggles to find the important parts of the video, and then struggles to combine the information from those parts even when it has them.
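The MRFS idea can be illustrated with a minimal sketch: search over ever-larger subsets of sampled frames until some subset lets the model answer correctly, and report that subset's size. The `answer_fn` below is a hypothetical stand-in for querying a Video-LLM on a given frame subset; the paper does not specify this exact procedure, so treat it as an illustration of the concept rather than the authors' implementation.

```python
from itertools import combinations

def minimum_required_frame_set(frames, question, correct_answer, answer_fn):
    """Return the size of the smallest frame subset with which the
    (hypothetical) answer_fn yields the correct answer, or None if
    no subset of the sampled frames suffices."""
    for k in range(1, len(frames) + 1):
        for subset in combinations(frames, k):
            if answer_fn(list(subset), question) == correct_answer:
                return k  # smallest number of frames that suffices
    return None

# Toy stand-in model: answering correctly requires fusing frames 2, 5 and 9.
required = {2, 5, 9}
def toy_answer_fn(subset, question):
    return "A" if required.issubset(subset) else "B"

print(minimum_required_frame_set(list(range(12)), "q", "A", toy_answer_fn))  # 3
```

Exhaustive subset search is exponential in the number of frames, so a practical version would use coarse frame sampling or a greedy approximation; the sketch only conveys what the metric measures.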
Why it matters?
HERBench provides a much more difficult and realistic test for video understanding AI. It highlights that current AI models aren't very good at reasoning about videos over time and identifies specific areas – finding evidence and combining information – where improvements are needed. This new benchmark will help researchers build AI that can truly 'watch' and understand videos like humans do.
Abstract
Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated pieces of visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.