VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
2026-03-24
Summary
This paper introduces a new method called VideoDetective to help computers better understand long videos when answering questions about them.
What's the problem?
Large language models, even those that can process both text and video, struggle with long videos because their limited context windows can't 'remember' everything that happens. They need to quickly find the parts of the video most relevant to a specific question, but current methods score each part only against *the question itself*, ignoring how different parts of the video connect to each other and how informative those connections might be.
What's the solution?
VideoDetective tackles this by first breaking the video into segments and then building a 'map' showing how similar each segment is to the others, both visually and in terms of when they occur in the video. It then runs a loop of guessing, checking, and improving (hypothesis, verification, refinement) to estimate how relevant each segment is to the question. Importantly, it doesn't judge each segment in isolation; it uses the 'map' to spread relevance scores between connected segments, surfacing clues it might otherwise have missed. This lets the model focus on only the most crucial parts of the video when answering.
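The segment 'map' described above can be sketched as an affinity graph combining visual similarity and temporal proximity, plus a simple score-propagation step. This is an illustrative sketch, not the paper's implementation: the segment embeddings, Gaussian temporal kernel, mixing weight `alpha`, and damping factor `beta` are all assumptions.

```python
import numpy as np

def affinity_graph(seg_embs, timestamps, sigma_t=30.0, alpha=0.5):
    """Mix visual similarity and temporal proximity into one affinity matrix.

    seg_embs:   (N, D) L2-normalized segment embeddings (assumed given).
    timestamps: (N,) segment midpoints in seconds.
    """
    visual = seg_embs @ seg_embs.T                       # cosine similarity
    dt = np.abs(timestamps[:, None] - timestamps[None, :])
    temporal = np.exp(-(dt ** 2) / (2 * sigma_t ** 2))   # Gaussian temporal kernel
    # Clamp at zero so affinities stay nonnegative for propagation.
    return np.clip(alpha * visual + (1 - alpha) * temporal, 0.0, None)

def propagate(scores, observed_mask, A, beta=0.7, iters=10):
    """Spread relevance from observed segments to unseen ones over the graph."""
    W = A / A.sum(axis=1, keepdims=True)   # row-normalize affinities
    s = scores.copy()
    for _ in range(iters):
        s = beta * (W @ s) + (1 - beta) * scores
        s[observed_mask] = scores[observed_mask]  # observed scores stay pinned
    return s
```

After propagation, unseen segments that are visually or temporally close to a high-scoring observed segment inherit part of its relevance, which is the "spreading" behavior described above.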
Why it matters?
This research is important because it significantly improves the ability of AI to understand long videos, leading to more accurate answers to questions about them. The improvements, up to 7.5% accuracy on the VideoMME-long benchmark, mean we're getting closer to AI systems that can truly 'watch' and comprehend videos the way humans do, with implications for applications like video search and automated content analysis.
Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
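The Hypothesis-Verification-Refinement loop from the abstract can be sketched roughly as below. Everything here is a hypothetical stand-in for the paper's actual components: `score_fn` replaces the MLLM's query-to-segment relevance scoring, and the per-round budget, round count, and damping factor are illustrative choices.

```python
import numpy as np

def hunt_clues(query, segments, score_fn, A, per_round=2, rounds=3, beta=0.7):
    """Sketch of a Hypothesis-Verification-Refinement loop.

    score_fn(query, seg) -> float is a hypothetical relevance scorer;
    A is a nonnegative (N, N) affinity matrix with positive row sums.
    Returns a global relevance distribution over all segments.
    """
    n = len(segments)
    observed = np.zeros(n, dtype=bool)
    scores = np.zeros(n)
    W = A / A.sum(axis=1, keepdims=True)
    for _ in range(rounds):
        # Hypothesis: rank unseen segments by current estimated relevance.
        ranked = np.argsort(-np.where(observed, -np.inf, scores), kind="stable")
        picks = [i for i in ranked if not observed[i]][:per_round]
        if not picks:
            break  # every segment has been observed
        # Verification: actually score the chosen segments against the query.
        for i in picks:
            scores[i] = score_fn(query, segments[i])
            observed[i] = True
        # Refinement: propagate observed relevance to unseen neighbors.
        for _ in range(5):
            prop = W @ scores
            scores = np.where(observed, scores, beta * prop + (1 - beta) * scores)
    return scores
```

The final distribution can then be thresholded or top-k'd to pick the sparse set of segments handed to the MLLM for answering, matching the "sparse observation" setting the abstract describes.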