Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma
2025-12-17
Summary
This paper focuses on improving how well computers can answer questions about videos, specifically by making sure they focus on the *right* parts of the video when finding the answer.
What's the problem?
Current systems, even advanced ones built on large language models, often struggle to pinpoint the exact moments in a video that are relevant to a question. They might look at the wrong sections or even 'hallucinate' information not actually present in the video, leading to incorrect answers. Existing methods that try to fix this, such as those based on Group Relative Policy Optimization (GRPO), haven't fully solved the problem of accurately connecting the answer to the visual evidence.
What's the solution?
The researchers developed a new approach called Zoom-Zero. It works in two steps: first, it broadly identifies the sections of the video that *might* contain the answer. Then, it 'zooms in' on the most important frames within those sections to verify the answer and make sure it's grounded in what's actually happening visually. They also improved how the system learns by rewarding it for accurate zooming and by carefully assigning credit to the parts of the system responsible for finding the right video segments and generating the answer.
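The two-step pipeline can be sketched as a small piece of Python. This is a hypothetical illustration, not the authors' code: the function names, the sliding-window localization, and the per-frame relevance scores are all assumptions standing in for the model's learned grounding.

```python
# Hypothetical sketch of a coarse-to-fine zoom-in, not the paper's implementation.
# Assumes per-frame relevance scores (e.g. from query-frame similarity) are given.

def coarse_localize(frame_scores, window):
    """Coarse step: slide a fixed-size window over per-frame relevance
    scores and return the (start, end) span with the highest total score."""
    best_start = max(
        range(len(frame_scores) - window + 1),
        key=lambda s: sum(frame_scores[s:s + window]),
    )
    return best_start, best_start + window

def zoom_in(frame_scores, span, top_k):
    """Fine step: within the localized span, keep the top-k most salient
    frame indices for finer-grained visual verification of the answer."""
    start, end = span
    ranked = sorted(range(start, end), key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:top_k])

# Toy scores for an 8-frame video; the relevant event peaks around frames 2-4.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.05, 0.3]
span = coarse_localize(scores, window=4)   # candidate segment
frames = zoom_in(scores, span, top_k=2)    # frames to verify against
```

In a real system the verification step would re-query the model on the zoomed-in frames; here the sketch only shows how the coarse segment narrows the search before the fine selection.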
Why it matters?
This work is important because it significantly improves the accuracy of video question answering systems. By better understanding *when* to look in a video, the system gives more reliable answers and handles longer videos more effectively, preserving critical visual details without losing the overall context. This advancement brings us closer to AI that can truly 'understand' and interact with video content.
Abstract
Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
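The token-selective credit assignment described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the reward definitions, the "ground"/"answer" token tagging, and all names are hypothetical. The idea is that instead of broadcasting one mixed scalar reward to every token (standard GRPO), each token receives the group-normalized advantage of the reward it is actually responsible for.

```python
# Hypothetical sketch of token-selective credit assignment, not the authors' code.
# Assumes each output token is tagged as a grounding token (e.g. a predicted
# timestamp) or an answer token.

def group_advantage(rewards):
    """GRPO-style advantage: normalize each sample's reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for constant rewards
    return [(r - mean) / std for r in rewards]

def token_advantages(token_roles, ground_adv, answer_adv):
    """Give each token the advantage of its own reward facet: grounding tokens
    get the temporal-grounding advantage, answer tokens the accuracy advantage."""
    return [ground_adv if role == "ground" else answer_adv for role in token_roles]

# A group of 2 sampled responses to the same question, scored on two facets.
ground_rewards = [1.0, 0.0]   # e.g. temporal IoU with the target segment
answer_rewards = [0.0, 1.0]   # e.g. zoom-in answer accuracy
g_adv = group_advantage(ground_rewards)
a_adv = group_advantage(answer_rewards)

# Token roles for response 0: two timestamp tokens, then two answer tokens.
roles = ["ground", "ground", "answer", "answer"]
per_token = token_advantages(roles, g_adv[0], a_adv[0])
```

With a single mixed reward, response 0 (good grounding, wrong answer) would get one ambiguous signal; here its timestamp tokens are reinforced while its answer tokens are penalized, which is the separation the abstract attributes to token-selective credit assignment.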