Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Kunyi Wang, Rui Xu, Sen Hu, Jianheng Hou, Hao Peng, Chengwei Qin, Xiaobin Hu, Hong Peng, Ronghao Chen, Huacan Wang

2026-01-13

Summary

This paper introduces a new challenge for artificial intelligence systems that answer questions about videos, specifically when the answer isn't directly *in* the video and requires searching the internet for information.

What's the problem?

Current AI models struggle with video question answering when the clues are scattered across a video and the final answer requires information found online. A system has to pinpoint the important visual cues in the video, use those cues to search the web effectively, and then piece together evidence from both the video and the web to answer confidently. In practice, models often drift away from what they were originally looking for as the search goes on.

What's the solution?

The researchers created a new dataset called VideoDR, which contains videos and questions that *require* searching the internet to answer. They then tested AI systems under two setups: a Workflow paradigm, where the AI follows a fixed pipeline of steps, and an Agentic paradigm, where the AI plans and searches on its own. They found that letting the AI act as an agent wasn't always better; the gains depended on how well the AI could hold on to the initial visual clues from the video throughout its web search.
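To make the agentic setup concrete, here is a minimal sketch of what such a loop might look like. This is not the paper's implementation; the function names, the toy search corpus, and the anchor representation are all hypothetical. The key idea it illustrates is that every web query is conditioned on the visual anchors extracted up front, which is what guards against the goal drift the paper identifies.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    anchors: list                       # visual cues extracted from the video
    evidence: list = field(default_factory=list)

def extract_anchors(frames):
    """Stand-in for cross-frame visual anchor extraction (hypothetical)."""
    return sorted({cue for frame in frames for cue in frame})

def web_search(query):
    """Stub for an open-web search tool; returns canned snippets."""
    corpus = {
        "red jersey number 10": ["snippet: player #10 wears red for Team X"],
        "stadium skyline": ["snippet: the skyline matches City Y's arena"],
    }
    return corpus.get(query, [])

def agentic_loop(question, frames, max_steps=4):
    state = AgentState(question=question, anchors=extract_anchors(frames))
    # Each retrieval step re-reads the original anchors instead of only the
    # latest search results; queries stay tied to the initial visual cues.
    for anchor in state.anchors[:max_steps]:
        state.evidence.extend(web_search(anchor))
    return state

state = agentic_loop(
    "Which team is playing?",
    frames=[["red jersey number 10"], ["stadium skyline"]],
)
print(len(state.evidence))  # one snippet retrieved per anchor
```

A real agent would replace `extract_anchors` with a multimodal model and `web_search` with a live retrieval tool, and would loop until it can verify an answer; the fixed-pipeline Workflow paradigm would instead run these stages once, in a predetermined order.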

Why it matters?

This work is important because it highlights the difficulties in building AI systems that can truly 'understand' videos and use external knowledge to answer complex questions. It provides a standard benchmark for researchers to test and improve these systems, and it points to the areas that need the most attention: staying focused on the original question and maintaining consistency over long searches.

Abstract

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.