Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang, Paolo Rota, Nicu Sebe, Zheng Liu, Lizi Liao
2025-12-30
Summary
This paper focuses on the challenge of building AI agents that can effectively research and understand information from videos on the internet, going beyond simply 'watching' videos to actively investigating them.
What's the problem?
Current AI models are really good at getting information from text and even images, but they struggle with video because most tests just show them pre-selected clips. Real-world video research requires an AI to *find* relevant parts within a video, connect information across different parts of the video, and double-check facts using other sources on the web. Existing benchmarks don't test this kind of active, investigative video research.
What's the solution?
The researchers created a new test called Video-BrowseComp. This test gives AI agents questions that can *only* be answered by carefully watching videos, finding specific moments within those videos, and then using the internet to verify information. It forces the AI to act like a researcher, not just a viewer. They then tested several advanced AI models, including a very powerful one called GPT-5.1 with search capabilities, on this new test.
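To make the task concrete, the loop the benchmark demands can be sketched as follows. This is a minimal toy illustration, not the paper's actual pipeline: `locate_moments` and `web_verify` are hypothetical stubs standing in for real video-browsing and web-search tools.

```python
# Hypothetical sketch of the agentic video-research loop Video-BrowseComp
# targets: find temporal visual evidence, then cross-check it on the open web.
# The two tool functions are stubs, not a real API.
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str       # "video" or "web"
    timestamp: float  # seconds into the video
    claim: str        # what the agent believes it saw

def locate_moments(video_id: str, question: str) -> list[Evidence]:
    # Stub: a real agent would navigate the video timeline and inspect frames.
    return [Evidence("video", 72.0, "the player wearing number 10 scores first")]

def web_verify(claim: str) -> bool:
    # Stub: a real agent would issue open-web searches to cross-reference the claim.
    return True

def answer_question(video_id: str, question: str) -> str:
    evidence = locate_moments(video_id, question)
    verified = [e for e in evidence if web_verify(e.claim)]
    # Answer only when visual evidence survives external verification.
    return verified[0].claim if verified else "unknown"

print(answer_question("demo_video", "Who scored first?"))
```

The key property the benchmark enforces is that `locate_moments` cannot be skipped: answers must depend on temporal visual evidence, so an agent that only calls a web-search tool fails.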
Why does it matter?
The results showed that even the best AI models aren't very good at this type of video research, often relying on text descriptions instead of actually 'seeing' and understanding the video content. This new benchmark highlights a major weakness in current AI and pushes the field towards building agents that can truly reason about and learn from the dynamic world of online video.
Abstract
The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant gap remains in handling the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.