
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang

2025-10-24

Summary

This paper introduces a new video reasoning model called Open-o3 Video that doesn't just give answers to questions about videos, but also *shows* where in the video it found the evidence to support its answer, pinpointing both the time and the objects involved.

What's the problem?

Current video reasoning models are like students who can give you the right answer but can't show their work. They generate explanations, but they don't link those explanations to specific moments or objects within the video itself. Extending evidence highlighting from images to videos is also harder, because a video adds time on top of location: the model has to track where things are *and* how they move and change from frame to frame.

What's the solution?

The researchers created Open-o3 Video, which highlights key moments (timestamps) and objects (bounding boxes) in a video alongside its answers. To make this possible, they built two new datasets, STGR-CoT-30k and STGR-RL-36k, with annotations designed to teach the model to reason about both time and space in videos. They then trained it with reinforcement learning, rewarding the model for giving accurate answers, correctly identifying when things happen, and precisely locating the objects involved.
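To make the reward idea concrete, here is a minimal sketch of how such a combined reward could be computed, assuming the model outputs an answer, a time span, and a bounding box for each question. The weights, the IoU-based terms, and the exact-match answer check are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a combined spatio-temporal reward.
# The weights and reward terms are assumptions for illustration only.

def temporal_iou(pred_span, gt_span):
    """IoU between a predicted and a ground-truth time interval (seconds)."""
    start = max(pred_span[0], gt_span[0])
    end = min(pred_span[1], gt_span[1])
    inter = max(0.0, end - start)
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(pred_box, gt_box):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    x2, y2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def combined_reward(pred, gt, w_ans=1.0, w_time=0.5, w_space=0.5):
    """Reward = answer correctness + temporal alignment + spatial precision."""
    r_answer = 1.0 if pred["answer"] == gt["answer"] else 0.0
    r_time = temporal_iou(pred["span"], gt["span"])
    r_space = box_iou(pred["box"], gt["box"])
    return w_ans * r_answer + w_time * r_time + w_space * r_space
```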

Why it matters?

Open-o3 Video is a big step forward because it makes video reasoning more transparent and reliable. It doesn't just give you an answer; it shows *why* it gave that answer, which builds trust. The highlighted evidence can also be used to double-check the model's reasoning and improve its accuracy, and the model outperforms existing ones on several video understanding benchmarks.

Abstract

Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
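As an illustration of the confidence-aware verification idea mentioned at the end of the abstract, here is a minimal sketch that samples several grounded reasoning traces and picks the answer carrying the most total confidence. The trace format, field names, and scoring rule are assumptions for illustration, not the paper's exact method.

```python
# Minimal sketch of confidence-aware answer selection over sampled traces.
# Each hypothetical trace carries an answer, its spatio-temporal evidence
# (timestamp in seconds, box as x1, y1, x2, y2), and a confidence in [0, 1].

from collections import defaultdict

traces = [
    {"answer": "the person picks up the red cup",
     "evidence": {"timestamp": 12.4, "box": [220, 140, 310, 260]},
     "confidence": 0.82},
    {"answer": "the person picks up the red cup",
     "evidence": {"timestamp": 13.1, "box": [215, 150, 305, 255]},
     "confidence": 0.74},
    {"answer": "the person opens the drawer",
     "evidence": {"timestamp": 4.0, "box": [50, 300, 180, 420]},
     "confidence": 0.31},
]

def confidence_weighted_answer(traces):
    """Return the answer whose traces accumulate the most confidence."""
    scores = defaultdict(float)
    for trace in traces:
        scores[trace["answer"]] += trace["confidence"]
    return max(scores, key=scores.get)

print(confidence_weighted_answer(traces))
# -> "the person picks up the red cup"
```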