LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
2025-12-02
Summary
This paper introduces LongVT, a system designed to help AI models better understand and reason about long videos, specifically by reducing their tendency to 'hallucinate', or make things up, when the relevant information is sparse and scattered.
What's the problem?
Current AI models, even advanced ones, struggle with long videos because key details are spread out over time and hard to pinpoint in so much footage. They often guess or invent details, leading to inaccurate conclusions. Essentially, they lack a good strategy for focusing on the parts of a long video that actually matter for answering a question.
What's the solution?
The researchers created LongVT, which mimics how humans watch long videos. Instead of trying to process everything at once, it first skims the video for a general overview, then 'zooms in' on specific clips that seem relevant to the question being asked, examining them in finer detail. It repeats this focusing step, refining its understanding, until it can confidently answer based on actual visual evidence. The researchers also created a new dataset, VideoSIAH, to help train and evaluate these kinds of systems.
Why it matters?
This work is important because it improves the reliability of AI when dealing with video data. Better video understanding has many applications, like automatically summarizing videos, answering questions about events in videos, and even helping robots understand their surroundings. By reducing hallucinations, LongVT makes AI more trustworthy and useful in these scenarios.
Abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long-video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training data comprise 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
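To make the global-to-local loop concrete, the sketch below shows one way such an interleaved tool-calling cycle could be wired up: the model skims sparsely sampled frames, emits a tool call naming a time window it wants to inspect, the framework crops that window and resamples denser frames, and the cycle repeats until the model commits to an answer. This is a minimal illustration under assumed interfaces; `lmm_generate`, `crop_and_resample`, `ToolCall`, and the overall schema are hypothetical placeholders, not LongVT's actual implementation.

```python
# Minimal sketch of a global-to-local "thinking with long videos" loop.
# All names (lmm_generate, crop_and_resample, ToolCall) are hypothetical
# placeholders, not the authors' API.

from dataclasses import dataclass


@dataclass
class ToolCall:
    start_sec: float  # start of the clip the model wants to zoom into
    end_sec: float    # end of that clip


def lmm_generate(frames, question, history):
    """Placeholder for an LMM call that returns either a ToolCall
    (a request to zoom into a clip) or a final answer string."""
    raise NotImplementedError


def crop_and_resample(video_path, start_sec, end_sec, num_frames=32):
    """Placeholder: decode only [start_sec, end_sec] (end_sec=None means
    the whole video) and sample num_frames frames from that span."""
    raise NotImplementedError


def answer_long_video(video_path, question, max_rounds=5):
    # Round 0: skim the whole video with sparse, uniformly sampled frames.
    frames = crop_and_resample(video_path, 0.0, None, num_frames=64)
    history = []
    for _ in range(max_rounds):
        step = lmm_generate(frames, question, history)
        if isinstance(step, ToolCall):
            # The model grounded a relevant time window: crop it and
            # resample finer-grained frames, then keep reasoning.
            history.append(step)
            frames = crop_and_resample(video_path, step.start_sec, step.end_sec)
        else:
            # The model produced an answer grounded in the retrieved clip.
            return step
    # Tool budget exhausted: request a final answer without further zooming.
    return lmm_generate(frames, question + " Answer now.", history)
```

In this sketch the zoom-in decision lives entirely inside the model's own output, which mirrors the abstract's point that temporal grounding is used as a native tool rather than an external retrieval module.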