Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang
2025-12-12
Summary
This paper focuses on improving how well artificial intelligence can answer questions about videos, specifically by helping it understand what's happening both in individual scenes and over time.
What's the problem?
Current AI models, even very large ones, struggle with Video Question Answering because they must understand both *where* things are in a video (spatial relationships) and *how* things change over time (temporal dynamics). Answering a question often requires connecting events across different parts of a video while pinpointing exactly which regions matter, but models frequently get confused or take reasoning shortcuts.
What's the solution?
The researchers created a 'Video Toolkit' – a set of tools that let the AI analyze videos in more detail. They also developed a 'Spatiotemporal Reasoning Framework' (STAR) that acts like a manager, deciding which tool to use and when. STAR plans the tool sequence deliberately: it first localizes *where* to look in the video, and only then analyzes *how* things change over time in that specific area. This keeps the AI from jumping to quick, incorrect conclusions and instead helps it build a more thorough understanding.
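The spatial-then-temporal scheduling idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tool names (`spatial_ground`, `temporal_analyze`) and their behavior are assumptions made for the example; only the fixed two-stage ordering reflects the description above.

```python
def spatial_ground(frame, question):
    """Hypothetical spatial tool: localize the region of the frame
    relevant to the question (stand-in logic for illustration)."""
    return {"frame": frame, "region": "center"}

def temporal_analyze(regions):
    """Hypothetical temporal tool: reason about how the localized
    regions evolve across frames."""
    return f"tracked {len(regions)} localized regions over time"

def star_schedule(frames, question):
    """Schedule tools in a fixed spatial-then-temporal order, so the
    model cannot shortcut straight to a temporal guess: first decide
    *where* to look, then analyze *how* that area changes."""
    regions = [spatial_ground(f, question) for f in frames]  # stage 1: spatial
    return temporal_analyze(regions)                         # stage 2: temporal

print(star_schedule(["frame0", "frame1", "frame2"], "What does the dog do?"))
```

In the actual framework the scheduler is an MLLM choosing among many tools; the point of the sketch is only the enforced ordering, which is what prevents the "toolchain shortcut" failure mode mentioned in the abstract.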
Why it matters?
This work is a step towards creating AI assistants that can truly 'watch' and understand videos like humans do. This has big implications for things like automated video analysis, helping people with visual impairments, and building more intelligent robots that can interact with the real world.
Abstract
The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships within video frames and understand the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and to balance the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench. We believe our proposed Video Toolkit and STAR framework mark an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.