Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau

2025-09-19

Summary

This paper focuses on a problem in video understanding called spatio-temporal video grounding, which means pinpointing *where* and *when* something happens in a video based on a text description. The researchers explored using powerful AI models called multimodal large language models (MLLMs) to do this without needing specific training for the task, a 'zero-shot' approach.

What's the problem?

Current MLLMs, while good at understanding both text and images, aren't always accurate at connecting a text description to the specific part of a video it refers to. The paper identifies two main issues. First, these models dynamically assign special internal tokens (which the authors call 'grounding tokens') to link the text to the video, but on their own these aren't always effective. Second, the models struggle to fully use all the cues in the text, like details about *what* an object looks like or *what* action it is performing, to find the right video segment.

What's the solution?

To fix this, the researchers built a new framework around MLLMs. First, a 'decomposed spatio-temporal highlighting' step breaks the text description into sub-queries focusing on attributes (like color or size) and actions. A 'logit-guided re-attention' module then steers the model's attention toward the visual regions and moments most relevant to those attributes and actions. Finally, because the identified location should make sense over time, a 'temporal-augmented assembling' strategy combines predictions from the original video and slightly altered versions of it, so the identified video segment stays temporally consistent.
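As a toy illustration of the decomposition idea (the function names, probe templates, and scoring rule below are our own assumptions, not the authors' code), the query can be split into an attribute probe that asks *where* the target is and an action probe that asks *when* it acts, with per-frame evidence combined afterwards:

```python
# Hypothetical sketch of query decomposition into attribute/action sub-queries.
# All names and templates here are illustrative, not the paper's implementation.

def decompose_query(attribute_phrase, action_phrase):
    """Split one grounding query into a spatial (attribute) and a temporal (action) probe."""
    return {
        "spatial": f"Is there {attribute_phrase} in this frame?",      # where is the target?
        "temporal": f"Is the target {action_phrase} at this moment?",  # when does it act?
    }

def combine_probe_scores(spatial_scores, temporal_scores):
    """Per-frame evidence that BOTH the attribute and the action match (simple product)."""
    return [s * t for s, t in zip(spatial_scores, temporal_scores)]

# Example query: "the man in a red shirt jumping over the fence"
probes = decompose_query("a man in a red shirt", "jumping over the fence")
```

In the actual method, an MLLM would answer each probe per frame and the logit-guided re-attention module would use those answers to refocus the model; the product rule above just captures the intuition that a frame must satisfy both sub-queries.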

Why it matters?

This work is important because it shows how to get these large AI models to perform a complex video understanding task without needing a lot of specialized training data. This 'zero-shot' capability is valuable because gathering and labeling video data is expensive and time-consuming. By improving the accuracy of video grounding, this research could help with applications like video editing, content retrieval, and even robotics where machines need to understand what's happening in the visual world.

Abstract

Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as grounding tokens, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to the inability to fully integrate the cues in the text query (e.g., attributes, actions) for inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries for inquiring about the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatially and temporally related visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy to assemble the predictions using the original video frames and the temporal-augmented frames as inputs to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
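The assembling idea behind TAS can be sketched in a few lines. This is a minimal toy illustration under our own assumptions (per-frame 'yes' scores from each view are already aligned to the original frame order, and the temporal segment is taken as the longest run of high fused scores); the real method operates on MLLM predictions, not hand-set numbers:

```python
# Toy sketch of temporal-augmented assembling (TAS): fuse per-frame grounding
# scores from the original video and a temporally augmented view, then pick a
# consistent segment. Thresholds and the fusion rule are illustrative choices.

def assemble_scores(orig_scores, aug_scores):
    """Average per-frame scores from the original and augmented views."""
    return [(a + b) / 2 for a, b in zip(orig_scores, aug_scores)]

def best_segment(scores, threshold=0.5):
    """Return (start, end) of the longest contiguous run of scores >= threshold.

    The end index is exclusive; (0, 0) means no frame passed the threshold.
    """
    best = (0, 0)
    cur_start = None
    for i, s in enumerate(scores + [0.0]):  # trailing sentinel flushes the last run
        if s >= threshold and cur_start is None:
            cur_start = i
        elif s < threshold and cur_start is not None:
            if i - cur_start > best[1] - best[0]:
                best = (cur_start, i)
            cur_start = None
    return best

# Frames where only one view fires get damped, improving temporal consistency.
fused = assemble_scores([0.2, 0.7, 0.8, 0.6, 0.1], [0.3, 0.9, 0.7, 0.5, 0.2])
segment = best_segment(fused)  # -> (1, 4): frames 1..3 form the grounded tube
```

Averaging the two views is the simplest fusion choice; the point is only that a prediction which survives temporal augmentation is more trustworthy than one that appears in a single view.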