
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu

2026-01-14


Summary

This paper introduces VideoLoom, a new artificial intelligence model that's really good at understanding what's happening in videos, tracking both *where* things happen within each frame and *when* they happen over time.

What's the problem?

Existing AI models often struggle to fully grasp videos because they either focus on recognizing objects within a frame (spatial understanding) or understanding the sequence of events (temporal understanding), but not both together effectively. There also wasn't a good, comprehensive set of data available to train and test these kinds of models on both spatial and temporal aspects of video at the same time.

What's the solution?

The researchers created VideoLoom, a model designed for 'joint' spatial-temporal understanding. To train it, they built a new dataset called LoomData-8.7k, containing detailed video descriptions that pinpoint *where* and *when* specific actions occur. They also created LoomBench, a set of challenging video questions that tests how well a model understands both space and time. Trained on this data and evaluated on LoomBench, VideoLoom achieved top results on several video understanding tasks.

Why it matters?

This work is important because it pushes the field of AI closer to truly understanding video content like humans do. It provides a new standard for evaluating video AI models and offers a powerful tool for applications like video editing, automated video analysis, and more advanced robotics that need to interact with the real world.

Abstract

This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
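To make the abstract's numbers concrete: "R1@0.7" on Charades-STA is the fraction of queries whose top-1 predicted time segment overlaps the ground-truth segment with temporal IoU of at least 0.7. The sketch below (not from the paper; the function names and example segments are illustrative) shows how this metric is typically computed.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold=0.7):
    """R1@threshold: share of top-1 predictions whose IoU with the
    ground truth meets the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Hypothetical top-1 predictions and ground-truth segments (seconds):
preds = [(2.0, 9.5), (0.0, 4.0), (10.0, 20.0)]
gts = [(2.5, 10.0), (5.0, 9.0), (10.5, 19.0)]
print(round(recall_at_1(preds, gts, 0.7), 3))  # two of three hits -> 0.667
```

The J&F score on ReVOS is analogous but spatial: J is the mask IoU between predicted and ground-truth object masks, F is a boundary F-measure, and the two are averaged.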