VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Ranjay Krishna, Fahad Shahbaz Khan, Salman Khan
2025-06-18
Summary
This paper introduces VideoMolmo, a multimodal model that combines video understanding with pointing, grounding objects in both space and time. A temporal attention mechanism lets each frame incorporate information from earlier frames, so the model can follow how objects move across the video, and SAM2 is used to turn the predicted points into clean segmentation masks of the pointed objects.
What's the problem?
Many current video models can track objects, but they struggle to reason about subtle movements and changes over time, which precise pointing and fine-grained understanding in videos require.
What's the solution?
The researchers built VideoMolmo on an earlier model called Molmo and added a temporal module that makes each video frame aware of the previous ones through attention. They also devised a new method that uses SAM2 to fuse the predicted points into coherent masks across all video frames. They trained the model on a large video dataset and evaluated it on a new benchmark covering a variety of real-world scenarios.
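The core temporal idea, each frame attending to itself and the frames before it, can be sketched with plain causal attention. This is an illustrative NumPy sketch of the mechanism described above, not VideoMolmo's actual implementation; the real model would use learned query/key/value projections over vision-language features.

```python
import numpy as np

def temporal_attention(frame_feats):
    """Causal attention over per-frame features.

    frame_feats: (T, D) array, one feature vector per video frame.
    Each frame attends only to itself and earlier frames, so later
    frames become "aware" of what came before (a simplified sketch
    of the temporal module described in the paper summary).
    """
    T, D = frame_feats.shape
    # Raw features serve as queries, keys, and values here; a real
    # model would apply learned linear projections first.
    scores = frame_feats @ frame_feats.T / np.sqrt(D)  # (T, T)
    # Causal mask: frame t may not attend to frames t+1..T-1.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf
    # Row-wise softmax over the allowed (past and current) frames.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ frame_feats                       # (T, D)

feats = np.random.default_rng(0).standard_normal((4, 8))
out = temporal_attention(feats)
print(out.shape)
```

Because of the causal mask, the first frame can only attend to itself, so its output equals its input; every later frame's output is a weighted mix of its own and earlier frames' features.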
Why it matters?
This matters because it lets AI systems understand and interact with dynamic video scenes more reliably, which is useful for robotics, autonomous driving, video editing, and other applications that require precise, fine-grained video analysis.
Abstract
VideoMolmo, a multimodal model incorporating a temporal attention mechanism and SAM2 for mask fusion, enhances spatio-temporal pointing accuracy and reasoning capabilities in diverse real-world scenarios.