
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma

2025-01-15

Summary

This paper introduces Omni-RGPT, a new AI model that can understand specific regions of both images and videos. It's like having a super-smart assistant that can point out and describe individual things in pictures and movies, all using a single, shared method.

What's the problem?

Current AI models can describe images or videos as a whole, but they struggle to handle region-level understanding for both with one unified approach. In particular, they have trouble identifying and following a specific part of a video, especially when objects move around, which usually requires extra tracking machinery. This makes it hard for AI to understand and describe complex scenes the way humans do.

What's the solution?

The researchers created Omni-RGPT, which uses a system called Token Mark. It's like giving the AI a set of digital highlighters to mark important areas in images and videos. The same mark is placed both on the visual region (defined by a box or mask) and in the text prompt, so the words and the region are directly linked, and the marks stay consistent even as things move between frames. They also built a large collection of video clips with detailed region-level descriptions, called RegVID-300k, to train the model to understand and describe specific parts of videos really well.
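The "digital highlighter" idea can be sketched in a few lines. Everything below (the grid size, the `apply_token_mark` helper, the box format) is a hypothetical illustration of the general mechanism, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4x4 grid of visual features with 8-dim embeddings.
H, W, D = 4, 4, 8
visual_feats = rng.normal(size=(H, W, D))

# A small pool of "Token Mark" embeddings, one per target region.
# (In the real model these would be learned; here they are random stand-ins.)
NUM_MARKS = 4
token_marks = rng.normal(size=(NUM_MARKS, D))

def apply_token_mark(feats, mark, box):
    """Add a token-mark embedding to every feature cell inside a box.

    `box` is (x0, y0, x1, y1) in grid coordinates, end-exclusive.
    The same mark vector is injected at every covered location, so the
    region carries one consistent identity regardless of where it sits.
    """
    x0, y0, x1, y1 = box
    out = feats.copy()
    out[y0:y1, x0:x1] += mark
    return out

# Highlight one region (say, a box around a person) with mark 0.
marked = apply_token_mark(visual_feats, token_marks[0], box=(1, 1, 3, 3))

# The same mark embedding also stands in for a token in the text prompt,
# e.g. "What is <mark_0> doing?", so the text can refer to the region by
# its shared embedding rather than by raw coordinates.
text_prompt_embeddings = [token_marks[0]]  # placeholder for <mark_0>

# For video, the identical mark would be injected into each frame's
# features, keeping the region's identity stable across time.
```

Because the link is carried by the embedding itself rather than by pixel positions, the model doesn't need an explicit tracker to know that the marked thing in frame 10 is the same thing it saw in frame 1.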

Why it matters?

This matters because it could make AI much better at understanding visual information in a way that's more like how humans do. It could lead to smarter virtual assistants that can describe what's happening in videos, help with video search engines, or even assist in fields like security or healthcare where understanding specific parts of images and videos is crucial. By making AI better at understanding both images and videos in the same way, it opens up new possibilities for how we can use AI in our daily lives and in various industries.

Abstract

We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
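The abstract mentions an auxiliary task that exploits the consistency of the Token Mark tokens across frames. The paper does not spell out its loss here, so the following is only a guessed-at sketch of what such a consistency objective could look like, with invented shapes and a hypothetical `temporal_consistency_loss` function:

```python
import numpy as np

rng = np.random.default_rng(1)

D = 8  # embedding size (hypothetical)

# Region features pooled from the marked area in each of 3 video frames.
# In a real model these would come from the visual encoder; random here.
frame_region_feats = rng.normal(size=(3, D))

def temporal_consistency_loss(region_feats):
    """Hypothetical auxiliary objective: pull each frame's region feature
    toward the mean across frames, so the same Token Mark stays attached
    to a consistent region representation over time (no tracklets needed)."""
    mean_feat = region_feats.mean(axis=0)
    return float(((region_feats - mean_feat) ** 2).mean())

loss = temporal_consistency_loss(frame_region_feats)
```

The loss is zero when the marked region looks identical in every frame and grows as the per-frame representations drift apart, which is the kind of signal that could keep a region's interpretation stable across a video.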