Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
2024-12-05

Summary
This paper introduces Inst-IT, a new approach to improve how large multimodal models (LMMs) understand specific instances in images and videos by using explicit visual prompts during instruction tuning.
What's the problem?
While current LMMs can understand images and videos at a holistic level, they often struggle to focus on specific details or individual instances within those visuals. This lack of instance-level understanding makes it difficult for these models to accurately identify and reason about individual objects or elements, which is crucial for tasks like object detection and tracking.
What's the solution?
To address this, the researchers developed an automated annotation pipeline that uses GPT-4o to extract detailed instance-level information from images and videos, guided by explicit visual prompts (markers overlaid on individual instances). Building on this pipeline, they created the Inst-IT framework, which comprises a benchmark for diagnosing instance-level understanding (Inst-IT Bench), a large-scale instruction-tuning dataset, and a continuous instruction-tuning paradigm that strengthens a model's grasp of spatial and temporal relationships between instances. Together, these allow the model to better comprehend specific elements in both images and videos; a rough sketch of the visual-prompt annotation step appears below.
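To make the idea of explicit visual prompting concrete, here is a minimal sketch of one annotation step: instance boxes (assumed to come from an upstream detector or tracker) are overlaid with numeric ID markers, and GPT-4o is asked to describe each labeled instance. The function names (`draw_instance_markers`, `annotate_instances`), the `boxes` input format, the marker style, and the prompt wording are all illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of an explicit-visual-prompt annotation step.
# Assumes instance boxes are already available from a detector/tracker;
# the real Inst-IT pipeline's prompts and marker style may differ.
import base64
import io

from openai import OpenAI  # official OpenAI Python client
from PIL import Image, ImageDraw


def draw_instance_markers(
    image: Image.Image,
    boxes: list[tuple[int, int, int, int, int]],  # (instance_id, x0, y0, x1, y1)
) -> Image.Image:
    """Overlay a box and a numeric ID marker on each instance."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for inst_id, x0, y0, x1, y1 in boxes:
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(inst_id), fill="red")
    return marked


def annotate_instances(image: Image.Image, boxes) -> str:
    """Ask GPT-4o for per-instance descriptions keyed by the drawn IDs."""
    marked = draw_instance_markers(image, boxes)
    buf = io.BytesIO()
    marked.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Each object in the image is labeled with a red "
                         "numeric ID. Describe every labeled instance "
                         "individually, keyed by its ID."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

For video, the paper's pipeline presumably applies the same pattern frame by frame with instance IDs kept consistent over time; the core idea is unchanged: mark instances explicitly, then query the model about each one.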
Why it matters?
This research is significant because it enhances the capabilities of AI models in recognizing and understanding individual objects within complex visuals. By improving instance-level understanding, Inst-IT can lead to better performance in various applications such as autonomous driving, robotics, and augmented reality, where precise identification of objects is essential for effective operation.
Abstract
Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline assisted by GPT-4o to extract instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that effectively enhances the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.