
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan

2024-06-28


Summary

This paper talks about OMG-LLaVA, a new system that combines detailed, pixel-level image understanding with reasoning abilities. It lets users interact with the model through both visual and text prompts, making it more flexible and able to handle a wider range of tasks.

What's the problem?

Current methods for understanding images and videos can analyze pixel-level details well but struggle with reasoning and following text instructions. On the other hand, large models that can understand language and engage in conversations often lack the ability to analyze images at a detailed level. This gap makes it hard to create systems that can effectively work with both images and text together.

What's the solution?

To solve this problem, the authors developed OMG-LLaVA, which uses a universal segmentation method as its visual encoder. The encoder takes in image information and combines it with user prompts to create visual tokens that the language model can understand. The system processes both text instructions and visual data, allowing it to give detailed text responses and perform tasks like segmenting an image into different parts based on user requests. The authors also introduce a method called perception prior embedding, which fuses the segmentation model's perception priors with the image features before they reach the language model.
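To make the overall flow concrete, here is a minimal, illustrative sketch of this kind of pipeline. It is not the authors' implementation: the module names, projection layers, and tensor interfaces (`seg_encoder`, `mask_decoder`, the split into pixel-centric and object-centric tokens) are assumptions used only to show how one segmentation encoder, one LLM, and one mask decoder could be wired end to end.

```python
import torch
import torch.nn as nn


class OMGLLaVASketch(nn.Module):
    """Illustrative-only sketch: one segmentation encoder, one LLM, one decoder.

    All submodules are assumed to be provided; their interfaces are hypothetical.
    """

    def __init__(self, seg_encoder, mask_decoder, llm, token_dim, llm_dim):
        super().__init__()
        self.seg_encoder = seg_encoder      # universal segmentation backbone (frozen)
        self.mask_decoder = mask_decoder    # turns query embeddings into pixel masks
        self.llm = llm                      # causal language model
        self.to_llm = nn.Linear(token_dim, llm_dim)    # project visual tokens into LLM space
        self.from_llm = nn.Linear(llm_dim, token_dim)  # project [SEG]-style states back out

    def forward(self, image, text_ids, visual_prompt=None):
        # 1) Encode the image (plus an optional point/box/mask prompt) into
        #    pixel-centric and object-centric visual tokens.
        pixel_tokens, object_tokens = self.seg_encoder(image, visual_prompt)
        visual_tokens = self.to_llm(torch.cat([pixel_tokens, object_tokens], dim=1))

        # 2) The LLM consumes visual tokens + text instruction and produces a
        #    text response plus hidden states for special segmentation tokens.
        text_out, seg_hidden = self.llm(visual_tokens, text_ids)

        # 3) The segmentation-token states are decoded into pixel-level masks.
        masks = self.mask_decoder(self.from_llm(seg_hidden), pixel_tokens)
        return text_out, masks
```

The design choice the sketch highlights is the one the paper emphasizes: a single encoder feeds a single LLM, and the LLM's outputs drive a single decoder, rather than having the LLM orchestrate several specialist models.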

Why it matters?

This research is important because it creates a more powerful tool for combining visual and textual information in one model. By achieving image-level, object-level, and pixel-level reasoning in a single framework, OMG-LLaVA can perform better on various tasks compared to specialized models. This advancement has the potential to improve applications in areas like computer vision, interactive AI systems, and any field where understanding both text and images is crucial.

Abstract

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.
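Below is a rough sketch of the "perception prior embedding" idea described in the abstract: injecting the segmentation decoder's object predictions back into the pixel features before they are handed to the LLM. The tensor shapes, the softmax-based soft assignment, and the additive fusion are assumptions for illustration; the paper's exact fusion may differ.

```python
import torch


def perception_prior_embedding(pixel_feats, object_queries, mask_logits):
    """Hypothetical sketch of fusing perception priors with image features.

    pixel_feats:    (B, HW, C)  per-pixel image features
    object_queries: (B, Q, C)   object query embeddings from the seg decoder
    mask_logits:    (B, Q, HW)  predicted mask logits linking queries to pixels
    """
    # Soft assignment of each pixel to the object queries.
    assign = mask_logits.softmax(dim=1)                           # (B, Q, HW)

    # For every pixel, take the assignment-weighted mixture of object queries.
    prior = torch.einsum('bqh,bqc->bhc', assign, object_queries)  # (B, HW, C)

    # Fuse the perception prior with the original pixel features.
    return pixel_feats + prior
```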