Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin

2024-06-13

Summary

This paper introduces SliME, a new approach for improving high-resolution Large Multimodal Models (LMMs), which are AI systems that can reason over both images and text. The authors focus on optimizing how these models handle visual information to improve performance without requiring excessive computational resources.

What's the problem?

Current methods for increasing image resolution in LMMs often lead to high computational costs because they slice the image into many local patches that all need to be processed. This can make the models slow and inefficient. Additionally, when too much weight is placed on local details, the overall context of the image, which is crucial for understanding and reasoning about visual content, can be diluted.

What's the solution?

The authors propose a new framework that uses a mixture of adapters to extract important contextual information from the global view of the image, while introducing learnable query embeddings to reduce the number of local image tokens that must be processed. A similarity-based selector then keeps only the local tokens most relevant to the user's question, so instead of passing along a large number of local tokens, the model focuses on fewer but more informative ones while preserving the overall context. The authors also find that training the global and local components end to end at the same time is suboptimal, so they adopt an alternating training strategy that balances learning between the global and local aspects. Finally, to support this method, they created a challenging dataset with high demands on image detail, which strengthens the training of the local compression layer. A rough sketch of the main architectural pieces follows below.
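To make these components concrete, here is a minimal PyTorch sketch of (a) a mixture-of-adapters block for the global view and (b) a learnable-query compressor for the local patch tokens. The class names, dimensions, and hyperparameters (GlobalAdapterMixture, LocalTokenCompressor, num_queries, and so on) are illustrative assumptions, not the authors' SliME implementation.

```python
import torch
import torch.nn as nn

class GlobalAdapterMixture(nn.Module):
    """Hypothetical mixture of adapters: several small adapters process the global
    image features and a learned gate mixes their outputs per token."""
    def __init__(self, dim: int, num_adapters: int = 3, hidden: int = 512):
        super().__init__()
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_adapters)
        ])
        self.gate = nn.Linear(dim, num_adapters)

    def forward(self, global_tokens: torch.Tensor) -> torch.Tensor:
        # global_tokens: (batch, n_tokens, dim)
        weights = self.gate(global_tokens).softmax(dim=-1)                    # (B, N, A)
        outputs = torch.stack([a(global_tokens) for a in self.adapters], -1)  # (B, N, D, A)
        mixed = (outputs * weights.unsqueeze(2)).sum(dim=-1)                  # (B, N, D)
        return global_tokens + mixed                                          # residual connection

class LocalTokenCompressor(nn.Module):
    """Hypothetical compressor: a small set of learnable queries cross-attends to the
    many local-patch tokens, so only num_queries tokens reach the language model."""
    def __init__(self, dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, local_tokens: torch.Tensor) -> torch.Tensor:
        # local_tokens: (batch, many_tokens, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(local_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, local_tokens, local_tokens)
        return compressed
```

In this sketch, the compressed local tokens would then be filtered further by a question-aware selector (see the sketch after the abstract) before being combined with the global tokens and fed to the language model.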

Why it matters?

This research is significant because it shows that using fewer but more relevant image tokens can improve the performance of multimodal models, making them more efficient and effective. By addressing the challenges of high-resolution image processing, this work could lead to advancements in various applications such as computer vision, robotics, and AI-driven creative tools.

Abstract

Seeing clearly with high resolution is a foundation of Large Multimodal Models (LMMs), which has been proven to be vital for visual perception and reasoning. Existing works usually employ a straightforward resolution upscaling method, where the image consists of global and local branches, with the latter being the sliced image patches but resized to the same resolution as the former. This means that higher resolution requires more local patches, resulting in exorbitant computational expenses, and meanwhile, the dominance of local image tokens may diminish the global context. In this paper, we dive into the problems and propose a new framework as well as an elaborate optimization strategy. Specifically, we extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. With regard to local patches, learnable query embeddings are introduced to reduce image tokens; the most important tokens, accounting for the user question, will be further selected by a similarity-based selector. Our empirical results demonstrate a 'less is more' pattern, where utilizing fewer but more informative local image tokens leads to improved performance. Besides, a significant challenge lies in the training strategy, as simultaneous end-to-end training of the global mining block and local compression block does not yield optimal results. We thus advocate for an alternating training way, ensuring balanced learning between global and local aspects. Finally, we also introduce a challenging dataset with high requirements for image detail, enhancing the training of the local compression layer. The proposed method, termed LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME), achieves leading performance across various benchmarks with only 2 million training data.
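For a concrete picture of the similarity-based token selection and the alternating optimization described in the abstract, here is a minimal PyTorch sketch. The function and parameter names (select_question_relevant_tokens, keep_ratio, alternating_step) are illustrative assumptions, not the authors' SliME code.

```python
import torch
import torch.nn.functional as F

def select_question_relevant_tokens(local_tokens: torch.Tensor,
                                    question_embedding: torch.Tensor,
                                    keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical similarity-based selector: score each compressed local token by
    cosine similarity to a pooled question embedding and keep the top fraction.

    local_tokens:       (batch, n_tokens, dim)
    question_embedding: (batch, dim)
    """
    scores = F.cosine_similarity(local_tokens, question_embedding.unsqueeze(1), dim=-1)  # (B, N)
    k = max(1, int(local_tokens.size(1) * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices                      # (B, k)
    batch_idx = torch.arange(local_tokens.size(0)).unsqueeze(1)  # (B, 1), broadcasts with top_idx
    return local_tokens[batch_idx, top_idx]                      # (B, k, dim)

def alternating_step(epoch: int,
                     global_block: torch.nn.Module,
                     local_block: torch.nn.Module) -> None:
    """Hypothetical alternating schedule: freeze one block while the other is updated,
    switching every epoch, instead of training both end to end simultaneously."""
    train_global = (epoch % 2 == 0)
    for p in global_block.parameters():
        p.requires_grad = train_global
    for p in local_block.parameters():
        p.requires_grad = not train_global
```

The selector keeps only the question-relevant fraction of the already-compressed local tokens, matching the paper's "less is more" observation; the alternating schedule mirrors the paper's finding that jointly training the global mining block and the local compression block end to end is suboptimal.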