LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

2024-12-19

Summary

This paper introduces LLaVA-UHD v2, a multimodal large language model (MLLM) that improves how AI understands and processes high-resolution images by using a technique called a Hierarchical Window (Hiwin) Transformer.

What's the problem?

Existing MLLMs use vision transformers (ViTs) to encode images, but they often fall short on general multimodal tasks because the features they produce miss the different levels of visual detail an image contains. Without that multi-level information, it is hard for the model to generate accurate descriptions or answers grounded in the visual input.

What's the solution?

To solve this problem, the authors introduce LLaVA-UHD v2, which uses a Hierarchical Window (Hiwin) Transformer as its vision-language projector to build and integrate a high-resolution feature pyramid. The method has two main parts: an inverse feature pyramid, which restores fine image detail by up-sampling ViT features with high-frequency information from an image pyramid, and hierarchical window attention, which condenses the multi-level feature maps by attending to key samples within cross-scale windows (a simplified code sketch follows below). With this design, LLaVA-UHD v2 outperforms previous models on popular benchmarks, improving over its baseline by 3.7% on average across 14 benchmarks, including a 9.3% gain on DocVQA.
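To make the second part more concrete, here is a minimal PyTorch sketch of window-based cross-scale condensation: each window of a fixed output grid has a learned query that attends to a few pooled key samples drawn from every pyramid level inside that window. The module name, shapes, pooling rule, and layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowCondenser(nn.Module):
    """Toy sketch (not the paper's code): condense a multi-level feature
    pyramid into one token per spatial window, where every window's learned
    query cross-attends to a few pooled samples from each pyramid level."""

    def __init__(self, dim=64, grid=8, num_heads=4):
        super().__init__()
        self.grid = grid  # the output is a grid x grid map of visual tokens
        self.query = nn.Parameter(torch.randn(1, grid * grid, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pyramid):
        # pyramid: list of (B, C, H_l, W_l) feature maps at different scales
        B, C = pyramid[0].shape[:2]
        g = self.grid
        keys = []
        for feat in pyramid:
            # resample each level so every window holds a 2x2 patch of it
            pooled = F.adaptive_avg_pool2d(feat, 2 * g)       # (B, C, 2g, 2g)
            win = pooled.unfold(2, 2, 2).unfold(3, 2, 2)      # (B, C, g, g, 2, 2)
            win = win.reshape(B, C, g * g, 4)                 # 4 samples per window
            keys.append(win.permute(0, 2, 3, 1))              # (B, g*g, 4, C)
        kv = torch.cat(keys, dim=2)                           # samples from all scales
        kv = kv.reshape(B * g * g, -1, C)
        q = self.query.expand(B, -1, -1).reshape(B * g * g, 1, C)
        out, _ = self.attn(q, kv, kv)                         # window-local attention
        return out.reshape(B, g * g, C)                       # g*g visual tokens

# usage: three toy pyramid levels of 64-dim features
pyramid = [torch.randn(2, 64, s, s) for s in (32, 16, 8)]
tokens = WindowCondenser()(pyramid)
print(tokens.shape)  # torch.Size([2, 64, 64])
```

Roughly, the appeal of this kind of design is that the language model receives a fixed, modest number of visual tokens (one per window), while each token still summarizes information from every resolution level.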

Why it matters?

This research is important because it enhances the ability of AI models to understand and generate responses based on complex visual information. By improving how these models process high-resolution images, LLaVA-UHD v2 can lead to better applications in areas like image recognition, automated descriptions, and more effective human-computer interactions.

Abstract

In multimodal large language models (MLLMs), vision transformers (ViTs) are widely employed for visual encoding. However, their performance in solving universal MLLM tasks is not satisfactory. We attribute this to a lack of information from diverse visual levels, impeding alignment with the varied semantic granularity required for language generation. To address this issue, we present LLaVA-UHD v2, an advanced MLLM centered around a Hierarchical window (Hiwin) transformer that captures diverse visual granularity by constructing and integrating a high-resolution feature pyramid. As a vision-language projector, the Hiwin transformer comprises two primary modules: (i) an inverse feature pyramid, constructed by a ViT-derived feature up-sampling process utilizing high-frequency details from an image pyramid, and (ii) hierarchical window attention, focusing on a set of key sampling features within cross-scale windows to condense multi-level feature maps. Extensive experiments demonstrate that LLaVA-UHD v2 achieves superior performance over existing MLLMs on popular benchmarks. Notably, our design brings an average boost of 3.7% across 14 benchmarks compared with the baseline method, for instance 9.3% on DocVQA. We make all the data, model checkpoints, and code publicly available to facilitate future research.
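To complement the abstract, below is a rough sketch of one "inverse feature pyramid" up-sampling step: a coarse ViT feature map is doubled in resolution and refined with a high-frequency residual extracted from a finer level of the image pyramid. The fusion rule, layer choices, and shapes are assumptions for illustration only, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InversePyramidLevel(nn.Module):
    """Illustrative sketch (not the paper's code) of one up-sampling step:
    coarse ViT-derived features are upsampled and refined with high-frequency
    detail taken from a finer level of an image pyramid."""

    def __init__(self, feat_dim=64):
        super().__init__()
        # encode the high-frequency image residual into feature space
        self.detail_enc = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, feat, image_hi):
        # feat:     (B, C, H, W)  coarse feature map
        # image_hi: (B, 3, H', W') the finer image-pyramid level
        B, C, H, W = feat.shape
        up = F.interpolate(feat, scale_factor=2, mode="bilinear",
                           align_corners=False)              # (B, C, 2H, 2W)
        # high-frequency detail: image minus its blurred (down-up) version
        blur = F.interpolate(F.avg_pool2d(image_hi, 2), scale_factor=2,
                             mode="bilinear", align_corners=False)
        detail = self.detail_enc(image_hi - blur)
        detail = F.adaptive_avg_pool2d(detail, (2 * H, 2 * W))
        return self.fuse(torch.cat([up, detail], dim=1))      # refined level

# usage: refine a 16x16 feature map with a 448x448 image level
level = InversePyramidLevel()
feat = torch.randn(1, 64, 16, 16)
img = torch.randn(1, 3, 448, 448)
print(level(feat, img).shape)  # torch.Size([1, 64, 32, 32])
```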