Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
2024-09-19

Summary
This paper introduces Qwen2-VL, an upgraded vision-language model that improves how computers understand and process images and videos at any resolution.
What's the problem?
Previous vision-language models were constrained to a fixed, predetermined input resolution, typically resizing every image to the same shape. This made it harder for them to accurately interpret images and videos that vary widely in size, aspect ratio, and quality, which matters for real-world applications where visual inputs come in many forms.
What's the solution?
Qwen2-VL introduces a mechanism called Naive Dynamic Resolution, which lets it process images at their native sizes by converting each one into a variable number of visual tokens (sketched below), more closely matching how humans perceive scenes at different scales. It also incorporates Multimodal Rotary Position Embedding (M-RoPE) to encode positional information jointly across text, images, and videos. The model is available in several sizes (2B, 8B, and 72B parameters) and shows strong performance on both image and video understanding.
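To make the dynamic-resolution idea concrete, here is a minimal Python sketch of how an image's size could map to a variable number of visual tokens, assuming the 14x14 ViT patch size and 2x2 patch merging described in the paper; the rounding rules and any overall token budget used by the actual model are simplified here.

```python
# Minimal sketch: variable visual-token counts under Naive Dynamic Resolution.
# Assumes 14x14 patches and 2x2 patch merging (per the paper); the real model's
# exact rounding and token-budget capping are simplified away.

def visual_token_count(height: int, width: int,
                       patch_size: int = 14,
                       merge_size: int = 2) -> int:
    """Approximate number of visual tokens for an image of the given size."""
    # Split each side into patches (rounding up), then merge 2x2 patch groups
    # into single visual tokens.
    patches_h = -(-height // patch_size)      # ceil(height / patch_size)
    patches_w = -(-width // patch_size)
    merged_h = -(-patches_h // merge_size)    # ceil again after 2x2 merging
    merged_w = -(-patches_w // merge_size)
    return merged_h * merged_w

if __name__ == "__main__":
    # Different resolutions yield different token counts instead of a fixed number.
    for h, w in [(224, 224), (448, 672), (1080, 1920)]:
        print(f"{h}x{w} image -> {visual_token_count(h, w)} visual tokens")
```

The point of the sketch is simply that the token count grows with the input resolution rather than being fixed in advance, which is what lets the model keep detail for large images and stay cheap for small ones.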
Why it matters?
This research is significant because it strengthens AI's ability to understand visual content, making it more useful for applications such as video analysis, agents that act on visual inputs, and content creation. By handling many kinds of media more faithfully, Qwen2-VL can lead to better user experiences in technology that relies on visual data.
Abstract
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude-3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
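As a rough illustration of the M-RoPE idea mentioned in the abstract, the Python sketch below assigns every token a (temporal, height, width) position triple: text tokens reuse the same id on all three axes (so plain text reduces to ordinary 1D RoPE), while image and video tokens spread their ids over frame, row, and column indices. The function name and the input format are hypothetical conveniences for this sketch, not the model's actual API.

```python
# Illustrative sketch of M-RoPE position indexing: each token gets a
# (temporal, height, width) id triple. Input format and offsets are
# assumptions for illustration, not Qwen2-VL's actual implementation.

def mrope_position_ids(segments):
    """segments: list of dicts, e.g.
       {"type": "text", "length": 5}
       {"type": "image", "grid_h": 2, "grid_w": 3}
       {"type": "video", "frames": 2, "grid_h": 2, "grid_w": 2}
    Returns one (t, h, w) triple per token."""
    ids, next_pos = [], 0
    for seg in segments:
        if seg["type"] == "text":
            # Text tokens: identical ids on all three axes -> equivalent to 1D RoPE.
            for _ in range(seg["length"]):
                ids.append((next_pos, next_pos, next_pos))
                next_pos += 1
        else:
            frames = seg.get("frames", 1)  # treat a static image as a single frame
            for t in range(frames):
                for h in range(seg["grid_h"]):
                    for w in range(seg["grid_w"]):
                        # Temporal id varies per frame; height/width ids follow
                        # the patch grid, all offset by the segment's start position.
                        ids.append((next_pos + t, next_pos + h, next_pos + w))
            # Later tokens continue after the largest id used by this segment.
            next_pos += max(frames, seg["grid_h"], seg["grid_w"])
    return ids

if __name__ == "__main__":
    example = [{"type": "text", "length": 3},
               {"type": "image", "grid_h": 2, "grid_w": 3},
               {"type": "text", "length": 2}]
    for triple in mrope_position_ids(example):
        print(triple)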