
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai

2025-01-16


Summary

This paper introduces a new way to make computers understand images better and faster, called Parameter-Inverted Image Pyramid Networks (PIIP). The idea is to let a computer look at pictures the way humans do, focusing on different details at different zoom levels.

What's the problem?

Current methods for making computers understand images use something called image pyramids. These are like looking at a picture zoomed in and zoomed out at the same time. The problem is that these methods run the same big, complicated model on every zoom level, which takes a lot of computing power and time.

What's the solution?

The researchers created PIIP, which is smarter about how it looks at images. Instead of using one big model for every version of the image, it uses smaller, simpler networks for the high-resolution (zoomed-in) versions and bigger networks for the low-resolution (zoomed-out) versions. They also added a way for these different branches to share information with each other. They tested PIIP on many different tasks, like finding objects in pictures and understanding images and text together.
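To make the idea concrete, here is a minimal PyTorch sketch of a parameter-inverted pyramid. This is not the authors' implementation: the branch sizes, embedding dimensions, and the simple sum-based fusion are illustrative assumptions, whereas the real PIIP builds its branches from pretrained ViTs or CNNs and uses a dedicated cross-branch feature interaction mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViTBranch(nn.Module):
    """Stand-in for a pretrained ViT branch: patch embedding + a few transformer layers."""

    def __init__(self, patch_size, dim, depth):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        return self.encoder(tokens)


class ParameterInvertedPyramid(nn.Module):
    """Parameter-inverted pairing: the HIGH-resolution input goes through the
    SMALLEST branch, and the LOW-resolution input through the LARGEST branch."""

    def __init__(self):
        super().__init__()
        # Bigger input image -> smaller/shallower branch (dims and depths are made up).
        self.branch_hi = TinyViTBranch(patch_size=16, dim=96, depth=2)    # 448x448 input
        self.branch_mid = TinyViTBranch(patch_size=16, dim=192, depth=4)  # 224x224 input
        self.branch_lo = TinyViTBranch(patch_size=16, dim=384, depth=6)   # 112x112 input
        # Project every branch into a shared dimension so the features can be fused.
        self.proj = nn.ModuleList([nn.Linear(d, 256) for d in (96, 192, 384)])

    def forward(self, image):
        # Build the image pyramid from a single input image.
        hi = F.interpolate(image, size=448, mode="bilinear", align_corners=False)
        mid = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
        lo = F.interpolate(image, size=112, mode="bilinear", align_corners=False)

        feats = [self.branch_hi(hi), self.branch_mid(mid), self.branch_lo(lo)]
        feats = [p(f).mean(dim=1) for p, f in zip(self.proj, feats)]  # pooled per branch

        # Crude stand-in for cross-branch interaction: sum the aligned branch features.
        return torch.stack(feats, dim=0).sum(dim=0)


if __name__ == "__main__":
    model = ParameterInvertedPyramid()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 256])
```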

Why it matters?

This matters because it makes computers better at understanding images without needing as much power. This could help make things like self-driving cars, security cameras, and even AI assistants that can see and understand what's around them work better and faster. It's also important for making AI that can understand both pictures and words together, which could lead to better virtual assistants or help computers understand and describe what's in photos more accurately.

Abstract

Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.
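To see why pairing high-resolution images with small branches keeps the cost down, here is a rough back-of-the-envelope comparison. The numbers and model sizes are illustrative assumptions, not figures from the paper; it only approximates how transformer compute scales with token count and model width.

```python
# Illustrative FLOPs estimate for one ViT-style model on one image.
# Rough per-layer cost: ~12 * n * d^2 for projections/MLP plus ~2 * n^2 * d for attention.
def approx_gflops(img_size, patch=16, dim=768, depth=12):
    n = (img_size // patch) ** 2                  # number of tokens
    per_layer = 12 * n * dim**2 + 2 * n**2 * dim  # very rough, for intuition only
    return depth * per_layer / 1e9


# Conventional image pyramid: the same large model processes every resolution.
same_model = sum(approx_gflops(s) for s in (112, 224, 448))

# Parameter-inverted pyramid: larger images go through smaller (narrower) models.
inverted = (approx_gflops(448, dim=192)    # small branch, high resolution
            + approx_gflops(224, dim=384)  # medium branch, medium resolution
            + approx_gflops(112, dim=768))  # large branch, low resolution

print(f"same large model at every scale: ~{same_model:.0f} GFLOPs")
print(f"parameter-inverted pyramid:      ~{inverted:.0f} GFLOPs")
```

Under these made-up settings the inverted pairing comes out several times cheaper, which is the intuition behind the abstract's claim of matching or beating the baseline with only 40%-60% of the original computation.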