
The Collapse of Patches

Wei Guo, Shunqi Mao, Zhuonan Liang, Heng Wang, Weidong Cai

2025-12-01


Summary

This paper explores a surprising idea about how images are perceived: observing certain parts of an image reduces the uncertainty about other parts, much as measuring a quantum particle collapses its wave function. The authors call this 'patch collapse' and use it to improve how computers understand and generate images.

What's the problem?

Current methods for computers to understand images often process every single part of the image equally, which is inefficient. The paper points out that not all parts of an image are equally important for understanding what's going on. There's a hidden order to how different areas influence our perception, and existing methods don't take advantage of this.

What's the solution?

The researchers developed a way to figure out which parts of an image are most crucial for understanding other parts. They trained an autoencoder that learns, for each target patch, which other patches it depends on for reconstruction. Treating these dependencies as a graph and computing each patch's PageRank score yields a ranking: the order in which patches 'collapse', or become clear. They then tested whether presenting patches to vision models in this order improves performance, applying it both to image generation (by retraining the autoregressive model MAR) and to image classification.
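The ranking step described above can be sketched with a small PageRank power iteration over a patch-dependency matrix. This is a simplified illustration, not the paper's actual implementation: the matrix layout (`dep[i, j]` = learned weight of patch `j` when reconstructing patch `i`) and all function names are assumptions; in the paper these weights come from the learned autoencoder.

```python
import numpy as np

def pagerank(dep, damping=0.85, iters=100):
    """Power-iteration PageRank over a patch-dependency graph.

    dep[i, j]: weight of patch j when reconstructing patch i
    (hypothetical layout; the paper derives such weights from its
    soft-selection autoencoder). Edge i -> j means "i relies on j",
    so heavily relied-on patches accumulate high rank.
    """
    n = dep.shape[0]
    # Row-normalize outgoing weights, then transpose to get the
    # column-stochastic transition matrix used by PageRank.
    trans = (dep / dep.sum(axis=1, keepdims=True)).T
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * trans @ rank
    return rank

# Toy 4-patch example: patch 0 is relied on by every other patch,
# so it should collapse first (highest rank).
dep = np.array([
    [0.0, 0.2, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.8, 0.1, 0.0, 0.2],
    [0.7, 0.1, 0.1, 0.0],
])
order = np.argsort(-pagerank(dep))  # most relied-on patches first
```

Here `order` is the collapse order: patches that many others depend on come first, which is exactly the order in which the paper proposes to reveal an image.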

Why it matters?

This work is important because it suggests a more efficient way for computers to 'see' images. By focusing on the most important parts first, models can achieve similar or better results with significantly less information. This could lead to faster and less resource-intensive image processing, which is crucial for applications like self-driving cars, medical imaging, and more.

Abstract

Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle's wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region's collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Graphing these learned dependencies and computing each patch's PageRank score reveals the optimal patch order to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at https://github.com/wguo-ai/CoP .
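The classification setup in the abstract, showing a Vision Transformer only the top-ranked ~22% of patches, can be sketched as a simple subset selection over patch tokens. Everything here is illustrative: the function name, the 14x14 / 768-dim ViT-Base token shapes, and the random stand-in for the learned collapse order are assumptions, not the paper's code.

```python
import numpy as np

def select_high_rank_patches(patch_tokens, collapse_order, keep_frac=0.22):
    """Keep only the most relied-on patches in the collapse order.

    patch_tokens: (num_patches, dim) array of patch embeddings
    collapse_order: patch indices sorted from most to least relied-on
    keep_frac: fraction of patches exposed to the classifier
               (22% suffices for high accuracy, per the abstract)
    """
    k = max(1, int(round(len(collapse_order) * keep_frac)))
    # Restore spatial order so positional embeddings still line up.
    keep = np.sort(collapse_order[:k])
    return patch_tokens[keep], keep

tokens = np.random.randn(196, 768)    # 14x14 patches at ViT-Base width (assumed)
order = np.random.permutation(196)    # stand-in for a learned collapse order
subset, kept_idx = select_high_rank_patches(tokens, order)
```

With 196 patches, `keep_frac=0.22` keeps 43 tokens; the remaining 78% of the image is never shown to the model, which is where the efficiency gain comes from.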