An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen

2024-06-14

Summary

This paper presents a finding that challenges a common assumption in computer vision: that models must process images as small local patches, such as 16x16 blocks. Instead, it shows that treating each individual pixel as a token can lead to strong performance across several tasks.

What's the problem?

In computer vision, many models, especially Vision Transformers, analyze images by breaking them into small patches. This design bakes in a locality bias inherited from convolutional networks: the assumption that nearby pixels must be grouped together for a model to understand an image. However, it is unclear whether this assumption is actually necessary, and building it in may unnecessarily constrain how these models are designed and trained.

What's the solution?

The authors demonstrate that vanilla Transformers can treat each individual pixel as a separate token instead of relying on patches. They tested this approach across three well-studied tasks: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although operating on individual pixels is far less computationally efficient, their findings show that this approach can yield better results than traditional patch-based methods.
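To make the contrast concrete, here is a minimal PyTorch sketch (not the authors' code) of the two tokenization schemes: a standard ViT-style 16x16 patch embedding versus a pixels-as-tokens setup, where each pixel is linearly projected into its own token. The tiny 32x32 image, the embedding width, and the encoder sizes are illustrative assumptions.

```python
# Minimal sketch contrasting patch tokens with pixel tokens (illustrative only).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)

# Patch tokenization, as in a standard Vision Transformer (ViT):
# each 16x16 patch becomes one token, so a 32x32 image yields only 4 tokens.
patch_embed = nn.Conv2d(3, 192, kernel_size=16, stride=16)
patch_tokens = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 4, 192)

# Pixel tokenization, as explored in the paper:
# every pixel becomes its own token, so the same image yields 32*32 = 1024 tokens.
pixel_embed = nn.Linear(3, 192)
pixels = image.flatten(2).transpose(1, 2)   # (1, 1024, 3)
pixel_tokens = pixel_embed(pixels)          # (1, 1024, 192)

# Either token sequence can be fed to a vanilla Transformer encoder;
# the pixel version simply has a much longer sequence (and higher compute cost).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True),
    num_layers=2,
)
out_patches = encoder(patch_tokens)  # (1, 4, 192)
out_pixels = encoder(pixel_tokens)   # (1, 1024, 192)
```

The key design difference is that the pixel version removes the locality assumption entirely (no pixels are grouped before the Transformer sees them), at the cost of a sequence that is 256 times longer for the same image.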

Why it matters?

This research is significant because it opens up new possibilities for designing AI models in computer vision. By showing that treating pixels individually can be effective, it encourages researchers to rethink how they build and train these models, potentially leading to advancements in image analysis and generation technologies.

Abstract

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.