Adaptive Length Image Tokenization via Recurrent Allocation
Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman
2024-11-06

Summary
This paper introduces a new method called Adaptive Length Image Tokenization (ALIT) that lets vision models represent images with a variable number of tokens, allocating more or fewer tokens depending on each image's complexity.
What's the problem?
Most current vision systems use fixed-length representations for images, treating every image the same regardless of how much information it contains. This is inefficient: simple images do not need many tokens, while complex images may need far more capacity to be represented faithfully.
What's the solution?
The researchers developed ALIT, which uses an encoder-decoder setup to convert 2D images into a flexible number of 1D latent tokens. The model processes the image over multiple recurrent iterations, refining the 2D image tokens and the existing 1D latent tokens at each step and appending new latent tokens when more capacity is needed. In this way the token count adapts to the image's complexity, ranging from 32 to 256 tokens.
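The recurrent allocation loop can be illustrated with a minimal NumPy sketch. This is a hypothetical stand-in, not the paper's implementation: the transformer rollout is replaced by simple mean-pooled updates, and the function names (`alit_rollout`, `step`) are invented for illustration. What it does capture is the control flow: refine, update, then grow the 1D latent set by a fixed step each iteration until the maximum budget is reached.

```python
import numpy as np

def alit_rollout(image_tokens, step=32, max_tokens=256, seed=0):
    """Hypothetical sketch of ALIT-style recurrent token allocation.

    Each iteration refines the 2D image tokens, updates the existing
    1D latent tokens, and appends `step` fresh latents, growing the
    representation from `step` up to `max_tokens` tokens.
    """
    rng = np.random.default_rng(seed)
    d = image_tokens.shape[-1]
    latents = rng.standard_normal((step, d)) * 0.02  # initial 1D latents
    snapshots = []
    while True:
        # Stand-in for one transformer rollout: cross-attention is
        # replaced by mean-pooled context mixing (illustration only).
        ctx = image_tokens.mean(axis=0, keepdims=True)
        latents = latents + ctx                       # update existing latents
        image_tokens = image_tokens + latents.mean(axis=0, keepdims=True)
        snapshots.append(latents.copy())              # one variable-length code per iteration
        if latents.shape[0] >= max_tokens:
            break
        new = rng.standard_normal((step, d)) * 0.02   # allocate fresh capacity
        latents = np.concatenate([latents, new], axis=0)
    return snapshots

# Example: a 16x16 grid of image tokens with feature dim 8.
img = np.zeros((256, 8))
snaps = alit_rollout(img)
# Snapshot token counts grow as 32, 64, ..., 256.
```

Because every iteration yields a usable latent code, a downstream user can stop early for easy images and take the full 256-token code for hard ones.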
Why it matters?
This advancement is significant because it improves how machines represent images. By using a variable number of tokens, ALIT can improve image compression and make processing more efficient, which matters for downstream vision applications where both accuracy and speed are essential.
Abstract
Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.
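The abstract's claim that token count aligns with image entropy suggests a simple inference-time selection rule: decode at each budget and keep the smallest one whose reconstruction is good enough. The sketch below is an assumption-laden illustration of that idea, not the paper's procedure; `toy_codec` is an invented stand-in (top-k coefficient keeping) for the real tokenizer, and the tolerance `tol` is arbitrary.

```python
import numpy as np

def tokens_needed(image, encode_decode, budgets=(32, 64, 128, 256), tol=0.01):
    """Pick the smallest token budget whose reconstruction MSE is below
    `tol` (hypothetical selection rule; the paper reports that token
    count tracks image entropy and familiarity)."""
    for n in budgets:
        recon = encode_decode(image, n)
        if float(np.mean((image - recon) ** 2)) <= tol:
            return n
    return budgets[-1]

def toy_codec(image, n):
    """Stand-in codec: keep the n largest-magnitude entries, zero the rest."""
    flat = image.ravel().copy()
    if n < flat.size:
        drop = np.argsort(np.abs(flat))[:-n]  # indices of the smallest entries
        flat[drop] = 0.0
    return flat.reshape(image.shape)

simple = np.zeros((16, 16)); simple[0, 0] = 1.0          # low-entropy image
complex_img = np.random.default_rng(0).standard_normal((16, 16))  # high-entropy
# The simple image needs fewer "tokens" than the dense random one.
assert tokens_needed(simple, toy_codec) < tokens_needed(complex_img, toy_codec)
```

The design choice mirrors the abstract: representational capacity is spent where the image demands it, rather than fixed in advance.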