"Principal Components" Enable A New Language of Images

Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi

2025-03-12

"Principal Components" Enable A New Language of Images

Summary

This paper introduces a new way for AI to represent images: as an ordered sequence of parts, starting with the most important features and adding finer details later, much like describing the main subjects of a picture before its background.

What's the problem?

Current AI systems that turn images into code-like tokens focus mainly on making the rebuilt image look perfect. They ignore how humans naturally notice big shapes first and details later, which makes the resulting representations less efficient and harder to interpret.

What's the solution?

The new method organizes image tokens as a ranked list: the first tokens capture the main objects and shapes, and each later token adds finer details. It also uses a special diffusion-based decoder to separate big-picture meaning from tiny visual details, so the model represents the dog before its fur pattern.
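The ranked-list idea mirrors classical principal component analysis, which the paper uses as its analogy. A minimal numpy sketch (with toy random data standing in for real image patches, not the paper's tokenizer) shows the two properties the paper guarantees: each successive component explains less variance than the last, and keeping more leading components steadily lowers reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for flattened image patches: 200 samples, 64 dimensions.
X = rng.normal(size=(200, 64)) @ rng.normal(size=(64, 64))
X = X - X.mean(axis=0)  # PCA assumes centered data

# PCA via SVD: components come out ordered by explained variance.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Each successive "token" (component) explains no more variance than
# the previous one, mirroring the paper's guarantee for its 1D sequence.
assert np.all(np.diff(explained) <= 1e-12)

def recon_error(k):
    """Reconstruction error when only the first k components are kept."""
    Xk = (X @ Vt[:k].T) @ Vt[:k]
    return np.linalg.norm(X - Xk)

# Error shrinks monotonically as more leading components are kept.
errors = [recon_error(k) for k in (1, 4, 16, 64)]
assert all(a >= b for a, b in zip(errors, errors[1:]))
```

The paper's tokenizer enforces this same coarse-to-fine ordering on learned causal tokens rather than linear components, so truncating the sequence early still yields a faithful coarse reconstruction.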

Why it matters?

This helps AI represent images more the way humans see them: pictures can be processed with fewer tokens, and it becomes easier to explain why the AI makes certain decisions, which is crucial for fields like medical imaging, self-driving cars, and art tools.

Abstract

We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space -- a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.
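The "semantic-spectrum coupling" the abstract describes is an entanglement of high-level content with low-level frequency detail. The paper resolves it with a diffusion decoder; as a loose intuition only (not the paper's method), a frequency split shows how an image decomposes into a coarse low-frequency band and a fine high-frequency band that are complementary:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.normal(size=(32, 32))  # toy stand-in for a grayscale image

# Split the image into low-frequency (coarse structure) and
# high-frequency (fine spectral detail) parts via the 2D FFT.
F = np.fft.fft2(img)
mask = np.zeros_like(F, dtype=bool)
# Keep the lowest frequencies in each corner of the (unshifted) spectrum.
mask[:8, :8] = mask[:8, -8:] = mask[-8:, :8] = mask[-8:, -8:] = True

low = np.fft.ifft2(np.where(mask, F, 0)).real   # coarse band
high = np.fft.ifft2(np.where(mask, 0, F)).real  # fine-detail band

# The two bands are complementary: together they rebuild the original.
assert np.allclose(low + high, img)
```

In the paper's framework, the goal is for early tokens to carry semantic content without being forced to also encode this kind of spectral detail, which the diffusion decoder can fill in instead.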