What matters for Representation Alignment: Global Information or Spatial Structure?
Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie
2025-12-16
Summary
This paper investigates how best to guide image generation using a technique called representation alignment, which teaches a generative model to match its internal features to those of a pre-trained image analyzer.
What's the problem?
Researchers generally believed that the better a pre-trained image analyzer is at *identifying* objects (its overall accuracy), the better the generated images would be when using that analyzer as a guide. This paper questions that assumption, asking whether it's the analyzer's ability to understand *what* is in the image, or its ability to understand *where* things are in the image (spatial structure) that truly matters for good image generation.
What's the solution?
The researchers ran a large experiment spanning 27 different image analyzers and found that spatial structure, not overall recognition accuracy, is what matters most for generating good images. To exploit this, they made two simple changes to the representation alignment technique: they replaced the standard MLP projection layer with a convolutional layer, and they added a normalization step that specifically emphasizes spatial relationships. They call this improved method iREPA, and it consistently sped up training and improved results.
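As a rough illustration of the second change, spatial normalization of the target features could look like the sketch below. This is an assumption for illustration only, not the paper's implementation: the function name is invented, and the exact normalization iREPA uses may differ. The idea shown here is that unit-normalizing each patch token makes the alignment target depend on the directions of the tokens (their relative spatial structure) rather than their magnitudes.

```python
import math

def spatial_normalize(tokens):
    """L2-normalize each patch token of an external representation.

    Hypothetical sketch (not the paper's code): `tokens` is a list of
    patch embeddings, each a list of floats. After normalization, every
    token has unit length, so alignment against these targets is driven
    by the pairwise cosine structure of the patches, not their scales.
    """
    out = []
    for t in tokens:
        norm = math.sqrt(sum(x * x for x in t)) or 1.0  # guard zero vectors
        out.append([x / norm for x in t])
    return out
```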
Why it matters?
This work challenges the common understanding of how representation alignment works, suggesting that focusing on spatial information is key to better image generation. It provides a simple and effective way to improve existing generative models and encourages further research into the underlying mechanisms of representation alignment.
Abstract
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder into intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance makes for a better target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising: spatial structure, rather than global performance, drives the generation performance of a target representation. To probe this further, we introduce two straightforward modifications that specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves the convergence speed of REPA across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representation alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
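A concrete reading of the spatial-structure measure named in the abstract (pairwise cosine similarity between patch tokens) can be sketched as follows. The function names are illustrative and not from the paper's code; the paper's actual metric may be computed differently, but the quantity itself is the matrix of cosine similarities between all pairs of patch embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def spatial_structure(tokens):
    """Pairwise cosine-similarity matrix between patch tokens.

    Illustrative sketch: entry [i][j] captures how similar patches i
    and j are in direction, which encodes the encoder's spatial
    structure independent of per-token magnitudes.
    """
    return [[cosine(u, v) for v in tokens] for u in tokens]
```

For example, two orthogonal patch tokens yield a similarity of 0, while identical directions yield 1, so the matrix summarizes which image regions the encoder treats as alike.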