One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
2025-12-09
Summary
This paper introduces a new method called FAE (Feature Auto-Encoder) that improves AI image generation by making better use of existing, powerful pretrained image-understanding models.
What's the problem?
Current image generation models often struggle to effectively use pre-trained image understanding systems. These systems are good at 'knowing' what's in an image, but image generators need a different kind of information to actually *create* realistic images. The core issue is a dimensionality mismatch: understanding-focused encoders work best with high-dimensional, detailed features, while generators work best with compact, low-dimensional latents that faithfully preserve the noise injected during generation. This mismatch makes it hard to combine the strengths of both.
What's the solution?
FAE solves this by acting as a bridge between the two. It takes the detailed information from the pre-trained image understanding system and compresses it into a simpler format that the image generator can use. It does this using two separate 'decoders': one to rebuild the original detailed information, and another to use that rebuilt information to create the image. Importantly, it achieves this with a relatively simple design, often using just one attention layer.
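The two-decoder design described above can be sketched in code. This is a minimal, illustrative NumPy sketch, not the paper's actual implementation: all names, layer sizes, and the use of random weights in place of trained parameters are assumptions made for clarity. It shows the shape of the idea: a single attention layer compresses high-dimensional encoder features into a low-dimensional latent, one decoder rebuilds the original feature space, and a second decoder consumes the rebuilt features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's): N tokens,
# D-dim encoder features, d-dim compressed latent.
N, D, d = 256, 768, 16

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_encoder(feats, Wq, Wk, Wv):
    """A single self-attention layer that projects features down to d dims."""
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v  # (N, d) low-dimensional latent

# Random weights stand in for trained parameters.
Wq = rng.normal(size=(D, d))
Wk = rng.normal(size=(D, d))
Wv = rng.normal(size=(D, d))
W_feat_dec = rng.normal(size=(d, D))  # decoder 1: latent -> original feature space
W_img_dec = rng.normal(size=(D, 3))   # decoder 2: rebuilt features -> per-token pixel values

feats = rng.normal(size=(N, D))            # output of a pretrained encoder (DINO/SigLIP-like)
z = attention_encoder(feats, Wq, Wk, Wv)   # compressed latent, shape (N, d)
feats_rec = z @ W_feat_dec                 # decoder 1 reconstructs the feature space
pixels = feats_rec @ W_img_dec             # decoder 2 generates image content from rebuilt features

print(z.shape, feats_rec.shape, pixels.shape)  # (256, 16) (256, 768) (256, 3)
```

In the real system both decoders would be deep networks trained with reconstruction and generation objectives; the point of the sketch is only that the compression step itself can be as small as one attention layer.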
Why does it matter?
This is important because it allows for higher quality image generation with less training time. The paper shows FAE can achieve results comparable to the best existing methods, and even surpass them in some cases, while also learning faster. This means we can create more realistic and detailed images more efficiently, opening up possibilities for various applications like art, design, and scientific visualization.
Abstract
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.