Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo
2025-12-22
Summary
This paper focuses on improving how AI creates images by changing the representation the generator works in. Instead of the low-level latents typically used for pixel reconstruction, it explores using the more meaningful 'features' extracted from images by AI 'understanding' models (representation encoders) as the basis for image generation.
What's the problem?
Currently, using these 'understanding' features for image creation has two main issues. First, the feature space is not tightly organized (regularized), so the generative model can drift into regions that do not correspond to real images, producing distorted or inaccurate object structures. Second, because these encoders were never trained to reconstruct images pixel-for-pixel, a generator built on their features struggles with fine details such as textures and precise shapes.
What's the solution?
The researchers developed a method to prepare these 'understanding' features for image generation. They designed a training process that forces the model to both preserve the overall meaning of an image *and* accurately reconstruct its pixel details. The result is a compact set of features that carries both semantic information and fine-grained detail, making it a better basis for generation. On top of these improved features, they built a single model that can both create images from text descriptions and edit existing images.
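To make the idea concrete, here is a minimal sketch of what such a combined semantic + pixel reconstruction objective could look like. This is not the authors' code; the module names and loss weights (rep_encoder, latent_head, pixel_decoder, lambda_sem, lambda_pix) are illustrative assumptions.

```python
# Illustrative sketch only: a dual objective that keeps encoder features
# semantically aligned while also forcing pixel-level reconstruction.
# All module names and weights are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def semantic_pixel_loss(images, rep_encoder, latent_head, pixel_decoder,
                        lambda_sem=1.0, lambda_pix=1.0):
    # A frozen representation encoder provides the semantic target.
    with torch.no_grad():
        target_feats = rep_encoder(images)            # (B, N, D) patch features

    # A trainable head compresses features into the compact generative latent.
    latents = latent_head(target_feats)               # e.g. (B, 96, H/16, W/16)

    # Semantic term: latents, projected back, should match the encoder features.
    pred_feats = latent_head.project_back(latents)    # (B, N, D), hypothetical
    loss_sem = 1.0 - F.cosine_similarity(pred_feats, target_feats, dim=-1).mean()

    # Pixel term: a decoder must reconstruct the image from the same latents.
    recon = pixel_decoder(latents)                    # (B, 3, H, W)
    loss_pix = F.l1_loss(recon, images)

    return lambda_sem * loss_sem + lambda_pix * loss_pix
```

The key point the sketch illustrates is that a single latent is supervised from two directions at once: it must stay close to the understanding model's features while still carrying enough information to rebuild the pixels.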
Why it matters?
This work is important because it shows that AI models designed for understanding images can be effectively adapted to also *create* high-quality images. The approach makes the generator converge faster during training, improves results in both text-to-image generation and image editing, and opens the door to more powerful and versatile AI image tools.
Abstract
Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
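For intuition on how compact the stated latent is, the following back-of-the-envelope calculation compares raw pixel count to the 96-channel, 16x-downsampled latent described in the abstract; the 256x256 RGB input resolution is an assumption for illustration only.

```python
# Compression check for the latent described in the abstract: 96 channels
# at 16x spatial downsampling. The 256x256 RGB input is an assumed example.
H = W = 256
pixels = 3 * H * W                      # 196,608 values per image
latent = 96 * (H // 16) * (W // 16)     # 96 * 16 * 16 = 24,576 values
print(f"pixel values : {pixels}")
print(f"latent values: {latent}")
print(f"compression  : {pixels / latent:.1f}x fewer values than raw pixels")
```

Under this assumed resolution, the latent holds roughly 8x fewer values than the raw image while still being required to encode both semantics and reconstructable detail.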