Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer

2025-10-20

Summary

This paper introduces a new way to create images with artificial intelligence, called the Representation Tokenizer, or RepTok. It represents an entire image with just a single continuous 'token' – think of it as a compact numerical code – that captures the important information about the picture.

What's the problem?

Traditionally, generating images with AI relies on complex latent spaces, often 2D grids containing many numbers per image. These grids are spatially redundant and computationally expensive, so training the AI takes a lot of processing power and time. Existing methods struggle to balance capturing detailed image information with keeping the representation compact and manageable.

What's the solution?

RepTok starts from a pre-trained self-supervised AI model that is already good at understanding images. Instead of retraining the whole model, the authors fine-tune only a small part – the 'semantic token embedding' – so the token also carries the low-level details needed to rebuild the picture. They pair this token with a generative decoder, trained jointly with a flow matching objective, that reconstructs the image from that single token. To keep the token space well-behaved, they add a cosine-similarity rule that keeps the adapted tokens close to the original model's representations, so similar images stay close together. This simplifies the pipeline and sharply reduces the computation needed.
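To make the two training signals concrete, here is a minimal numpy sketch of a joint objective of this shape. The function name, the weighting factor `lam`, and the exact way the two terms are combined are assumptions for illustration, not the paper's actual implementation; only the two ingredients (a flow matching reconstruction loss and a cosine-similarity regularizer on the token) come from the text.

```python
import numpy as np

def training_objective(z_ssl, z_adapted, x0, x1, velocity_pred, lam=0.1):
    """Sketch of a RepTok-style joint objective (names and lam are assumptions):
    flow-matching reconstruction loss + cosine regularizer on the token."""
    # Flow matching on the straight path x_t = (1 - t) * x0 + t * x1,
    # whose target velocity is the constant x1 - x0.
    fm = float(np.mean((velocity_pred - (x1 - x0)) ** 2))
    # Keep the adapted token aligned with the frozen SSL token's direction,
    # preserving the geometry of the original representation space.
    cos = float(np.dot(z_ssl, z_adapted) /
                (np.linalg.norm(z_ssl) * np.linalg.norm(z_adapted)))
    return fm + lam * (1.0 - cos)
```

When the decoder predicts the target velocity exactly and the adapted token points in the same direction as the original SSL embedding, both terms vanish, so the objective rewards faithful reconstruction without letting the token space drift.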

Why it matters?

This research shows that you can create high-quality images efficiently by cleverly using existing AI technology and representing images with a single, compact token. This is important because it lowers the cost and time needed for image generation, and it opens the door to creating images from text descriptions with less training data than previously required, making AI image creation more accessible and practical.

Abstract

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
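At sampling time, a flow matching decoder like the one described in the abstract generates an image by integrating a learned velocity field from noise toward data, conditioned on the latent token. The sketch below shows the standard Euler integration scheme for such a model; the function names, dimensions, and conditioning interface are assumptions for illustration, not RepTok's actual code.

```python
import numpy as np

def sample_from_token(token, velocity_fn, steps=50, dim=16, seed=0):
    """Euler integration of a flow-matching ODE from noise to data,
    conditioned on a single latent token (hypothetical interface)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)  # start from Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # velocity_fn stands in for the learned decoder network
        x = x + dt * velocity_fn(x, t, token)
    return x
```

Because the whole condition is one token rather than a 2D grid of latents, each decoder call processes a much smaller input, which is where the training- and inference-cost savings the abstract mentions come from.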