
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu

2025-12-15

Summary

This paper explores a new way to create images from text descriptions by working directly with the way computers 'understand' images, using the internal representations of something called a Visual Foundation Model (VFM) instead of the usual compression model (a variational autoencoder).

What's the problem?

Most text-to-image diffusion systems today first compress images with a variational autoencoder (VAE) and generate inside that compressed space. Meanwhile, VFMs, which are very good at understanding what's *in* an image, have barely been used as the space in which images are actually *created*. It's like knowing all the parts of a car but not using that knowledge to build one.

What's the solution?

The researchers developed a system called SVG-T2I. It takes text as input and uses a standard text-to-image diffusion process, but instead of denoising pixels or VAE latents, it operates entirely within the VFM's feature space, its 'understanding' of images. A paired decoder then turns the generated features back into a high-quality picture, so the system forms a bridge from text descriptions, through the VFM's internal representation, to the final image (a simplified sketch of this idea is shown below).
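
To make "generating inside the VFM's feature space" more concrete, here is a minimal sketch in PyTorch of the general idea: a text-conditioned denoiser operates on feature tokens rather than pixels, and the result is only turned into an image by a separate decoder afterwards. All class names, dimensions, and the crude sampling rule are assumptions made for illustration; they are not taken from the SVG-T2I code.

```python
# Minimal, heavily simplified sketch of "diffusion in a VFM feature space".
# Every name, dimension, and update rule here is an illustrative placeholder,
# not the actual SVG-T2I architecture or training code.
import torch
import torch.nn as nn


class ToyFeatureDenoiser(nn.Module):
    """Stand-in for the text-conditioned diffusion model: it denoises
    VFM feature tokens instead of pixels or VAE latents."""

    def __init__(self, feat_dim=768, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + text_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, noisy_feats, text_emb, t):
        # noisy_feats: (B, N, feat_dim) tokens in the VFM representation space
        # text_emb:    (B, text_dim) pooled text embedding
        # t:           (B,) diffusion timestep in [0, 1]
        B, N, _ = noisy_feats.shape
        cond = torch.cat([text_emb, t[:, None]], dim=-1)        # (B, text_dim + 1)
        cond = cond[:, None, :].expand(B, N, -1)                # broadcast to every token
        return self.net(torch.cat([noisy_feats, cond], dim=-1))  # predicted noise


@torch.no_grad()
def sample_features(denoiser, text_emb, n_tokens=256, feat_dim=768, steps=50):
    """Start from Gaussian noise in the feature space and iteratively denoise,
    conditioned on the text (a toy sampler, not a real diffusion scheduler)."""
    feats = torch.randn(text_emb.size(0), n_tokens, feat_dim)
    for i in reversed(range(steps)):
        t = torch.full((text_emb.size(0),), i / steps)
        feats = feats - denoiser(feats, text_emb, t) / steps    # crude denoising step
    return feats  # generated VFM features, still not an image


denoiser = ToyFeatureDenoiser()
text_emb = torch.randn(1, 512)            # placeholder for a real text encoder's output
vfm_feats = sample_features(denoiser, text_emb)
print(vfm_feats.shape)                    # torch.Size([1, 256, 768])
# A feature-to-pixel decoder (the decoder half of the open-sourced autoencoder)
# would then map vfm_feats to the final image.
```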

Why it matters?

This is important because it shows that VFMs aren't just good at recognizing images; their representations also carry enough information to generate new, realistic images. By open-sourcing their work, the researchers hope to encourage more research into using these powerful image-understanding models for image creation, potentially leading to better and more efficient image generation techniques.

Abstract

Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.