Visual Generation Tuning

Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang

2025-12-09

Summary

This paper introduces a new method called Visual Generation Tuning (VGT) that allows existing Vision Language Models (VLMs) – which are good at understanding both images and text – to also *create* images. It’s about unlocking a hidden ability within these models.

What's the problem?

Currently, VLMs excel at tasks like describing an image or answering questions about it, but they struggle to generate realistic images from scratch. Existing image-generation pipelines typically require heavy training and complex setups, such as dedicated components (pixel-level VAEs) that compress images into a latent space and decompress them back into pixels. These pipelines can be slow to train and don't always produce the best results, which makes it hard to add generation to a VLM without a lot of extra work.

What's the solution?

The researchers propose VGT, which essentially ‘tunes’ a pre-trained VLM so that it can also generate images. Instead of building new, complicated components for image creation, they connect the VLM's existing semantic understanding of images directly to a pixel decoder, the part of the model that actually renders the image. This makes training converge much faster (about 20 times faster, according to the paper) and performs better than existing autoregressive image-generation techniques, achieving high scores on standard image-quality benchmarks. In effect, they repurpose what the model already knows about images to create new ones.
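The core idea, aligning the VLM's semantic features with the latent space of a pixel decoder, can be illustrated with a toy objective. This is a conceptual sketch only, not the authors' implementation: the function name `cosine_alignment_loss` and the use of plain cosine similarity are illustrative assumptions, standing in for whatever alignment objective the paper actually trains with.

```python
import numpy as np

def cosine_alignment_loss(semantic_feats, decoder_latents):
    """Mean (1 - cosine similarity) between paired feature vectors.

    semantic_feats:  (N, D) features from the VLM's semantic encoder
    decoder_latents: (N, D) latent targets expected by the pixel decoder
    A loss of 0 means every pair points in exactly the same direction.
    """
    a = semantic_feats / np.linalg.norm(semantic_feats, axis=1, keepdims=True)
    b = decoder_latents / np.linalg.norm(decoder_latents, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
# Scaled copies are perfectly aligned in direction, so the loss is ~0.
aligned = cosine_alignment_loss(feats, 2.0 * feats)
# Unrelated random latents are misaligned, so the loss is larger.
misaligned = cosine_alignment_loss(feats, rng.normal(size=(4, 8)))
print(aligned, misaligned)
```

Minimizing a loss like this pulls the encoder's features and the decoder's latents into a shared space, which is what lets the VLM's existing representations drive image synthesis without a separately trained pixel-level VAE.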

Why it matters?

This work is important because it shows that powerful VLMs already have the potential to generate images, and we don't necessarily need entirely new models or complex architectures to do so. VGT offers a simpler and more effective way to add image generation capabilities to existing models, paving the way for more versatile ‘foundation models’ that can handle all sorts of visual and language tasks in a unified way. This could lead to significant advancements in areas like image editing, content creation, and artificial intelligence in general.

Abstract

Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language models. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLMs trained for multimodal understanding with the capabilities of visual generation, which paves the new avenue to explore next-generation unified multimodal foundation models. Models and codes are available at https://github.com/hustvl/VGT.
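The abstract reports reconstruction quality as PSNR (26.67 dB at a 28x compression ratio). For readers unfamiliar with the metric, PSNR is the standard peak signal-to-noise ratio, 10·log10(MAX² / MSE); the snippet below is a generic illustration of that formula on a toy image, not code from the paper.

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images (higher is better)."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

img = np.full((8, 8), 100.0)
noisy = img + 5.0          # constant error of 5 per pixel -> MSE = 25
print(round(psnr(img, noisy), 2))  # → 34.15
```

Higher PSNR means the decoded image is closer to the original, so reporting 26.67 PSNR while compressing 28x indicates the semantic-encoder-based autoencoder retains substantial pixel-level fidelity.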