
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk

2024-12-03

Summary

This paper presents Switti, a new type of transformer model designed to generate images from text quickly and efficiently.

What's the problem?

Generating high-quality images from text descriptions can be slow and resource-intensive with existing models. Many of these models struggle with performance and require a lot of memory, which makes them less practical for real-world applications.

What's the solution?

Switti generates images scale by scale: it predicts a complete low-resolution token map first, then progressively larger ones, producing all tokens at each resolution in parallel. The authors modify an existing next-scale prediction architecture to improve convergence and overall performance, and, after observing that the model's self-attention depends only weakly on preceding scales, remove that dependence entirely, which makes sampling faster and lowers memory usage while slightly improving quality. Additionally, the model disables classifier-free guidance at high resolutions, which speeds up the process even more and improves the quality of fine details in the images.
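The sampling loop described above can be sketched in a few lines. This is a minimal pure-Python illustration, not Switti's actual API: the `model` callable, the scale schedule, and the `cfg_cutoff` threshold are all assumptions chosen to show the two key ideas, coarse-to-fine parallel prediction and classifier-free guidance applied only at low-resolution scales.

```python
def sample_scale_wise(model, text_emb, null_emb,
                      scales=(1, 2, 4, 8, 16, 32),
                      cfg_scale=6.0, cfg_cutoff=16):
    """Hypothetical scale-wise sampler (illustrative names, not Switti's API).

    Each step predicts an entire token map at the next resolution in one
    forward pass; classifier-free guidance (CFG) is applied only below
    `cfg_cutoff`, mirroring the paper's finding that guidance at
    high-resolution scales is unnecessary and can hurt fine details.
    """
    token_maps = []                                  # coarse-to-fine maps so far
    for res in scales:
        cond = model(token_maps, text_emb, res)      # conditional logits
        if res < cfg_cutoff:
            uncond = model(token_maps, null_emb, res)  # unconditional logits
            # standard CFG: push logits away from the unconditional prediction
            logits = [[u_i + cfg_scale * (c_i - u_i) for c_i, u_i in zip(c, u)]
                      for c, u in zip(cond, uncond)]
        else:
            logits = cond                            # CFG disabled: faster here
        # greedy decode: pick the highest-scoring token at every position
        token_maps.append([max(range(len(p)), key=p.__getitem__) for p in logits])
    return token_maps[-1]  # finest token map; a VQ decoder would map it to pixels
```

In a real system `model` would be the transformer producing per-position vocabulary logits and the final token map would be fed to a VQ-VAE decoder; the skeleton only shows how guidance is switched off past the cutoff resolution.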

Why it matters?

This research is significant because it provides a faster and more efficient way to create images from text, making it easier for developers and artists to use AI for generating visuals. By improving the speed and quality of image generation, Switti can be applied in various fields such as gaming, advertising, and content creation, where high-quality visuals are essential.

Abstract

This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.