Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng
2025-02-11
Summary
This paper introduces Show-o Turbo, a new AI model that improves on an existing system called Show-o. Show-o Turbo is designed to convert text to images and images to text faster and more efficiently.
What's the problem?
The original Show-o model, while good at understanding and creating both text and images, is slow: it must denoise image tokens over many steps and generate text one token at a time. This makes it inefficient for practical use.
What's the solution?
The researchers created Show-o Turbo using a technique called consistency distillation, which teaches the model to produce high-quality results in far fewer steps. They also introduced new training strategies, such as splitting the denoising trajectories into segments and gradually increasing the difficulty through curriculum learning, to help the model converge more effectively.
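The segmentation-plus-curriculum idea can be sketched abstractly: split a denoising trajectory into contiguous segments, enforce consistency only within each segment, then shrink the number of segments over training so the model must eventually be consistent across the whole trajectory. The sketch below is a toy illustration under assumed details (the function names, the even split, and the halving schedule are not from the paper).

```python
# Toy sketch of trajectory segmentation with a curriculum (assumed scheme,
# not the authors' exact implementation): a denoising trajectory of
# num_steps states is split into num_segments contiguous segments, and
# the curriculum halves the segment count stage by stage, so segments
# grow longer until one segment covers the full trajectory.

def segment_trajectory(num_steps, num_segments):
    """Split step indices 0..num_steps into contiguous (start, end) segments."""
    assert num_steps % num_segments == 0, "even split assumed for simplicity"
    seg_len = num_steps // num_segments
    return [(i * seg_len, (i + 1) * seg_len) for i in range(num_segments)]

def curriculum(num_steps, initial_segments):
    """Yield segmentations with a halving segment count (easy -> hard)."""
    k = initial_segments
    while k >= 1:
        yield segment_trajectory(num_steps, k)
        k //= 2

for stage, segments in enumerate(curriculum(num_steps=16, initial_segments=4)):
    print(f"stage {stage}: {segments}")
```

Within each segment, a distillation loss would pull the student's prediction from any intermediate state toward its prediction from the segment's start, which is what lets the distilled model jump across many teacher steps at once.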
Why it matters?
This matters because Show-o Turbo can create images from text descriptions and write text about images much faster than the original Show-o, without losing much quality. For example, it can generate images in just 4 sampling steps that score higher than what Show-o produces in 8 steps, and it describes images about 1.5x faster. This speedup could make AI systems that work with both text and images more practical for real-world applications, leading to better tools for designers, writers, and other creative professionals.
Abstract
There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), a qualified approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming that of the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at https://github.com/zhijie-group/Show-o-Turbo.