
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou

2024-08-23


Summary

This paper introduces Show-o, a single transformer model that can both understand and generate text and images within one flexible framework.

What's the problem?

Current models are typically specialized for either multimodal understanding (for example, answering questions about images) or generation (for example, producing images from text), so tasks that require both skills need separate systems. This makes it difficult to build applications that must interpret and produce mixed text-and-image content effectively.

What's the solution?

The authors developed Show-o, which integrates autoregressive modeling (used for text) and discrete diffusion modeling (used for images) into a single transformer. This lets it adaptively handle different combinations of inputs and outputs, making it suitable for tasks such as answering questions about images, generating images from text descriptions, and text-guided inpainting. Across many benchmarks, Show-o performs as well as or better than specialized models with an equivalent or larger number of parameters. A small sketch of the two training objectives follows below.
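
To make the idea concrete, here is a minimal sketch of how one transformer can be trained with both objectives. This is not the released Show-o code; the class and function names (UnifiedTransformer, text_loss, image_loss) and all hyperparameters are illustrative assumptions. Text tokens are trained with causal next-token prediction, while discrete image tokens are trained with a mask-and-predict objective in the spirit of the discrete diffusion formulation the paper describes, and both losses share the same transformer trunk and output head.

```python
# Minimal sketch (not the authors' implementation): one transformer trunk,
# two training objectives over a shared discrete vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedTransformer(nn.Module):
    """Causal next-token prediction for text; mask-and-predict for image tokens."""
    def __init__(self, vocab_size=16384, dim=512, n_layers=4, n_heads=8, mask_token_id=0):
        super().__init__()
        self.mask_token_id = mask_token_id          # hypothetical [MASK] token id
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, causal):
        h = self.embed(tokens)
        attn_mask = None
        if causal:                                   # autoregressive path (text)
            L = tokens.size(1)
            attn_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.head(self.trunk(h, mask=attn_mask))

def text_loss(model, text_tokens):
    # Standard language-modeling loss: predict token t+1 from tokens up to t.
    logits = model(text_tokens[:, :-1], causal=True)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           text_tokens[:, 1:].reshape(-1))

def image_loss(model, image_tokens, mask_ratio=0.5):
    # Discrete-diffusion-style loss: corrupt image tokens by masking a random
    # subset, then predict the originals with full (bidirectional) attention.
    mask = torch.rand(image_tokens.shape) < mask_ratio
    corrupted = image_tokens.clone()
    corrupted[mask] = model.mask_token_id
    logits = model(corrupted, causal=False)
    return F.cross_entropy(logits[mask], image_tokens[mask])

# Toy usage: discrete text and image token ids drawn from a shared vocabulary.
model = UnifiedTransformer()
text = torch.randint(1, 16384, (2, 32))    # (batch, text length)
image = torch.randint(1, 16384, (2, 64))   # (batch, image token grid, flattened)
loss = text_loss(model, text) + image_loss(model, image)
loss.backward()
```

In the full model, text and image tokens can be mixed within one sequence and the attention pattern is adapted per modality; the sketch separates the two losses only to highlight the two training objectives that coexist in a single network.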

Why it matters?

This research is important because it represents a significant step forward in creating versatile AI systems that can understand and generate content across different formats. This could lead to advancements in fields like education, entertainment, and accessibility, where combining text and images is essential.

Abstract

We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.