
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

William Berman, Alexander Peysakhovich

2024-06-28

Summary

This paper talks about MUMU, a model that generates images from multimodal prompts that interleave text and images. It aims to produce a single coherent image that combines the visual concepts supplied in the prompt, such as a particular person, object, or style.

What's the problem?

Text alone often cannot precisely describe a specific subject or visual style, so models that take only text prompts struggle when a user wants to combine elements from different images, for example placing a particular person into a particular art style. Traditional text-to-image models lack a way to take reference images as part of the prompt and integrate those visual concepts into one output. On top of that, there is very little high-quality training data in which prompts interleave text and corresponding images, which makes it hard to train such a model directly.

What's the solution?

To address this, the authors built MUMU by bootstrapping a multimodal dataset from ordinary text-to-image data: they extract semantically meaningful image crops that correspond to words in the captions of synthetically generated and publicly available text-image pairs. The model itself pairs a vision-language model encoder with a diffusion decoder, so it can turn an interleaved prompt of text and image crops into a single coherent image. Even though it is trained only on crops taken from the same image, MUMU learns to combine inputs from different images. For example, given a realistic photo of a person and a cartoon reference image, it can produce that person rendered in the cartoon style.
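
To make the encoder-decoder setup concrete, here is a minimal sketch of the general idea rather than the authors' implementation: an interleaved prompt of text tokens and image-crop features is encoded into one conditioning sequence, and a toy diffusion-decoder block cross-attends to that sequence. All module names, dimensions, and helpers below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of MUMU-style conditioning:
# a vision-language encoder turns an interleaved prompt of text tokens and
# image crops into one hidden-state sequence, and a diffusion decoder block
# cross-attends to it. Dimensions and module names are illustrative.

import torch
import torch.nn as nn

class InterleavedPromptEncoder(nn.Module):
    """Encodes an interleaved (text, image) prompt into one conditioning sequence."""
    def __init__(self, hidden_dim=768, image_feat_dim=1024, vocab_size=32000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)    # text tokens
        self.image_proj = nn.Linear(image_feat_dim, hidden_dim)  # project vision features
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, segments):
        # segments: list of ("text", LongTensor[seq]) or
        #           ("image", FloatTensor[patches, image_feat_dim])
        parts = []
        for kind, value in segments:
            if kind == "text":
                parts.append(self.token_emb(value))   # [seq, hidden]
            else:
                parts.append(self.image_proj(value))  # [patches, hidden]
        prompt = torch.cat(parts, dim=0).unsqueeze(0)  # [1, total_len, hidden]
        return self.transformer(prompt)                # conditioning sequence

class DiffusionCrossAttentionBlock(nn.Module):
    """One toy decoder block: noisy image latents cross-attend to the prompt sequence."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, latents, cond):
        attended, _ = self.attn(query=latents, key=cond, value=cond)
        return self.norm(latents + attended)

# Toy usage: a prompt like "<crop of a man> man and his dog in <crop of a cartoon> style"
encoder = InterleavedPromptEncoder()
block = DiffusionCrossAttentionBlock()
prompt = [
    ("image", torch.randn(16, 1024)),         # crop of the person (16 patch features)
    ("text", torch.randint(0, 32000, (5,))),  # a few caption tokens
    ("image", torch.randn(16, 1024)),         # crop of the cartoon style reference
]
cond = encoder(prompt)             # [1, 37, 768] conditioning sequence
latents = torch.randn(1, 64, 768)  # toy noisy image latents
out = block(latents, cond)
print(out.shape)                   # torch.Size([1, 64, 768])
```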

Why it matters?

This research matters because it shows that a multimodal model can serve as a general-purpose controller for image generation, handling complex prompts that mix text and reference images. By supporting tasks such as style transfer and character consistency, MUMU could benefit applications in animation, gaming, and virtual reality, where combining different visual styles and elements is a common need.

Abstract

We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.
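
For a concrete picture of the dataset-bootstrapping step mentioned in the abstract, below is a hypothetical sketch, not the authors' pipeline: given an image, its caption, and word-level boxes from some grounding or detection step (not specified here), it replaces grounded caption words with their image crops to form an interleaved prompt. The Grounding class, the box format, and build_interleaved_prompt are all illustrative assumptions.

```python
# Hypothetical sketch of bootstrapping an interleaved prompt from a text-image
# pair: grounded caption words get their corresponding image crops inserted,
# mirroring prompts like "a <picture of a man> man and his <picture of a dog> dog".
# The grounding/detection step itself is assumed to exist and is not shown.

from dataclasses import dataclass
from typing import List, Tuple, Union
from PIL import Image

@dataclass
class Grounding:
    word: str                       # caption word, e.g. "man"
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def build_interleaved_prompt(
    image: Image.Image,
    caption: str,
    groundings: List[Grounding],
) -> List[Union[str, Image.Image]]:
    """Turn (caption, image) into an interleaved [crop, word, word, crop, ...] prompt."""
    grounded = {g.word: g.box for g in groundings}
    prompt: List[Union[str, Image.Image]] = []
    for word in caption.split():
        if word in grounded:
            # Insert the crop just before the word it grounds.
            prompt.append(image.crop(grounded[word]))
        prompt.append(word)
    return prompt

# Toy usage with a blank image and hand-written boxes.
img = Image.new("RGB", (512, 512))
caption = "a man and his dog in an animated style"
groundings = [Grounding("man", (10, 10, 200, 300)), Grounding("dog", (250, 200, 400, 400))]
prompt = build_interleaved_prompt(img, caption, groundings)
print([p if isinstance(p, str) else p.size for p in prompt])
```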