OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover

2024-12-06

Summary

This paper introduces OmniFlow, a new model that can generate different types of media, such as text, images, and audio, from any of those inputs, making it easier to create content across multiple formats.

What's the problem?

Existing generative models usually focus on a single task, like turning text into images or text into audio. This limits their flexibility and means a separate model is needed for each kind of conversion, which is inefficient and complicated.

What's the solution?

OmniFlow introduces a unified approach that handles any-to-any generation with a single model. It extends a technique called rectified flow, used in text-to-image models, so that multiple types of data can be modeled jointly: the model can take inputs like text, images, or audio and generate outputs in any of those formats. Architecturally, it builds on the text-to-image MMDiT transformer of Stable Diffusion 3, adding audio and text modules that can be pretrained separately and then merged for fine-tuning. It also includes a new guidance mechanism that lets users control how strongly the different inputs shape the generated output, making it more versatile and efficient than previous models.
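
To make the joint rectified-flow idea concrete, here is a minimal, hypothetical sketch of what a training step over several modalities could look like. The names (joint_model, the latent dictionary, the shared timestep) are illustrative assumptions, not OmniFlow's actual code; the real model works on encoded latents inside an extended MMDiT transformer.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of one multi-modal rectified-flow training step.
# `joint_model` and the latent tensors are illustrative assumptions,
# not OmniFlow's actual API.

def rectified_flow_step(joint_model, latents):
    """latents: dict like {"text": ..., "image": ..., "audio": ...} of clean latent tensors."""
    example = next(iter(latents.values()))
    t = torch.rand(example.shape[0], device=example.device)  # shared timestep in [0, 1]

    noisy, targets = {}, {}
    for name, x0 in latents.items():
        eps = torch.randn_like(x0)                    # Gaussian noise per modality
        t_ = t.view(-1, *([1] * (x0.dim() - 1)))      # broadcast t over feature dims
        noisy[name] = (1.0 - t_) * x0 + t_ * eps      # straight-line interpolation (rectified flow)
        targets[name] = eps - x0                      # velocity pointing from data to noise

    preds = joint_model(noisy, t)                     # joint transformer predicts a velocity per modality
    return sum(F.mse_loss(preds[k], targets[k]) for k in latents)
```

In this sketch all modalities share one timestep and the losses are simply summed; the paper studies such design choices in more detail.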

Why it matters?

This research is important because it simplifies the process of creating multimedia content by allowing one model to handle various tasks. By improving the way different types of data interact, OmniFlow could lead to advancements in creative fields like music production, video editing, and graphic design, making it easier for creators to produce high-quality content.

Abstract

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
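
The guidance mechanism described in the abstract controls how strongly the generated output aligns with each conditioning modality. The paper's exact formulation is not reproduced here; the sketch below is a classifier-free-guidance-style interpretation, assuming the model can produce an unconditional velocity plus one velocity per kept input modality, combined with user-chosen per-modality weights.

```python
import torch

def guided_velocity(v_uncond, v_cond, weights):
    """Classifier-free-guidance-style combination of per-modality predictions (illustrative only).

    v_uncond : velocity predicted with all conditioning dropped
    v_cond   : dict of velocities, each predicted with only one input modality kept,
               e.g. {"text": v_text, "audio": v_audio}
    weights  : per-modality guidance scales; larger values push the sample to
               align more strongly with that input
    """
    v = v_uncond.clone()
    for name, v_c in v_cond.items():
        v = v + weights.get(name, 0.0) * (v_c - v_uncond)
    return v

# Example: make an image sample follow the text prompt more than the audio input.
# v = guided_velocity(v_uncond, {"text": v_text, "audio": v_audio},
#                     {"text": 5.0, "audio": 1.5})
```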