
FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang

2026-04-09


Summary

This paper introduces a new way to create images from different kinds of inputs, like text descriptions, layouts, and editing commands, by treating everything as a visual signal.

What's the problem?

Traditionally, AI image generation has relied heavily on text as the starting point: the user essentially tells the model what to draw. This approach struggles to truly understand the visual world and limits how creatively the model can work within an image. It also has trouble aligning different types of input data and typically requires separate systems for different tasks.

What's the solution?

The researchers developed a system called FlowInOne that converts all inputs (text, layouts, editing instructions) into visual 'prompts'. A single flow matching model then processes these visual prompts and directly generates or modifies images. Think of it as giving the AI a visual idea instead of a written one. To support this, they also built VisPrompt-5M, a large dataset of visual prompt pairs, and VP-Bench, a benchmark for testing how well such systems perform.
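To give a feel for the flow matching idea at the core of the approach, here is a toy sketch, not the paper's actual model: it learns a velocity field that transports random noise samples toward a target distribution along straight-line paths, then generates samples by integrating that field. A simple linear model on 2-D points stands in for FlowInOne's network on images; all names and hyperparameters here are illustrative.

```python
import numpy as np

# Toy flow matching sketch: learn v(x, t) so that integrating
# dx/dt = v(x, t) carries noise samples x0 toward data samples x1.
# 2-D points stand in for images; the real system uses a large network.

rng = np.random.default_rng(0)

def sample_data(n):
    # "Data" distribution: a cluster centered at [3, -2] (stand-in for target images)
    return rng.normal(loc=[3.0, -2.0], scale=0.3, size=(n, 2))

# Hypothetical linear velocity model: v(x, t) = [x, t, 1] @ W
W = np.zeros((4, 2))
lr = 0.05

for step in range(2000):
    x1 = sample_data(256)                       # target samples
    x0 = rng.normal(size=(256, 2))              # source noise
    t = rng.uniform(size=(256, 1))
    xt = (1 - t) * x0 + t * x1                  # straight-line interpolant
    target = x1 - x0                            # constant velocity along each path
    feats = np.concatenate([xt, t, np.ones_like(t)], axis=1)
    pred = feats @ W
    grad = feats.T @ (pred - target) / len(x1)  # gradient of the MSE loss
    W -= lr * grad

# Sampling: integrate the learned ODE from noise with Euler steps
x = rng.normal(size=(512, 2))
steps = 50
for k in range(steps):
    t = np.full((512, 1), k / steps)
    feats = np.concatenate([x, t, np.ones_like(t)], axis=1)
    x += (1 / steps) * (feats @ W)

print(x.mean(axis=0))  # sample mean should land near the data center [3, -2]
```

The training loop never simulates the ODE; it only regresses on interpolated points, which is what makes flow matching simpler than diffusion-style noise scheduling, one of the bottlenecks the paper says its formulation removes.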

Why it matters?

This is a big step because it moves away from relying so much on text, allowing the AI to better understand and manipulate images directly. It simplifies the process of image generation, making it more unified and potentially more powerful, and achieves better results than existing methods, even those used by commercial companies.

Abstract

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.