Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
2025-02-28

Summary
This paper introduces Dream Engine, a new way for AI to generate images from text and image inputs together. It's designed to be more flexible and powerful than previous methods, which mostly relied on text alone to create images.
What's the problem?
Current AI systems are good at creating images from text, but they struggle when you want to combine ideas from multiple images or use both text and images to guide the creation process. This makes it hard to generate complex images that mix visual concepts from different sources.
What's the solution?
The researchers created Dream Engine, which uses large multimodal models (LMMs) that understand both text and images. They replaced the text-only encoders of existing image generation models (like Stable Diffusion 3.5) with these LMMs, so the system can take text and images as input at the same time. They also developed a two-stage training process: first teaching the model to align text and image information, then teaching it to follow complex instructions that mix both.
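To make the idea concrete, here is a minimal sketch (not the authors' released code) of how an LMM could stand in for the text-only encoders of a diffusion model: interleaved text and image inputs are encoded into one hidden-state sequence, projected to the diffusion transformer's conditioning width, and used as its conditioning context. All names here (Projector, encode_interleaved, the lmm and diffusion_transformer callables) are hypothetical placeholders, not real QwenVL or SD3.5 APIs.

```python
# Minimal sketch, assuming the LMM returns one hidden state per token for
# both text tokens and image patches (the shared representation space the
# paper relies on). Names are illustrative placeholders only.
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps LMM hidden states into the conditioning space that the
    diffusion transformer expects (the two widths may differ)."""
    def __init__(self, lmm_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(lmm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)

def encode_interleaved(lmm, projector, text_tokens, image_patches):
    """Encode an interleaved text-image prompt into one conditioning sequence."""
    hidden = lmm(text_tokens=text_tokens, image_patches=image_patches)  # (B, L, lmm_dim)
    return projector(hidden)                                            # (B, L, cond_dim)

# Usage idea: the projected sequence replaces the CLIP/T5 text embeddings
# as the cross-attention context of the diffusion transformer.
# cond = encode_interleaved(lmm, projector, text_tokens, image_patches)
# noise_pred = diffusion_transformer(latents, timestep, context=cond)
```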
Why does it matter?
This matters because it could make AI image generation much more versatile and powerful. It could allow people to create more complex and specific images by describing what they want in words and also showing example images. This could be useful in fields like design, art, and even in helping people with visual impairments to better understand complex visual concepts. It's also a step towards AI that can understand and work with different types of information (like text and images) in a more human-like way.
Abstract
The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, such as Canny edge and depth maps, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To bridge this gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space in which image and text can be well aligned to serve as a condition for external diffusion models. Building on this finding, we propose Dream Engine, an efficient and unified framework for arbitrary text-image interleaved control in image generation models. Starting from powerful text-to-image models such as SD3.5, we replace the original text-only encoders with versatile multimodal information encoders such as QwenVL. Our approach uses a two-stage training paradigm consisting of joint text-image alignment followed by multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving an overall score of 0.69 on the GenEval benchmark and matching the performance of state-of-the-art text-to-image models such as SD3.5 and FLUX.
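For readers who want a concrete picture of the two-stage training paradigm mentioned in the abstract, the sketch below shows one plausible way to schedule it. Which parameters are frozen at each stage is an assumption for illustration, not the paper's exact recipe; lmm, projector, and diffusion_transformer are the hypothetical modules from the earlier sketch.

```python
# Sketch of a two-stage schedule: (1) joint text-image alignment,
# (2) multimodal interleaved instruction tuning. Freezing choices and the
# learning rate are illustrative assumptions, not the authors' settings.
import itertools
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, lmm, projector, diffusion_transformer):
    if stage == 1:
        # Stage 1 -- alignment: keep the pretrained LMM and diffusion
        # backbone frozen, learn only the projector that bridges them.
        set_trainable(lmm, False)
        set_trainable(diffusion_transformer, False)
        set_trainable(projector, True)
        params = projector.parameters()
    else:
        # Stage 2 -- interleaved instruction tuning: also adapt the
        # diffusion backbone so it follows mixed text-image prompts.
        set_trainable(lmm, False)
        set_trainable(projector, True)
        set_trainable(diffusion_transformer, True)
        params = itertools.chain(projector.parameters(),
                                 diffusion_transformer.parameters())
    return torch.optim.AdamW(params, lr=1e-4)
```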