MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing
Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
2025-03-03
Summary
This paper introduces MIGE, a new AI system that combines two tasks: creating images from scratch based on instructions and editing existing images based on instructions. It uses a unified approach so that both tasks reinforce each other and work better together.
What's the problem?
Current methods for generating and editing images with AI treat these tasks separately, which limits their ability to handle complex instructions and generalize to new situations. They also struggle with maintaining consistency between the input (like text or an image) and the output.
What's the solution?
The researchers created MIGE, which uses a single framework to handle both image creation and editing. A multimodal encoder maps visual and text information into one shared space, allowing the AI to learn from both tasks at the same time. This joint training improves how well the AI follows instructions and how consistent its outputs stay with the input. MIGE also introduces new ways to process free-form multimodal instructions, making it better at handling complex combinations of tasks.
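The core idea of the encoder described above can be sketched as projecting each modality into one shared space and concatenating the results into a single instruction sequence. The dimensions, weights, and function names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIS, D_TXT, D_UNIFIED = 512, 768, 1024  # assumed, illustrative dimensions

# Hypothetical linear projections mapping each modality into a shared space.
W_vis = rng.standard_normal((D_VIS, D_UNIFIED)) * 0.02
W_txt = rng.standard_normal((D_TXT, D_UNIFIED)) * 0.02

def fuse(visual_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """Project both modalities into one space, then join them as one token sequence."""
    v = visual_feats @ W_vis                # (n_vis_tokens, D_UNIFIED)
    t = text_feats @ W_txt                  # (n_txt_tokens, D_UNIFIED)
    return np.concatenate([t, v], axis=0)   # unified multimodal instruction

# A "multimodal instruction": text tokens plus image tokens of a reference subject.
text_tokens = rng.standard_normal((8, D_TXT))
image_tokens = rng.standard_normal((16, D_VIS))
instruction = fuse(image_tokens, text_tokens)
print(instruction.shape)  # (24, 1024)
```

Because generation and editing instructions end up in the same unified sequence format, a single downstream model can be trained jointly on both.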
Why it matters?
This matters because MIGE sets a new standard for AI systems that work with images, making them more flexible and accurate. By combining generation and editing into one system, it can handle more creative and complex tasks, like modifying an image while keeping certain features intact. This could be useful in fields like design, advertising, or any area where high-quality image manipulation is needed.
Abstract
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Therefore, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. MIGE introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: By leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: Learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a state-of-the-art in the new task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
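The abstract's shared input-output formulation, where generation is "creation on a blank canvas" and editing is "modification of an existing image", can be sketched as a single input builder that defaults the source image to a blank canvas. The helper name and canvas size below are hypothetical, not the paper's preprocessing code:

```python
import numpy as np

H, W = 64, 64  # assumed canvas size for illustration

def unified_input(instruction_tokens, source_image=None):
    """Shared formulation: both tasks become (multimodal instruction, source image) -> image.

    Subject-driven generation passes no source image and gets a blank canvas;
    instruction-based editing passes the image to be modified.
    """
    if source_image is None:                  # generation case
        source_image = np.zeros((H, W, 3))    # blank canvas
    return {"instruction": instruction_tokens, "source": source_image}

gen_task = unified_input(["a", "photo", "of", "the", "subject"])
edit_task = unified_input(["make", "the", "sky", "darker"],
                          source_image=np.ones((H, W, 3)))
print(gen_task["source"].sum(), edit_task["source"].sum())  # 0.0 12288.0
```

Casting both tasks into this one shape is what allows the joint training and cross-task transfer the abstract describes.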