DreamOmni2: Multimodal Instruction-based Editing and Generation
Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
2025-10-10
Summary
This paper introduces a new approach to image editing and generation that combines the strengths of both instruction-following and subject-driven methods, aiming for more practical and versatile results.
What's the problem?
Text instructions alone often aren't specific enough for current image editing tools to capture the desired changes, so users also need to provide example images. Similarly, subject-driven generation is usually limited to combining concrete objects or people and struggles with more abstract ideas or concepts. Basically, existing methods aren't flexible enough to handle complex user requests.
What's the solution?
The researchers created a system called DreamOmni2 that addresses these issues by allowing both text *and* images to be used as instructions, and by expanding the range of concepts that can be manipulated, from concrete things like 'a cat' to abstract ideas like 'happiness'. They built a new training dataset through a multi-step pipeline that mixes image features to extract concepts and then uses those extraction and editing models to generate training examples. They also designed a model framework that can process multiple input images at once without mixing them up, and trained it jointly with a vision-language model (VLM) so it can better follow complex instructions.
Why it matters?
This work is important because it moves image editing and generation closer to being truly useful for everyday users. By allowing for more nuanced instructions and the manipulation of abstract concepts, it opens up possibilities for more creative and precise image manipulation, potentially impacting fields like graphic design, content creation, and even artistic expression. Plus, they're sharing their models and code to help others build on their work.
Abstract
Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based generation. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. We also propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.
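To make the multi-image scheme in the abstract more concrete, below is a minimal sketch in Python (PyTorch) of one plausible reading of the "index encoding and position encoding shift" idea: each reference image gets a learned index embedding saying which image a token came from, and its positional indices are shifted into a non-overlapping range so tokens from different images cannot be confused. This is not the authors' code; the class name, shapes, offsets, and hyperparameters here are illustrative assumptions only.

import torch
import torch.nn as nn

class MultiImageTokenizer(nn.Module):
    """Illustrative sketch: tag tokens by image index and shift their positions."""

    def __init__(self, patch_dim, hidden_dim, max_images=4, max_patches_per_image=1024):
        super().__init__()
        self.proj = nn.Linear(patch_dim, hidden_dim)            # patch features -> token embeddings
        self.index_emb = nn.Embedding(max_images, hidden_dim)   # learned "which image" index encoding
        self.max_patches = max_patches_per_image                # size of each image's positional range

    def forward(self, images_as_patches):
        """images_as_patches: list of tensors, each of shape (num_patches_i, patch_dim)."""
        tokens, positions = [], []
        for idx, patches in enumerate(images_as_patches):
            # Index encoding: add an embedding identifying which input image each token belongs to.
            image_ids = torch.full((patches.shape[0],), idx, dtype=torch.long)
            tok = self.proj(patches) + self.index_emb(image_ids)
            # Position encoding shift: give each image its own non-overlapping positional range,
            # so positions of different images never collide ("pixel confusion").
            pos = torch.arange(patches.shape[0]) + idx * self.max_patches
            tokens.append(tok)
            positions.append(pos)
        return torch.cat(tokens, dim=0), torch.cat(positions, dim=0)

The concatenated tokens and shifted positions would then be fed to the generation/editing backbone; how the positions are consumed (e.g., as rotary or absolute position encodings) is a design detail not specified here.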