InstructX: Towards Unified Visual Editing with MLLM Guidance

Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, Qian He

2025-10-10

Summary

This paper explores how to better use powerful AI models that understand both images and text, called Multimodal Large Language Models (MLLMs), to improve how we edit images and videos using diffusion models, which are known for creating realistic visuals.

What's the problem?

While these combined AI systems are getting better, researchers haven't fully understood *which* parts of the MLLM design are most important for good editing results. Also, it's still really hard to edit videos effectively with these systems; most work focuses on images. Getting image and video editing to work well *together* in one model is a big challenge, and video data for training is limited.

What's the solution?

The researchers created a system called InstructX, which is designed to handle both image and video editing from text instructions. They found that training the system on a large amount of image data surprisingly gives it the ability to edit videos too, even without any explicit video-editing supervision. They also feed modality-specific MLLM features into the model, so a single system can handle the distinct needs of images and videos and perform both tasks effectively.
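To make the "modality-specific features" idea concrete, here is a minimal sketch of one plausible routing scheme: the MLLM produces a shared feature stream plus image-specific and video-specific streams, and the editor conditions on the shared stream combined with whichever modality-specific stream matches the input. All names, shapes, and the concatenation-based fusion below are illustrative assumptions, not the paper's actual implementation.

```python
# A hedged sketch (assumption, not the paper's code) of routing
# modality-specific MLLM features into a shared diffusion conditioner.

def build_condition(mllm_features, modality):
    """Combine shared and modality-specific feature vectors.

    mllm_features: dict with 'shared', 'image', and 'video' vectors
                   (toy stand-ins for MLLM hidden states).
    modality: 'image' or 'video' -- selects which extra stream to use.
    """
    if modality not in ("image", "video"):
        raise ValueError(f"unknown modality: {modality}")
    # Concatenation is one simple fusion choice; the real mechanism
    # in InstructX is not specified here.
    return mllm_features["shared"] + mllm_features[modality]

feats = {
    "shared": [0.1, 0.2],   # toy embedding shared by both modalities
    "image":  [0.3, 0.4],   # image-specific MLLM features
    "video":  [0.5, 0.6],   # video-specific MLLM features
}

cond_image = build_condition(feats, "image")  # [0.1, 0.2, 0.3, 0.4]
cond_video = build_condition(feats, "video")  # [0.1, 0.2, 0.5, 0.6]
```

Because both conditions share the same first components, knowledge learned from image editing can transfer to video inputs, which is consistent with the emergent video-editing behavior the authors report.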

Why it matters?

This work is important because it provides a better understanding of how to build AI systems that can edit images and videos based on what you tell them to do. It also shows a clever way to overcome the problem of limited video data by leveraging knowledge gained from images, leading to state-of-the-art editing performance.

Abstract

With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.