Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang

2025-10-17

Summary

This paper introduces a new way to train an AI model that edits images from plain-language instructions, like 'make the sky bluer'. The key idea is teaching computers to edit pictures without needing a huge collection of before-and-after example pairs.

What's the problem?

Currently, teaching image editing models requires a massive amount of paired data – meaning you need tons of original images *and* the exact edited versions you want the model to learn from. Getting this data is really hard and time-consuming. Existing attempts to get around this use fake data created by other models, but this can just copy the flaws of those original models into the new one, making the edits worse.

What's the solution?

The researchers developed a new training method that needs no paired data at all. They directly train a few-step image generator called a diffusion model. Another AI, a vision-language model, checks whether each edit actually follows the instruction and doesn't disturb the parts of the image that *shouldn't* change. This 'feedback' gives the diffusion model a training signal for learning to edit correctly. They also add a distribution matching loss to make sure the edited images still look realistic and high-quality.
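The feedback loop can be illustrated with a deliberately tiny toy. Below, the 'editor' is just a per-pixel offset optimized by gradient ascent, and `vlm_reward` is a hand-written stand-in for the VLM judge: it rewards matching the instruction on a target region and penalizes changes elsewhere. Every name and the reward itself are illustrative assumptions, not the paper's actual components; this is a minimal sketch of reward-driven editing, not the real diffusion-model training:

```python
import numpy as np

def vlm_reward(edited, original, target_region, target_value):
    """Toy stand-in for a VLM score: rewards following the instruction
    on the target region and penalizes edits everywhere else."""
    follow = -np.mean((edited[target_region] - target_value) ** 2)
    preserve = -np.mean((edited[~target_region] - original[~target_region]) ** 2)
    return follow + preserve

def train_editor(original, target_region, target_value, steps=200, lr=0.5):
    """Toy 'editor': a per-pixel additive delta, optimized by gradient
    ascent on the (analytically differentiated) toy reward."""
    delta = np.zeros_like(original)
    for _ in range(steps):
        edited = original + delta
        grad = np.zeros_like(delta)
        # Gradient of the instruction-following term on the target region.
        grad[target_region] = -2 * (edited[target_region] - target_value) / target_region.sum()
        # Gradient of the preservation term everywhere else.
        grad[~target_region] = -2 * (edited[~target_region] - original[~target_region]) / (~target_region).sum()
        delta += lr * grad
    return original + delta
```

Running this on a 2x2 'image' with the top row marked as 'sky' moves the sky pixels toward the target value while leaving the other pixels untouched, which is exactly the follow-the-instruction-but-preserve-the-rest behavior the VLM feedback is meant to enforce.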

Why it matters?

This is important because it removes a major obstacle in image editing AI. Without the need for huge datasets, it becomes much easier and cheaper to create powerful image editing tools. The new method performs as well as, and sometimes even better than, existing methods that *do* rely on those large datasets, and it avoids copying errors from other models.

Abstract

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
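The distribution matching idea can be sketched in one dimension, where Gaussian score functions are known in closed form. This is a simplified illustration under assumed toy distributions (a linear 'generator' matched to N(2, 1)), not the paper's DMD setup, which uses pretrained diffusion models as score estimators. The generator's parameters are pushed along the difference between its own (fake) score and the real score, pulling its samples onto the target distribution:

```python
import numpy as np

def score_gaussian(x, mu, sigma):
    """Score function (gradient of the log density) of N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
a, b = 1.5, 0.0                  # toy generator: x = a*z + b, z ~ N(0, 1)
mu_real, sigma_real = 2.0, 1.0   # assumed "real" distribution N(2, 1)
lr = 0.1

for _ in range(500):
    z = rng.standard_normal(256)
    x = a * z + b
    # DMD-style gradient on samples: fake score minus real score.
    # The generator's own output distribution is N(b, a^2).
    g = score_gaussian(x, b, abs(a)) - score_gaussian(x, mu_real, sigma_real)
    # Chain rule through x = a*z + b to update the generator's parameters.
    a -= lr * np.mean(g * z)
    b -= lr * np.mean(g)
```

After training, (a, b) approaches (1, 2), i.e. the generator's outputs match the 'real' distribution; the same mechanism, with learned score networks in place of closed-form Gaussians, is what keeps generated edits on the image manifold.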