Visual Autoregressive Modeling for Instruction-Guided Image Editing
Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
2025-08-22
Summary
This paper introduces a new approach to editing images called VAREdit, which aims to make edits more accurate and faster than current methods that rely on diffusion models.
What's the problem?
Existing image editing techniques built on diffusion models often change parts of the image you *don't* want changed. This happens because their global denoising process looks at the whole image at once when making an edit, causing unintended side effects and making it hard to follow specific instructions. It's like trying to fix a small spot on a painting but accidentally smudging the surrounding colors.
What's the solution?
VAREdit takes a different approach: it builds the edited image step by step, like writing a sentence word by word. It uses a 'visual autoregressive' method that predicts the image scale by scale, from a coarse layout up to fine detail. The key innovation is a 'Scale-Aligned Reference' module, which gives the model a version of the original image matched to the level of detail it is currently predicting, so coarse stages are not guided only by the source's finest details. This keeps the edits precise and consistent with the instructions; a rough sketch of the idea follows below.
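To make the coarse-to-fine idea concrete, here is a minimal PyTorch sketch of next-scale prediction conditioned on source features and an instruction embedding. The function names (`predict_scale`, `edit`), the scale schedule, and the placeholder mixing rule are illustrative assumptions, not the authors' implementation, which uses a transformer over quantized multi-scale tokens.

```python
# Minimal sketch (not the authors' code) of coarse-to-fine "next-scale"
# prediction for editing. All names and the mixing rule are illustrative.
import torch
import torch.nn.functional as F

def predict_scale(coarser_feats, source_feats, instruction_emb, size):
    # Hypothetical one-step predictor: upsample the coarser prediction and
    # blend it with a resized view of the source plus the instruction signal.
    up = F.interpolate(coarser_feats, size=size, mode="bilinear", align_corners=False)
    src = F.interpolate(source_feats, size=size, mode="bilinear", align_corners=False)
    return up + 0.1 * src + instruction_emb.view(1, -1, 1, 1)

def edit(source_feats, instruction_emb, scales=((1, 1), (4, 4), (16, 16), (32, 32))):
    B, C = source_feats.shape[:2]
    target = torch.zeros(B, C, *scales[0])   # start from the coarsest scale
    for size in scales:                       # coarse -> fine, one scale at a time
        target = predict_scale(target, source_feats, instruction_emb, size)
    return target

# Toy usage with random tensors standing in for tokenizer features.
src = torch.randn(1, 8, 32, 32)
txt = torch.randn(8)
out = edit(src, txt)
print(out.shape)  # torch.Size([1, 8, 32, 32])
```

The point of the loop is that each scale only sees the coarser prediction plus the conditioning, mirroring the causal, step-by-step generation described above.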
Why it matters?
VAREdit matters because it follows editing instructions more faithfully and runs considerably faster than existing methods. It scores 30% or more higher on editing quality (the GPT-Balance metric) and completes a 512×512 edit in about 1.2 seconds, roughly 2.2× faster than the similarly sized UltraEdit. More accurate and efficient image editing is valuable for many applications, including graphic design, photo manipulation, and digital art.
Abstract
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by a 30%+ higher GPT-Balance score. Moreover, it completes a 512×512 editing in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
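Below is a minimal sketch of how the Scale-Aligned Reference (SAR) idea could be wired up in PyTorch: the finest-scale source features are pooled down to the scale currently being predicted and attended to alongside the target tokens. The class name, the pooling choice, and the decision to inject the reference as extra key/value tokens are assumptions for illustration; the abstract only specifies that scale-matched conditioning is injected into the first self-attention layer.

```python
# Sketch (assumptions, not the released model) of scale-aligned conditioning:
# pool finest-scale source features down to the target scale and let the
# target tokens attend over them in an attention layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAlignedReference(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_tokens, source_feats, scale_hw):
        # source_feats: (B, C, H, W) finest-scale features of the source image.
        ref = F.adaptive_avg_pool2d(source_feats, scale_hw)   # pool to target scale
        ref = ref.flatten(2).transpose(1, 2)                  # (B, h*w, C) tokens
        # Target tokens attend jointly over themselves and the
        # scale-matched reference tokens (conditioning injection).
        kv = torch.cat([ref, target_tokens], dim=1)
        out, _ = self.attn(target_tokens, kv, kv)
        return target_tokens + out

# Toy usage: an 8x8 target scale conditioned on 32x32 source features.
sar = ScaleAlignedReference(dim=16)
tgt = torch.randn(2, 8 * 8, 16)
src = torch.randn(2, 16, 32, 32)
print(sar(tgt, src, (8, 8)).shape)  # torch.Size([2, 64, 16])
```

Prepending the pooled reference as extra keys and values is just one way to realize "injecting scale-matched conditioning into a self-attention layer"; the released code may implement this differently.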