
InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization

Daniel Gilo, Or Litany

2025-11-24


Summary

This paper introduces a new method, InstructMix2Mix (I-Mix2Mix), for editing a sparse set of images of the same scene captured from multiple viewpoints, guided by a text instruction that describes the desired changes.

What's the problem?

Editing several images of a scene from different angles at the same time is difficult. Existing techniques, which rely on per-scene neural fields or temporal attention, often produce noticeable artifacts and inconsistencies: an edit that looks fine in one image does not line up with the same region seen from another viewpoint, so the result never appears natural and unified across all the views.

What's the solution?

I-Mix2Mix solves this by taking a powerful 2D image editing model (a diffusion model) and distilling its editing skills into a multi-view diffusion model that already understands how views of a 3D scene fit together. It's like teaching a 3D artist to paint with techniques borrowed from a 2D artist. The key ingredients are updating the multi-view student incrementally as the diffusion timesteps progress, carefully scheduling the noise fed to the 2D teacher so the edits don't degrade, and adjusting how the model's attention connects the different views so everything stays consistent; a simplified sketch of the distillation loop is shown below.
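The paper does not include code here, so the following minimal PyTorch sketch only illustrates the general idea under stated assumptions: a frozen 2D editing "teacher" supervises a multi-view "student" diffusion model with an SDS-style matching loss, and the student is updated incrementally as the timestep decreases. The TinyDenoiser network, the noise schedule, and the loss are toy placeholders, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of distilling a 2D editing teacher
# into a multi-view diffusion student. All modules and shapes are illustrative.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion denoiser that predicts the added noise."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.SiLU(),
            nn.Conv2d(16, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # The timestep t is ignored in this toy; real denoisers condition on it.
        return self.net(x)

teacher = TinyDenoiser().eval()           # frozen 2D instruction-editing model
student = TinyDenoiser()                  # multi-view model being updated
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

views = torch.randn(4, 3, 64, 64)         # a sparse set of input views (toy data)

T = 50
for t in reversed(range(1, T)):           # incremental student updates across timesteps
    sigma = t / T                         # toy stand-in for the teacher noise schedule
    noise = torch.randn_like(views)
    noisy = views + sigma * noise
    with torch.no_grad():
        target = teacher(noisy, t)        # teacher's per-view noise estimate
    pred = student(noisy, t)              # student denoises all views jointly
    loss = ((pred - target) ** 2).mean()  # SDS-style: push student toward teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```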

Why it matters?

This research matters because it enables more realistic and seamless editing of 3D scenes captured from multiple viewpoints. Imagine changing the color of a car in a series of photos taken from different angles and having the edit look natural from every one of them: that is the kind of improvement this method enables, with applications in areas like virtual reality, content creation, and robotics.

Abstract

We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
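As one concrete, purely illustrative way to picture a cross-view attention change like the one the abstract mentions, the sketch below folds the view dimension into the token dimension so a single self-attention call mixes information across viewpoints. The paper's actual attention modification may differ; the sizes and the module used here are assumptions.

```python
# Toy illustration (an assumption, not the paper's exact mechanism) of letting
# self-attention share information across the views of a scene.
import torch
import torch.nn as nn

views, tokens, dim = 4, 16, 32
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

x = torch.randn(views, tokens, dim)       # per-view token features

# Baseline: each view attends only to its own tokens.
per_view, _ = attn(x, x, x)

# Cross-view variant: fold the view axis into the token axis so a single
# attention call lets every token attend to tokens from all views.
joint = x.reshape(1, views * tokens, dim)
cross_view, _ = attn(joint, joint, joint)
cross_view = cross_view.reshape(views, tokens, dim)
```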