Learning Action and Reasoning-Centric Image Editing from Videos and Simulations
Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, Siva Reddy
2024-07-09

Summary
This paper introduces AURORA, an image editing model that can perform complex edits requiring actions and reasoning. It improves how models understand and edit images by training on high-quality data curated from videos and simulations.
What's the problem?
The main problem is that existing image editing models struggle with tasks that involve actions or reasoning, such as moving objects or changing their attributes in a meaningful way. Most current models handle simple edits well but cannot handle dynamic changes, because they are trained on static images that do not capture the complexities of real-world actions.
What's the solution?
To address this, the authors created the AURORA dataset, a carefully curated collection of training data sourced from videos and simulations. The dataset consists of triplets: a source image, a prompt describing a change, and a target image reflecting that change. By focusing on minimal yet meaningful edits, a model trained on AURORA learns to perform more sophisticated image edits. The authors also developed a new benchmark, AURORA-Bench, to evaluate performance across various editing tasks; the AURORA model significantly outperformed previous models in human evaluations.
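The triplet structure described above can be sketched as a simple record. This is a minimal illustration only; the field names and file paths are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of one AURORA-style training triplet.
# Field names and paths are illustrative, not the released format.
@dataclass
class EditTriplet:
    source_image: str   # path to the "before" image
    prompt: str         # instruction describing one minimal, meaningful change
    target_image: str   # path to the "after" image

# Example entry: a single action-centric edit between two video frames.
example = EditTriplet(
    source_image="frames/kitchen_0001.png",
    prompt="move the cup from the table to the shelf",
    target_image="frames/kitchen_0002.png",
)
```

The key property the paper emphasizes is that the prompt describes exactly one visual change, so the source and target images differ only minimally.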
Why it matters?
This research is important because it advances the capabilities of image editing technology, allowing for more realistic and context-aware edits. By improving how AI understands and processes visual information related to actions and reasoning, this work can enhance applications in fields like film production, video games, and graphic design, ultimately making digital content creation more intuitive and powerful.
Abstract
An image editing model should be able to perform diverse edits, ranging from object replacement, changing attributes or style, to performing actions or movement, which require many forms of reasoning. Current general instruction-guided editing models have significant shortcomings with action and reasoning-centric edits. Object, attribute or stylistic changes can be learned from visually static datasets. On the other hand, high-quality data for action and reasoning-centric edits is scarce and has to come from entirely different sources that cover e.g. physical dynamics, temporality and spatial reasoning. To this end, we meticulously curate the AURORA Dataset (Action-Reasoning-Object-Attribute), a collection of high-quality training data, human-annotated and curated from videos and simulation engines. We focus on a key aspect of quality training data: triplets (source image, prompt, target image) contain a single meaningful visual change described by the prompt, i.e., truly minimal changes between source and target images. To demonstrate the value of our dataset, we evaluate an AURORA-finetuned model on a new expert-curated benchmark (AURORA-Bench) covering 8 diverse editing tasks. Our model significantly outperforms previous editing models as judged by human raters. For automatic evaluations, we find important flaws in previous metrics and caution their use for semantically hard editing tasks. Instead, we propose a new automatic metric that focuses on discriminative understanding. We hope that our efforts: (1) curating a quality training dataset and an evaluation benchmark, (2) developing critical evaluations, and (3) releasing a state-of-the-art model, will fuel further progress on general image editing.