
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu

2025-12-11


Summary

This paper introduces a new dataset and method for editing images to change what a subject is *doing* in the picture, while preserving who they are and keeping the changes realistic.

What's the problem?

Image editing today focuses on changing how things *look* – color, style, or objects. There aren't good datasets or tools for changing the *action* happening in an image, like making someone jump instead of walk. Existing attempts either produce unrealistic results or lack enough examples to train powerful editing models, so even state-of-the-art editors struggle to change motion accurately.

What's the solution?

The researchers created a dataset called MotionEdit, filled with pairs of images showing realistic motion changes extracted and verified from videos. They also built a benchmark, MotionEdit-Bench, to test how well editing models handle these motion-focused edits. To improve editing itself, they developed a fine-tuning technique called MotionNFT, which rewards a model when the motion flow between its edited output and the input image matches the ground-truth motion, guiding it toward realistic action changes without disturbing other parts of the image.
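To make the reward idea concrete, here is a minimal sketch of a motion-alignment score. It is not the paper's implementation: the function name `motion_alignment_reward` and the use of per-pixel cosine similarity between flow fields are illustrative assumptions; in practice the flows would come from an optical-flow estimator run on the image pairs.

```python
import numpy as np

def motion_alignment_reward(flow_edit: np.ndarray, flow_gt: np.ndarray) -> float:
    """Score how well the motion flow of a model-edited image matches the
    ground-truth motion. Both flows are (H, W, 2) arrays of per-pixel
    displacement vectors (hypothetically produced by an optical-flow
    estimator on the input/edited and input/ground-truth image pairs)."""
    # Flatten to (H*W, 2) so each row is one pixel's motion vector.
    a = flow_edit.reshape(-1, 2)
    b = flow_gt.reshape(-1, 2)
    # Per-pixel cosine similarity; the epsilon guards zero-motion pixels.
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    cos = num / den
    # Map the mean similarity from [-1, 1] to a reward in [0, 1].
    return float((cos.mean() + 1.0) / 2.0)

# A flow identical to the ground truth scores (essentially) the maximum 1.0.
gt = np.ones((4, 4, 2))
print(motion_alignment_reward(gt, gt))
```

A reward like this could then be plugged into a preference- or reward-based fine-tuning loop, penalizing edits whose motion diverges from the target while leaving appearance-preserving behavior to the base model.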

Why it matters?

This work is important because it opens the door to more advanced video editing and animation. Imagine being able to easily change what someone is doing in a video just by telling the computer, or creating realistic animations with more control. It pushes the field of image editing beyond just appearance and towards understanding and manipulating actions.

Abstract

We introduce MotionEdit, a novel dataset for motion-centric image editing – the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.