SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang

2026-03-20

Summary

This paper introduces a new method, SAMA, for editing videos based on text instructions, aiming to make the edits more accurate and keep the video's motion looking natural.

What's the problem?

Current instruction-following video editing models struggle to get both the *content* of an edit right and keep the video's motion looking realistic. To compensate, they typically rely on outside information (such as vision-language-model features or structural conditions), and this reliance limits how well they handle new kinds of videos or instructions: they aren't very flexible or adaptable.

What's the solution?

SAMA tackles this by breaking video editing into two parts: understanding *what* to change (semantic anchoring) and preserving *how* things move (motion alignment). It first learns to pick out a few sparse "anchor" frames and predict what they should look like after the edit, guided only by the instruction. Separately, the same model is pre-trained to understand how videos naturally flow, using restoration tasks such as filling in missing chunks, undoing speed changes, and re-ordering shuffled regions. Finally, the two parts are fine-tuned together on specific editing examples, but importantly, most of the learning happens *before* the model ever sees those examples.
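To make the motion-restoration pretext tasks concrete, here is a minimal numpy sketch of the three corruptions the paper names (cube inpainting, speed perturbation, tube shuffle). The function names, cube/tube sizes, and the toy clip are assumptions for illustration; the real method applies these ideas inside a video generation backbone, not on raw pixel arrays like this.

```python
import numpy as np

rng = np.random.default_rng(0)

def cube_inpainting(video, cube=(4, 16, 16)):
    # Zero out a random spatio-temporal cube; the model must restore it.
    v = video.copy()
    T, H, W, _ = v.shape
    ct, ch, cw = cube
    t0 = rng.integers(0, T - ct + 1)
    h0 = rng.integers(0, H - ch + 1)
    w0 = rng.integers(0, W - cw + 1)
    v[t0:t0 + ct, h0:h0 + ch, w0:w0 + cw] = 0.0
    return v

def speed_perturbation(video, factor=2):
    # Drop frames to simulate a speed change the model must undo.
    return video[::factor]

def tube_shuffle(video, tube=(16, 16)):
    # Shuffle the temporal order of one spatial tube across all frames.
    v = video.copy()
    T, H, W, _ = v.shape
    th, tw = tube
    h0 = rng.integers(0, H - th + 1)
    w0 = rng.integers(0, W - tw + 1)
    perm = rng.permutation(T)
    v[:, h0:h0 + th, w0:w0 + tw] = v[perm, h0:h0 + th, w0:w0 + tw]
    return v

# Apply each corruption to a toy 8-frame clip.
clip = rng.random((8, 32, 32, 3)).astype(np.float32)
corrupted = [f(clip) for f in (cube_inpainting, speed_perturbation, tube_shuffle)]
print([c.shape for c in corrupted])  # [(8, 32, 32, 3), (4, 32, 32, 3), (8, 32, 32, 3)]
```

Each corrupted clip becomes an input whose restoration target is the original clip, so the model can learn temporal dynamics from raw videos without any editing labels.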

Why it matters?

This research is important because it creates a more robust and versatile video editing system. By learning the fundamentals of motion and meaning separately, SAMA doesn't need external guidance and can edit videos more effectively, even ones it hasn't seen before. It outperforms other open-source models and is competitive with leading commercial video editing systems.

Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
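The abstract's Semantic Anchoring operates on sparse anchor frames rather than every frame. As a toy illustration of one plausible selection policy, here is a uniform-spacing sketch; the function name and the uniform policy are assumptions, since the paper does not specify how anchors are chosen here.

```python
import numpy as np

def select_anchor_frames(num_frames, num_anchors=4):
    # Uniformly spaced sparse anchor indices (a simple placeholder policy).
    return np.linspace(0, num_frames - 1, num_anchors).round().astype(int)

print(select_anchor_frames(33, 4))  # [ 0 11 21 32]
```

In SAMA, semantic tokens and video latents would be jointly predicted only at such sparse anchors, giving the model an instruction-aware structural plan to fill in between.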