
OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Pengze Zhang, Yanze Wu, Mengtian Li, Xu Bai, Songtao Zhao, Fulong Ye, Chong Mou, Xinghui Li, Zhuowei Chen, Qian He, Mingyuan Gao

2026-01-21

Summary

This paper introduces OmniTransfer, a single framework for transferring properties from one video to another, such as a subject's identity, the visual style, the motion, the camera movement, or video effects.

What's the problem?

Most existing video customization methods rely on reference images or on temporal priors built for one specific task, so each method handles only one kind of change well. Because they don't fully use the spatial and temporal information already present within videos, they struggle both to keep the appearance consistent across all frames and to control exactly *when* changes happen in the video.

What's the solution?

OmniTransfer solves this by looking at all parts of a video, across frames and over time, to understand both how things look and how they change. It combines three key designs. First, a task-aware positional bias decides how the reference video's information is used: to improve appearance consistency for looks-based tasks, or temporal alignment for timing-based ones. Second, reference-decoupled causal learning keeps the reference video and the video being generated in separate branches, which makes the transfer more precise and the model more efficient (a rough sketch of this idea follows below). Third, task-adaptive multimodal alignment uses semantic clues from multiple sources, such as text descriptions, to recognize which kind of transfer is being requested and adapt accordingly.
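To make the reference-decoupled idea concrete, here is a minimal sketch in PyTorch of what such an attention step could look like. The paper's code is not shown in this summary, so the function name, tensor shapes, and masking scheme below are illustrative assumptions: the reference video's keys and values are computed once and cached, target tokens may attend to them, but reference tokens never see the target.

```python
# Minimal sketch of reference-decoupled causal attention (assumptions, not the
# paper's actual implementation): reference keys/values are precomputed and
# cached, the target attends to [reference, target], and attention among
# target tokens stays causal so information flows only one way.
import torch
import torch.nn.functional as F

def reference_decoupled_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    # q_tgt/k_tgt/v_tgt: (B, H, Nt, D) target tokens;
    # k_ref/v_ref: (B, H, Nr, D), encoded once from the reference video.
    k = torch.cat([k_ref, k_tgt], dim=2)   # target reads reference + itself
    v = torch.cat([v_ref, v_tgt], dim=2)
    Nt, Nr = q_tgt.shape[2], k_ref.shape[2]
    # Boolean mask (True = may attend): reference tokens are always visible,
    # while target-to-target attention remains causal (lower-triangular).
    mask = torch.ones(Nt, Nr + Nt, dtype=torch.bool, device=q_tgt.device)
    mask[:, Nr:] = torch.tril(
        torch.ones(Nt, Nt, dtype=torch.bool, device=q_tgt.device))
    return F.scaled_dot_product_attention(q_tgt, k, v, attn_mask=mask)

# Toy usage: 16 reference tokens, 16 target tokens.
B, H, D, Nr, Nt = 1, 8, 64, 16, 16
q, kt, vt = (torch.randn(B, H, Nt, D) for _ in range(3))
kr, vr = (torch.randn(B, H, Nr, D) for _ in range(2))
out = reference_decoupled_attention(q, kt, vt, kr, vr)  # (1, 8, 16, 64)
```

Caching the reference branch this way is one plausible source of the efficiency gain the paper claims, since reference features would not need to be recomputed at every generation step.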

Why it matters?

This research matters because it makes video editing more flexible and realistic. For motion transfer, OmniTransfer matches methods that require extra guidance signals, such as pose skeletons tracking body movement, without needing that extra data. This opens the door to creating high-quality, customized videos more easily and with greater control over the final result, a significant step forward in video generation.

Abstract

Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.
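As a companion illustration, the sketch below shows one hypothetical way the "Task-aware Positional Bias" from the abstract could be realized as an additive attention bias that switches behavior by task. Every name, shape, and the distance-based scheme here are assumptions, not the paper's implementation.

```python
# Hypothetical task-aware positional bias: for temporal-transfer tasks
# (camera movement, effects) reference tokens near the same timestamp are
# favored; for appearance tasks (ID, style) the bias is flat, so every
# reference frame is equally visible. Purely illustrative.
import torch

def task_aware_bias(n_ref_frames, n_tgt_frames, tokens_per_frame, task):
    # Returns an additive attention bias of shape (Nt_tokens, Nr_tokens).
    ref_idx = torch.arange(n_ref_frames).repeat_interleave(tokens_per_frame)
    tgt_idx = torch.arange(n_tgt_frames).repeat_interleave(tokens_per_frame)
    if task == "temporal":
        # Logits decay with temporal distance: favor frame-to-frame alignment.
        dist = (tgt_idx[:, None] - ref_idx[None, :]).abs().float()
        return -dist
    # Appearance transfer: frame-agnostic, uniform visibility.
    return torch.zeros(tgt_idx.numel(), ref_idx.numel())
```

Such a bias would simply be added to the attention logits before the softmax, letting one shared backbone behave differently for appearance and temporal tasks.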