ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang
2026-03-17
Summary
This paper introduces a new method called ViFeEdit for controlling and editing videos using diffusion transformers, a class of AI models that excels at generating images and videos.
What's the problem?
While diffusion transformers are great at creating images and videos, controlling *what* they create has proven much harder for videos than for images. This is because collecting large sets of example videos paired with instructions on how to change them is difficult, and training these models on video takes a lot of computing power.
What's the solution?
ViFeEdit solves this by cleverly changing how the diffusion transformer works internally. It separates how the model understands space (what things look like within each frame) from how it understands time (how things change between frames). This allows the model to be trained using only 2D images, while still producing videos that look good and stay consistent over time, with only a small number of extra parameters added to the model.
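The key intuition behind image-only training can be illustrated with a toy sketch (not the paper's implementation; function names and shapes are made up here). A spatial attention layer attends only within each frame, while full 3D attention mixes tokens across all frames. For a single image, i.e. a one-frame video, the two coincide, which is why spatial layers can be tuned with 2D image data alone:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the token axis.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def full_3d_attention(x):
    # x: (T, N, C) video tokens; attend jointly over all T*N tokens.
    T, N, C = x.shape
    flat = x.reshape(T * N, C)
    return attention(flat, flat, flat).reshape(T, N, C)

def spatial_attention(x):
    # Attend only within each frame: batched over T, tokens axis is N.
    return attention(x, x, x)

# A single 2D image is just a one-frame video: for T == 1 the two
# attention patterns produce identical outputs, so the spatial path
# can be adapted on images without ever seeing a video.
img = np.random.default_rng(0).normal(size=(1, 16, 8))
assert np.allclose(full_3d_attention(img), spatial_attention(img))
```

For T > 1 the two functions differ: `full_3d_attention` lets every token see every frame, while `spatial_attention` keeps frames independent, which is the separation the method exploits.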
Why it matters?
This is important because it means we can now edit and generate videos with more control without needing huge amounts of video data or massive computing resources. It opens the door to more accessible and efficient video editing and creation tools powered by AI.
Abstract
Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results on controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.
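The abstract does not detail the dual-path pipeline, but the idea of "separate timestep embeddings for noise scheduling" can be sketched with the standard sinusoidal embedding common to diffusion models (everything below is an illustrative assumption, not the paper's design): a conditioning path kept at a clean timestep and a generation path that follows the noise schedule, each receiving its own embedding.

```python
import numpy as np

def timestep_embedding(t, dim):
    # Standard sinusoidal timestep embedding used by diffusion models.
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.cos(ang), np.sin(ang)])

def dual_path_embeddings(t_gen, dim=8):
    # Hypothetical dual-path scheme: the condition branch stays clean
    # (t = 0) while the generation branch tracks the noise schedule,
    # so each path is modulated by its own timestep embedding.
    emb_cond = timestep_embedding(0.0, dim)   # condition path: no noise
    emb_gen = timestep_embedding(t_gen, dim)  # generation path
    return emb_cond, emb_gen

emb_cond, emb_gen = dual_path_embeddings(500.0)
```

Under this sketch, the two branches can be denoised on different schedules while sharing the same backbone, which is one plausible way a single model could adapt to diverse conditioning signals.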