DynVFX: Augmenting Real Videos with Dynamic Content

Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel

2025-02-07

Summary

This paper introduces DynVFX, an AI system that adds realistic visual effects to videos from simple text instructions. It generates dynamic objects or scene changes that blend naturally with the original footage.

What's the problem?

Adding special effects to videos often requires expensive software and professional skills. Current methods for automated video editing struggle to make effects look realistic, especially when objects need to move or interact with the scene over time.

What's the solution?

The researchers developed DynVFX, which combines a pre-trained text-to-video diffusion transformer with a pre-trained Vision Language Model (VLM) to understand both the input video and the text instruction. The VLM envisions the augmented scene in detail, and the diffusion model then synthesizes the new content so that it follows the camera motion and interacts naturally with other objects in the scene. The whole process is zero-shot and fully automated, requiring no extra training.
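
Below is a minimal, hypothetical Python sketch of that pipeline. The object and method names (vlm.describe_augmented_scene, t2v_model.invert, t2v_model.sample) are illustrative placeholders rather than the paper's actual code; they only mirror the flow described above.

```python
# Hypothetical sketch of a DynVFX-style augmentation pipeline.
# All model objects and method names are assumed placeholders, not the paper's API.

def augment_video(input_video, user_instruction, vlm, t2v_model):
    """Augment a real video with new dynamic content from a text instruction."""
    # 1. A pre-trained Vision Language Model inspects the video and the user's
    #    instruction and "envisions" the augmented scene as a detailed prompt.
    detailed_prompt = vlm.describe_augmented_scene(input_video, user_instruction)

    # 2. The original footage is inverted into the diffusion model's latent space
    #    so the new content can be anchored to the existing scene.
    original_latents = t2v_model.invert(input_video)

    # 3. A pre-trained text-to-video diffusion transformer synthesizes the new
    #    content; attention-level feature manipulation (see the abstract below)
    #    keeps it aligned with camera motion, occlusions, and moving objects.
    augmented_video = t2v_model.sample(
        prompt=detailed_prompt,
        anchor_latents=original_latents,
    )
    return augmented_video
```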

Why it matters?

This research is important because it makes high-quality video editing accessible to everyone. With just a text description, users can create professional-looking effects that seamlessly integrate into their videos. This could revolutionize fields like filmmaking, advertising, and social media content creation by simplifying the process of adding dynamic visual elements.

Abstract

We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
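
To make the attention-feature manipulation mentioned in the abstract more concrete, here is a toy PyTorch sketch of one way such an injection could work: keys and values cached from the original video are attended to alongside the generated ones, and a mask controls where new content may appear. This is an assumption-laden illustration of the general idea, not the authors' actual implementation, and every name in it is hypothetical.

```python
import torch
import torch.nn.functional as F

def augmented_attention(q_gen, k_gen, v_gen, k_orig, v_orig, edit_mask, scale):
    """
    Toy attention-feature injection, loosely in the spirit of the abstract:
    original-video keys/values are concatenated with the generated ones so the
    output stays anchored to the source footage outside the edited region.

    q_gen, k_gen, v_gen: (tokens, dim) features of the current generation pass
    k_orig, v_orig:      (tokens, dim) features cached from the original video
    edit_mask:           (tokens,) 1 where new content may appear, 0 elsewhere
    """
    # Attend jointly over generated and original tokens.
    k = torch.cat([k_gen, k_orig], dim=0)
    v = torch.cat([v_gen, v_orig], dim=0)
    attn = (q_gen @ k.T) * scale

    # Inside the edit region, mask out the original-video tokens so the new
    # content is free to appear there; elsewhere they remain available and
    # keep the output faithful to the source video.
    bias = torch.zeros(q_gen.shape[0], k.shape[0])
    bias[edit_mask.bool(), k_gen.shape[0]:] = float("-inf")
    attn = attn + bias

    return F.softmax(attn, dim=-1) @ v
```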