LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang

2024-12-20

Summary

This paper introduces LeviTor, a new method for creating videos from images that lets users control the movement of objects in 3D space. It extends the familiar drag-based interface with a depth dimension, making it easier to produce realistic animations.

What's the problem?

Current methods for generating videos from images typically let users drag objects only within the 2D image plane, which is ambiguous for movements toward or away from the camera. This limitation makes it hard to accurately control how objects should move through a realistic 3D environment.

What's the solution?

LeviTor solves this problem by letting users assign a relative depth to each point they drag, enabling precise control over object trajectories in 3D space. The method abstracts each object's mask into a few representative cluster points and combines them with depth and instance information, which are fed into a video diffusion model as the control signal. This allows the system to produce realistic object movements when generating videos from static images.
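The mask-abstraction step described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the function name, the plain k-means routine, and all parameters are assumptions; the idea is simply to reduce an object mask to a handful of cluster points, each paired with a relative depth.

```python
import numpy as np

def mask_to_control_points(mask, depth, n_points=4, iters=20, seed=0):
    """Abstract a binary object mask into a few cluster points with depth.

    Illustrative sketch only: runs plain k-means over the mask's pixel
    coordinates, then attaches the mean depth of each cluster to its
    center. The real method's clustering details may differ.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)                     # pixel coords inside the mask
    pts = np.stack([xs, ys], axis=1).astype(float)

    # Initialize centers from random mask pixels, then iterate k-means.
    centers = pts[rng.choice(len(pts), size=n_points, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(pts[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_points):
            if np.any(labels == k):
                centers[k] = pts[labels == k].mean(axis=0)

    # Recompute assignments for the final centers, then attach the mean
    # relative depth of each cluster to its center point.
    labels = np.linalg.norm(pts[:, None] - centers[None], axis=2).argmin(axis=1)
    depths = np.array([
        depth[ys[labels == k], xs[labels == k]].mean() for k in range(n_points)
    ])
    return centers, depths
```

For example, a square mask paired with a depth map that increases left to right yields four (x, y) cluster centers inside the square, each carrying the average depth of its region; dragging these few points (and editing their depths) is the kind of compact control signal the summary describes.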

Why it matters?

This research matters because it makes animating and manipulating objects in videos more intuitive and accessible. By advancing image-to-video synthesis with 3D trajectory control, LeviTor opens up new creative possibilities in fields like filmmaking, gaming, and virtual reality.

Abstract

The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images. Project page: https://ppetrichor.github.io/levitor.github.io/