Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Zhenghao Zhang, Junchao Liao, Menghao Li, Long Qin, Weizhi Wang
2024-08-01

Summary
This paper introduces Tora, a video generation framework built around the paths, or 'trajectories', that objects follow. It combines text, images, and trajectory information within a Diffusion Transformer to create high-quality videos whose motion can be precisely controlled.
What's the problem?
Many existing video generation methods struggle to create videos that accurately reflect how objects move over time. They often produce short clips with limited detail and don't effectively capture the complex changes in motion that occur throughout a video. This makes it hard to generate realistic and coherent video content.
What's the solution?
To solve this problem, the authors developed Tora, which includes several key components: a Trajectory Extractor that identifies and encodes movement paths, a Spatial-Temporal Diffusion Transformer that generates the video content, and a Motion-guidance Fuser that ensures the generated videos follow the specified trajectories. By integrating these elements, Tora can produce videos that not only look good but also accurately depict how objects move in a natural way.
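For intuition, here is a minimal, hypothetical PyTorch sketch of the two trajectory-specific pieces: a small 3D-convolutional extractor that turns a dense trajectory (displacement) field into spacetime motion patches, and a fuser that injects those patches into a DiT block's video tokens through a zero-initialized gate. The module names, shapes, and gating scheme are illustrative assumptions based on the description above, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TrajectoryExtractor(nn.Module):
    """Hypothetical TE: compresses a dense trajectory/flow field into
    spacetime motion patches with a small 3D convolutional stack."""
    def __init__(self, in_ch=2, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(dim, dim, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, traj):
        # traj: (B, 2, T, H, W) per-pixel displacement field
        feat = self.net(traj)                    # (B, dim, T', H', W')
        return feat.flatten(2).transpose(1, 2)   # (B, N, dim) motion patches

class MotionGuidanceFuser(nn.Module):
    """Hypothetical MGF: adds gated motion features to the video tokens
    inside a DiT block; the gate starts at zero so training begins from
    the unconditioned model."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, video_tokens, motion_patches):
        return video_tokens + self.gate * self.proj(motion_patches)

# Toy shapes only; the real model operates on compressed video latents.
te, mgf = TrajectoryExtractor(), MotionGuidanceFuser()
traj = torch.randn(1, 2, 8, 64, 64)            # (B, 2, T, H, W) trajectory field
motion = te(traj)                               # (B, N, 64)
tokens = torch.randn(1, motion.shape[1], 64)    # stand-in for DiT video tokens
fused = mgf(tokens, motion)
print(fused.shape)
```

The zero-initialized gate is one common way to graft a new condition onto a pretrained backbone without disturbing its initial behavior; whether Tora uses exactly this mechanism is an assumption here.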
Why it matters?
This research is important because it enhances the ability to create realistic videos for various applications, such as film production, video games, and virtual reality. By allowing for precise control over motion in generated videos, Tora can help creators produce more engaging and lifelike content, making it a valuable tool in the field of video generation.
Abstract
Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world. The project page can be found at https://ali-videoai.github.io/tora_video.