Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation

Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang

2025-07-09
Summary

This paper introduces Tora2, a new AI system that can generate videos containing multiple moving entities, such as people and objects, while customizing both how each one looks and how it moves along user-specified paths.

What's the problem?

Previous video generation models struggled to keep the appearance and motion of multiple entities consistent while also letting users accurately control each entity's movement path, which limited their ability to create realistic, detailed videos.

What's the solution?

The researchers extended the original Tora system with a decoupled extractor that captures each entity's appearance features separately, and a gated self-attention mechanism that fuses motion trajectories, text descriptions, and visual data more effectively. They also added a contrastive learning objective that keeps each entity's motion and appearance aligned during video generation.
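The ideas above can be illustrated with a minimal sketch. This is not the paper's actual implementation; the gate formulation, the lack of learned projections, and the InfoNCE-style form of the contrastive loss are all simplifying assumptions made here for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(tokens, gate):
    """Toy gated self-attention over concatenated motion/text/visual tokens.

    tokens: (n, d) array; gate: learned scalar controlling how much the
    attention output is blended back into the residual stream.
    (Simplified: no learned query/key/value projections.)
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)      # scaled dot-product scores
    attn_out = softmax(scores, axis=-1) @ tokens # attention-weighted mix
    return tokens + np.tanh(gate) * attn_out     # gated residual update

def contrastive_alignment_loss(motion_emb, appearance_emb, temperature=0.1):
    """InfoNCE-style loss: each entity's motion embedding should match its
    own appearance embedding and be pushed away from the other entities'.
    """
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    a = appearance_emb / np.linalg.norm(appearance_emb, axis=1, keepdims=True)
    logits = m @ a.T / temperature               # cosine similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(m))
    return -log_probs[idx, idx].mean()           # pull diagonal pairs together
```

Intuitively, the contrastive loss is smallest when each entity's motion and appearance embeddings line up on the diagonal of the similarity matrix, which is what keeps a given character's look attached to its own trajectory rather than someone else's.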

Why it matters?

This matters because Tora2 allows more precise and flexible control over video generation, making it possible to create realistic videos for applications like movies, games, and virtual reality with multiple customized characters or objects moving naturally.

Abstract

Tora2 enhances motion-guided video generation by introducing a decoupled personalization extractor, gated self-attention mechanism, and contrastive loss for improved multimodal conditioning and entity consistency.