T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Jiachen Li, Qian Long, Jian Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, William Yang Wang

2024-10-10

Summary

This paper introduces T2V-Turbo-v2, a method for improving text-to-video (T2V) models during the post-training phase through better use of training data, reward feedback, and conditional guidance.

What's the problem?

Text-to-video generation models often struggle to produce high-quality videos that align well with their text prompts, especially when it comes to maintaining visual quality and motion consistency over longer videos. Traditional post-training methods may not make effective use of the data or feedback available to them.

What's the solution?

The authors introduce T2V-Turbo-v2, which distills a strong consistency model from a pre-trained T2V model while integrating three supervision signals into the distillation process: high-quality training data, feedback from reward models, and conditional guidance. By extracting motion guidance from the training datasets and incorporating it into the teacher's ODE solver, the model learns to produce smoother and more coherent motion. The approach shows significant improvements in both visual quality and alignment with text descriptions.
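To make the idea concrete, here is a minimal, heavily simplified sketch of reward-augmented consistency distillation in one dimension. All names (`teacher_ode_step`, `consistency_fn`, the linear student, the choice of drift, the reward weight `lam`) are illustrative assumptions, not the paper's actual architecture or loss: the student's prediction at a noisy point is matched against its own prediction at the teacher solver's next point, and a reward term is subtracted to favor high-reward outputs.

```python
import numpy as np

def teacher_ode_step(x_t, t, dt, score_fn):
    # One Euler step of a toy probability-flow ODE.
    # (Assumption: simplified VP-style drift, not the paper's solver.)
    return x_t - 0.5 * t * score_fn(x_t, t) * dt

def consistency_fn(theta, x_t, t):
    # Toy linear "student" mapping a noisy sample back toward x_0.
    w, b = theta
    return w * x_t / (1.0 + t) + b

def distill_loss(theta, x_t, t, dt, score_fn, reward_fn, lam=0.1):
    # Consistency distillation: the student's output at (x_t, t) should
    # match its output at the teacher's next ODE point (x_prev, t - dt).
    x_prev = teacher_ode_step(x_t, t, dt, score_fn)
    pred_now = consistency_fn(theta, x_t, t)
    pred_prev = consistency_fn(theta, x_prev, t - dt)
    consistency = (pred_now - pred_prev) ** 2
    # Reward feedback: subtracting the reward pushes the student's
    # prediction toward samples the reward model scores highly.
    return consistency - lam * reward_fn(pred_now)

# Hypothetical score and reward functions for illustration only.
score_fn = lambda x, t: -x
reward_fn = lambda x: -x ** 2
loss = distill_loss((1.0, 0.0), x_t=0.5, t=1.0, dt=0.1,
                    score_fn=score_fn, reward_fn=reward_fn)
```

In practice the student and teacher are large video diffusion networks, the reward comes from learned image/video reward models, and the loss is minimized by gradient descent over the student's parameters; this toy version only shows how the three supervision signals combine in a single objective.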

Why it matters?

This research is important because it sets a new standard for how text-to-video models can be improved after their initial training. By enhancing the ability of these models to generate high-quality videos that accurately reflect text prompts, T2V-Turbo-v2 can lead to better applications in areas like entertainment, education, and content creation. The findings also provide valuable insights for future research in video generation technology.

Abstract

In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals, including high-quality training data, reward model feedback, and conditional guidance, into the consistency distillation process. Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.
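The abstract's "energy function to augment the teacher ODE solver" can be sketched as guidance that shifts the solver's drift down the gradient of an energy. The example below is a one-dimensional illustration under stated assumptions: the quadratic `energy` (pulling toward a hypothetical motion target), the finite-difference gradient, and the `guidance_scale` weighting are all placeholders, not the paper's actual motion-guidance design.

```python
import numpy as np

def energy(x):
    # Hypothetical energy: low when x matches a target motion statistic.
    target = 1.0
    return 0.5 * (x - target) ** 2

def energy_grad(x, eps=1e-4):
    # Numerical gradient of the energy via central finite differences.
    return (energy(x + eps) - energy(x - eps)) / (2 * eps)

def guided_ode_step(x_t, t, dt, score_fn, guidance_scale=1.0):
    # Teacher ODE Euler step augmented with an energy-gradient term:
    # the drift is shifted so the trajectory also descends the energy,
    # in the spirit of conditional (e.g., motion) guidance.
    drift = -0.5 * t * score_fn(x_t, t)
    guided_drift = drift - guidance_scale * energy_grad(x_t)
    return x_t + guided_drift * dt
```

With the score term switched off, repeated guided steps simply descend the energy toward its minimum, which is the intuition: the guidance biases the sampling trajectory toward states (here, motion patterns) that the energy function favors.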