TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang

2024-06-14

Summary

This paper introduces TC-Bench, a new benchmark designed to evaluate how well video generation models create videos that make sense over time. It focuses on whether these models can transition smoothly between different visual elements and concepts, the way real-world videos do.

What's the problem?

Video generation is harder than image generation because it involves producing a sequence of frames that must stay consistent and coherent over time. Many existing models struggle with this, often failing to accurately represent how objects and their relationships change as time progresses. The result is videos that look disjointed or simply don't make sense.

What's the solution?

To address these issues, the authors developed TC-Bench, which includes carefully crafted text prompts describing the beginning and end states of a scene. These prompts guide the video generation process and reduce ambiguity about what the video should depict. The researchers also created new evaluation metrics that measure how completely the generated videos carry out transitions between different elements. Their findings showed that most current video generators achieve fewer than 20% of the expected compositional changes, indicating significant room for improvement.
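The paper's exact metric definitions aren't reproduced in this summary, but the core idea of scoring transition completion can be sketched. Below is a minimal, illustrative Python sketch (not TC-Bench's official metric): it uses CLIP to check that a generated video's early frames match the prompt's initial state and its late frames match the final state. The checkpoint name, the half/half frame split, and the helper functions clip_similarity and transition_score are all assumptions made for illustration.

```python
# Illustrative sketch only -- NOT TC-Bench's official metric.
# Idea: a video completes a transition if its early frames match the
# prompt's initial state and its late frames match the final state.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(frames: list[Image.Image], text: str) -> torch.Tensor:
    # Encode frames and text, then return one cosine similarity per frame.
    inputs = processor(text=[text], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1)

def transition_score(frames: list[Image.Image],
                     initial_state: str, final_state: str) -> float:
    # Heuristic: the first half of the video should depict the initial
    # state, the second half the final state described by the prompt.
    mid = len(frames) // 2
    start = clip_similarity(frames[:mid], initial_state).mean()
    end = clip_similarity(frames[mid:], final_state).mean()
    return float((start + end) / 2)

# Hypothetical usage, with frames extracted from a generated video:
# score = transition_score(frames, "a green apple on a table",
#                          "a red apple on a table")
```

TC-Bench's actual metrics are more robust than this heuristic and, per the paper, correlate much better with human judgments; the sketch simply shows why prompts that pin down start and end states make transition completion measurable at all.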

Why it matters?

This research is important because it provides a standardized way to test and improve video generation models, focusing on their ability to handle temporal compositionality. By highlighting the challenges these models face, TC-Bench encourages further development in creating more realistic and coherent video content. This can lead to better applications in entertainment, education, and other fields where video content is essential.

Abstract

Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.