MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

2024-07-10

Summary

This paper introduces MiraData, a large-scale video dataset of long videos paired with detailed, structured captions. It aims to help researchers build better video generation models, especially for high-motion videos like those produced by Sora.

What's the problem?

The main problem is that current publicly available video datasets are not suitable for generating high-quality, high-motion videos. Most existing datasets contain short clips with low motion and simple captions, which limits their usefulness for advanced video generation tasks that require more complex and dynamic content.

What's the solution?

To solve this issue, the authors created MiraData, a dataset of long-duration videos that are rich in detail and motion. They curated videos from diverse, manually selected sources, processed them into semantically consistent clips, and used a model called GPT-4V to generate structured captions that describe each clip from four different perspectives, plus a summarized dense caption. They also developed a new evaluation benchmark called MiraBench, with 150 prompts and 17 metrics covering factors like motion strength, temporal consistency, and 3D consistency. This allows researchers to better assess how well their models handle long, high-motion videos.
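
To make the captioning step concrete, here is a minimal sketch of how one might prompt GPT-4V for a structured caption of a video clip using the OpenAI Python client. The perspective names, prompt wording, frame-sampling scheme, and model name are illustrative assumptions, not MiraData's exact pipeline.

```python
# Minimal sketch: structured captioning of a video clip with GPT-4V.
# Assumptions (not from the paper): the four perspective names, the prompt
# wording, the frame-sampling scheme, and the model name are illustrative.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

# Hypothetical perspective names for illustration only.
PERSPECTIVES = ["main subject", "background", "camera movement", "style"]


def sample_frames(video_path: str, num_frames: int = 4) -> list[str]:
    """Uniformly sample frames and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            _, buf = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buf).decode())
    cap.release()
    return frames


def structured_caption(video_path: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "Describe this video clip from each of these perspectives: "
        + ", ".join(PERSPECTIVES)
        + ". Then write one dense caption summarizing everything."
    )
    content = [{"type": "text", "text": prompt}]
    for b64 in sample_frames(video_path):
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # any GPT-4V-capable model; swap in your own
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


print(structured_caption("clip.mp4"))
```

In practice a pipeline like this would run over every clip and parse the response back into per-perspective fields, but the core idea is simply: sample a few frames, attach them as images, and ask for one description per perspective plus a dense summary.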

Why it matters?

This research is important because it provides a high-quality resource for developing and testing video generation models. By offering longer videos with detailed descriptions, MiraData enables advancements in creating more realistic and engaging video content. This can benefit various fields such as entertainment, education, and virtual reality by improving how machines understand and generate video.

Abstract

Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.
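
The abstract notes that MiraBench's motion strength metric is tracking-based. As a rough stand-in for readers who want to experiment, the sketch below estimates motion intensity from dense optical flow with OpenCV instead; the frame stride and the use of mean flow magnitude are assumptions, not the benchmark's actual formula.

```python
# Minimal sketch of a motion-strength score for a video clip.
# MiraBench uses tracking-based metrics; this substitutes dense optical
# flow (Farneback) as a simple proxy. The frame stride and the mean-flow-
# magnitude score are assumptions, not the benchmark's formula.
import cv2  # pip install opencv-python
import numpy as np


def motion_strength(video_path: str, stride: int = 4) -> float:
    """Average optical-flow magnitude (pixels/frame) over sampled frame pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"cannot read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % stride:
            continue  # skip frames to keep the estimate cheap
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # flow has shape (H, W, 2): per-pixel (dx, dy) displacement.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        magnitudes.append(np.linalg.norm(flow, axis=2).mean())
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0


print(f"motion strength: {motion_strength('clip.mp4'):.2f}")
```

A score like this lets you compare clips or datasets by average motion: higher values indicate larger per-frame pixel displacement, which is the kind of high-motion content the paper argues existing datasets lack.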