VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
Wenhao Wang, Yi Yang
2025-03-04
Summary
This paper introduces VideoUFO, a new dataset of over a million video clips designed to help AI models create videos from text descriptions that better match what users actually want to see.
What's the problem?
Current AI models that turn text into videos often don't meet users' expectations because they haven't been trained on the kinds of videos people really want to create. This leads to disappointing results when people try to use these AI tools in real-world situations.
What's the solution?
The researchers created VideoUFO, a huge collection of video clips from YouTube that cover topics people are actually interested in. They found these topics by clustering over a million real prompts that users have submitted to text-to-video tools, then searched YouTube for matching videos, split them into clips, and used AI to write both a short and a detailed description for each clip. The dataset has very little overlap with existing video datasets and contains only Creative Commons videos that are free to use for research.
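As a rough illustration of the topic-discovery step (the paper only states that topics were found by clustering real user prompts), the sketch below embeds a handful of prompts and groups them with k-means. The embedding model, the toy prompts, and the cluster count are illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch (not the authors' pipeline): cluster real user prompts into topics.
# The embedding model, sample prompts, and cluster count are illustrative assumptions;
# the paper reports 1,291 topics clustered from the ~1M prompts in VidProM.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

prompts = [
    "a corgi surfing a big wave at sunset",
    "timelapse of a city skyline at night",
    "a robot chef cooking breakfast",
    "drone footage of snowy mountains",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")           # assumed embedding model
embeddings = embedder.encode(prompts, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)  # 1,291 clusters in the paper
for topic_id in range(kmeans.n_clusters):
    members = [p for p, label in zip(prompts, kmeans.labels_) if label == topic_id]
    print(f"topic {topic_id}: {members}")  # inspect prompts per cluster to name the topic
```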
Why it matters?
This matters because it could make AI-generated videos much better and more useful for things like film production, video games, and educational content. By training AI on videos that match what people really want to create, future text-to-video tools could produce more relevant and higher-quality results, leading to more creative and diverse video content across many industries.
Abstract
Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, our VideoUFO also features: (1) minimal (0.29%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes give future researchers greater freedom to broaden their training sources. VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. We then use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both a brief and a detailed caption for each clip. After verifying that the clips match their specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) the 16 current text-to-video models we evaluate do not achieve consistent performance across all user-focused topics, and (2) a simple model trained on VideoUFO outperforms the others on the worst-performing topics. The dataset is publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO under the CC BY 4.0 License.
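To make the retrieval step concrete: the abstract states that videos were searched exclusively through YouTube's official API under the Creative Commons license. Below is a hedged sketch of how such a query could look with the YouTube Data API v3; the API key, the example topic query, and the result handling are assumptions for illustration, not the authors' exact code.

```python
# Illustrative only: query the YouTube Data API v3 for Creative Commons videos on a topic.
# Requires the google-api-python-client package and a valid API key (placeholder below).
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")  # hypothetical key

response = youtube.search().list(
    q="cooking tutorial",            # one user-focused topic (example query)
    part="id,snippet",
    type="video",                    # required when filtering by license
    videoLicense="creativeCommon",   # restrict results to Creative Commons videos
    maxResults=50,
).execute()

video_ids = [item["id"]["videoId"] for item in response.get("items", [])]
print(video_ids)
```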
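Since the dataset is hosted on the Hugging Face Hub, one plausible way to inspect it is via the `datasets` library, as sketched below. The split name and the available fields are assumptions; consult the dataset card at the URL above for the actual layout and for how the video files themselves are distributed.

```python
# Hedged sketch: load VideoUFO records with the Hugging Face `datasets` library.
# The split name and field names are assumptions; see the dataset card for specifics.
from datasets import load_dataset

ds = load_dataset("WenhaoWang/VideoUFO", split="train")  # "train" split is an assumption
print(ds)        # dataset size and column names
print(ds[0])     # one clip's record, e.g. its brief and detailed captions
```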