Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
2024-10-04

Summary
This paper introduces LLaVA-Video-178K, a high-quality synthetic dataset designed for training video models to follow instructions.
What's the problem?
Creating video models that can understand and respond to instructions is challenging because high-quality training data is scarce. Collecting and curating real-world video data is time-consuming and difficult, and existing datasets often lack the variety of tasks needed to train these models effectively.
What's the solution?
To address this, the authors built LLaVA-Video-178K, a synthetic dataset of 178,510 videos annotated with detailed captions, open-ended questions, and multiple-choice questions, all designed for video instruction-following tasks. By combining this dataset with existing visual instruction-tuning data, they trained a new model called LLaVA-Video. Their experiments show that the model performs strongly across a range of video benchmarks, demonstrating the effectiveness of the synthetic data.
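As a rough illustration, a single entry in such a dataset would bundle all three annotation types for one video. The sketch below is a minimal, hypothetical example: the field names, file path, and question text are assumptions for illustration only and are not taken from the released LLaVA-Video-178K schema.

```python
# Hypothetical sketch of one dataset record. The paper describes three annotation
# types per video (detailed caption, open-ended QA, multiple-choice QA); the exact
# field names and values below are illustrative assumptions, not the real schema.
example_record = {
    "video": "example_clip.mp4",  # hypothetical path to a source video
    "detailed_caption": "A person slices vegetables, then stirs them into a pan.",
    "open_ended_qa": [
        {
            "question": "What does the person do after slicing the vegetables?",
            "answer": "They stir the vegetables into a pan.",
        },
    ],
    "multiple_choice_qa": [
        {
            "question": "What is the person preparing?",
            "options": ["A) A salad", "B) A stir-fry", "C) A dessert", "D) A drink"],
            "answer": "B",
        },
    ],
}

# Check that the record carries all three annotation types the dataset is
# described as containing.
assert {"detailed_caption", "open_ended_qa", "multiple_choice_qa"} <= example_record.keys()
print(example_record["detailed_caption"])
```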
Why it matters?
This research matters because it provides a way to generate high-quality training data for video models without relying solely on manually collected real-world data. By improving how these models learn from videos, LLaVA-Video can enhance applications in areas like education, entertainment, and virtual assistants, making them more capable of understanding and responding to user instructions.
Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.