
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

2024-07-10


Summary

This paper talks about Video-STaR, a new self-training method designed to improve how large vision language models (LVLMs) learn from video data. It allows these models to use any labeled video dataset for instruction tuning, which helps them not only describe videos but also reason about them more effectively.

What's the problem?

The main problem is that existing datasets used to train LVLMs for video tasks are not diverse enough. They are typically created by prompting language models to generate question-and-answer pairs from video captions, which means they mostly focus on describing the videos rather than reasoning about them. Meanwhile, many labeled video datasets with other kinds of annotations already exist, but integrating them into LVLM training is not straightforward.

What's the solution?

To solve this issue, the authors developed Video-STaR, which enables LVLMs to self-train using any labeled video dataset. In each cycle, the model generates answers to questions about the videos, and these answers are filtered to keep only those that contain the original video labels. The model is then retrained on the filtered data, and the generate-filter-retrain cycle repeats. This approach lets the model improve without requiring new human annotations, effectively using the existing video labels as weak supervision for instruction tuning.
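To make the cycle concrete, here is a minimal sketch in Python of the generate-filter-retrain loop described above. The object and method names (lvlm, generate_answer, finetune) are hypothetical placeholders, not the authors' actual code; the real system prompts an LVLM and fine-tunes it on the filtered instruction data.

# Minimal sketch of the Video-STaR self-training cycle (hypothetical helpers).

def label_in_answer(answer: str, label: str) -> bool:
    # Weak-supervision filter: keep an answer only if it mentions the
    # ground-truth video label.
    return label.lower() in answer.lower()

def video_star_cycle(lvlm, labeled_videos, questions, num_cycles=3):
    # labeled_videos: list of (video, label) pairs
    # questions: one prompt per video
    for _ in range(num_cycles):
        generated = []
        # 1. Generation: the current model proposes an answer for each video question.
        for (video, label), question in zip(labeled_videos, questions):
            answer = lvlm.generate_answer(video, question)   # hypothetical call
            # 2. Filtering: discard answers that do not contain the original label.
            if label_in_answer(answer, label):
                generated.append({"video": video,
                                  "question": question,
                                  "answer": answer})
        # 3. Fine-tuning: retrain the model on the filtered, self-generated data.
        lvlm = lvlm.finetune(generated)                      # hypothetical call
    return lvlm

The key design choice is the filter: because only answers containing the correct label survive, the existing labels act as weak supervision even though no new question-answer annotations are written by hand.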

Why it matters?

This research is important because it enhances the performance of LVLMs in understanding and analyzing videos. By improving how these models are trained with diverse datasets, Video-STaR can lead to better results in tasks like video question answering and action recognition. This advancement can have significant applications in areas such as education, entertainment, and security, where accurate video analysis is crucial.

Abstract

The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist - however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered only to those that contain the original video labels, and the LVLM is then re-trained on the generated dataset. By only training on generated answers that contain the correct video labels, Video-STaR utilizes these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) on downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.