Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, Mohamed Elhoseiny
2024-07-18
Summary
This paper presents Goldfish, a new method for understanding long videos by combining video and language processing to answer questions about the content.
What's the problem?
Most current models that analyze videos handle short clips well but struggle with longer videos. Long videos contain a lot of redundant information (noise) and demand far more memory and compute, so these models have a hard time answering questions accurately about what happens across the full video.
What's the solution?
Goldfish addresses these challenges with an efficient retrieval system that first identifies the short clips in a long video most relevant to the user's question, then answers by focusing only on those selected clips. To power the retrieval step, the authors built MiniGPT4-Video, a model that generates detailed descriptions of each video clip. They also introduced a new benchmark, TVQA-long, which evaluates long-video understanding by aggregating questions over entire episodes rather than short segments. Goldfish reached 41.78% accuracy on this benchmark, surpassing previous methods by 14.94%.
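The core idea is simple: describe each clip in text, then rank clips by how well their descriptions match the question. Below is a minimal sketch of that retrieval step. The bag-of-words `embed` function and the sample descriptions are illustrative stand-ins; the actual system uses MiniGPT4-Video captions and learned text embeddings.

```python
# Sketch of Goldfish-style clip retrieval: rank clip descriptions by
# similarity to the question and keep the top-k clips for answering.
# embed() here is a toy bag-of-words stand-in for a learned embedding.
import math
import re
from collections import Counter

def embed(text):
    """Bag-of-words count vector (stand-in for a real text embedding)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_clips(question, clip_descriptions, k=2):
    """Return indices of the k clips whose descriptions best match the question."""
    q = embed(question)
    scored = [(cosine(q, embed(d)), i) for i, d in enumerate(clip_descriptions)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical per-clip descriptions for a long episode:
descriptions = [
    "two characters argue in the hospital cafeteria",
    "a car chase through the city at night",
    "the detective interviews a witness about the robbery",
]
print(top_k_clips("who did the detective talk to about the robbery?", descriptions, k=1))
# → [2]: the clip about the detective and the robbery is retrieved
```

Only the retrieved clips (and their descriptions) are then passed to the answering model, which is what keeps memory and compute bounded no matter how long the full video is.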
Why it matters?
This research is important because it improves how we can analyze and comprehend long videos, which are increasingly common in media like movies and TV shows. By developing better tools for understanding video content, Goldfish can enhance applications in areas such as content creation, education, and entertainment, making it easier for users to interact with and learn from video material.
Abstract
Most current LLM-based models for video understanding can process videos within minutes. However, they struggle with lengthy videos due to challenges such as "noise and redundancy", as well as "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content. Goldfish approaches these challenges with an efficient retrieval mechanism that initially gathers the top-k video clips relevant to the instruction before proceeding to provide the desired response. This retrieval design enables Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television series. To facilitate the retrieval process, we developed MiniGPT4-Video, which generates detailed descriptions for the video clips. In addressing the scarcity of benchmarks for long video evaluation, we adapted the TVQA short video benchmark for extended content analysis by aggregating questions from entire episodes, thereby shifting the evaluation from partial to full episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long benchmark, surpassing previous methods by 14.94%. Our MiniGPT4-Video also shows exceptional performance in short video comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%, 16.5%, and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short video benchmarks, respectively. These results indicate that our models achieve significant improvements in both long and short video understanding. Our models and code have been made publicly available at https://vision-cair.github.io/Goldfish_website/