Wolf: Captioning Everything with a World Summarization Framework

Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone

2024-07-29

Summary

This paper introduces Wolf, a new framework designed to automatically generate captions for videos. It combines multiple vision-language models and improves the accuracy and detail of captions by summarizing information from both image-level and video-level descriptions.

What's the problem?

Creating accurate and detailed captions for videos is challenging because it requires understanding complex visual information and context. Current methods often struggle to produce high-quality captions, especially for dynamic content like driving or robotics footage. This can lead to captions that are either too basic or not representative of what is actually happening in the video.

What's the solution?

Wolf addresses this problem with a mixture-of-experts approach that combines the strengths of different Vision Language Models (VLMs) operating on both images and videos. It first generates detailed descriptions for individual frames and for the clip as a whole, then uses a language model to summarize these into one cohesive video caption. To evaluate how good the captions are, the authors developed a new metric called CapScore, which uses an LLM to score the quality and similarity of generated captions against human-written ground truth. Tested on four human-annotated datasets spanning autonomous driving, general scenes, and robotics, Wolf outperformed both research and commercial captioners; against GPT-4V, for instance, it improved CapScore by 55.6% on quality and 77.4% on similarity on challenging driving videos.
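To make the two-stage design concrete, here is a minimal Python sketch of a Wolf-style pipeline. The callables (`image_vlms`, `video_vlms`, `summarize`) and the prompt wording are hypothetical stand-ins, not the paper's actual API; the real system wires in concrete VLMs and an LLM summarizer in their place.

```python
# Minimal sketch of a Wolf-style captioning pipeline (illustrative only).
# The expert callables below are hypothetical stand-ins for image VLMs,
# video VLMs, and an LLM summarizer.

from typing import Callable, List

def wolf_style_caption(
    frames: List["Image"],                 # decoded video frames
    image_vlms: List[Callable],            # experts that caption single frames
    video_vlms: List[Callable],            # experts that caption the whole clip
    summarize: Callable[[str], str],       # LLM that fuses expert outputs
) -> str:
    """Fuse per-frame and whole-video descriptions into one caption."""
    # Stage 1: each image expert describes every frame in detail.
    frame_captions = [
        f"Frame {i}: {vlm(frame)}"
        for i, frame in enumerate(frames)
        for vlm in image_vlms
    ]
    # Stage 2: each video expert describes the clip as a whole.
    clip_captions = [vlm(frames) for vlm in video_vlms]
    # Stage 3: an LLM summarizes all expert outputs into one cohesive caption.
    prompt = (
        "Combine the following descriptions into one accurate, detailed "
        "video caption:\n" + "\n".join(frame_captions + clip_captions)
    )
    return summarize(prompt)
```

The premise of the mixture-of-experts fusion is that image experts capture fine frame-level detail while video experts capture temporal context, giving the summarizer complementary information that no single model provides on its own.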

Why it matters?

This research is important because it enhances how we understand video content through better captioning, making videos more accessible for people who rely on captions, such as those who are deaf or hard of hearing. By improving video captioning technology, Wolf can help in areas like education, entertainment, and information dissemination, ensuring that more people can benefit from video content.

Abstract

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html.
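The abstract describes CapScore only as an LLM-based metric comparing generated captions to ground truth, so the sketch below is a plausible reconstruction rather than the paper's exact prompt or scale; `ask_llm` is a hypothetical LLM call.

```python
# Illustrative CapScore-style evaluation (the paper's exact prompt and
# scoring scale are not reproduced here; `ask_llm` is a hypothetical
# function that sends a prompt to an LLM and returns its text reply).

import json
from typing import Callable

def capscore(generated: str, ground_truth: str,
             ask_llm: Callable[[str], str]) -> dict:
    """Ask an LLM judge to rate a generated caption against a reference."""
    prompt = (
        "Rate the candidate caption against the reference on two axes, "
        "each as a number between 0 and 1:\n"
        "1. similarity: how closely it matches the reference content\n"
        "2. quality: how accurate and detailed it is on its own\n"
        f"Reference: {ground_truth}\n"
        f"Candidate: {generated}\n"
        'Answer as JSON: {"similarity": ..., "quality": ...}'
    )
    # Parse the judge's JSON reply into {"similarity": float, "quality": float}.
    return json.loads(ask_llm(prompt))
```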