PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, Jiankun Yang

2024-11-05

Summary

This paper introduces PPLLaVA, a new model designed to improve how video-based large language models (LLMs) understand and process both short and long videos. It aims to create a unified approach that works well for videos of any length by addressing the issue of redundant content.

What's the problem?

Most existing video LLMs struggle to understand long videos and often fail outright on hour-long content. Conversely, methods designed specifically for long videos tend to perform poorly on shorter videos and single images. This trade-off makes it difficult to build a single model that handles videos of varying lengths effectively.

What's the solution?

To solve this problem, the authors developed PPLLaVA, which uses a novel pooling strategy to compress the video token sequence while preserving the visual features most relevant to the user's request. The model has three main components: first, a CLIP-based visual-prompt alignment that extracts the visual information relevant to the user's instructions; second, prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling; and third, a CLIP context extension that handles the lengthy prompts common in visual dialogue. Together, these components allow PPLLaVA to generate high-quality outputs across different video lengths and tasks, such as generating captions or answering questions about the video content.
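The sketch below is a minimal illustration of the general idea described above, not the authors' implementation: visual tokens are scored against a CLIP-style embedding of the user's instruction, and the relevance-weighted sequence is then pooled down to a fixed token budget. All function names, tensor shapes, the temperature, and the rescaling step are illustrative assumptions.

```python
# Minimal sketch of prompt-guided pooling (illustrative only, not the PPLLaVA code).
import torch
import torch.nn.functional as F

def prompt_guided_pool(visual_tokens, prompt_embed, out_len=1024, temperature=0.07):
    """
    visual_tokens: (N, D) flattened frame-patch features from the vision encoder
    prompt_embed:  (D,)   CLIP-style text embedding of the user's instruction
    out_len:       target number of visual tokens passed to the LLM
    """
    # Score each visual token by its cosine similarity to the instruction.
    v = F.normalize(visual_tokens, dim=-1)
    p = F.normalize(prompt_embed, dim=-1)
    relevance = (v @ p) / temperature            # (N,)
    weights = relevance.softmax(dim=0)           # (N,)
    weights = weights * weights.numel()          # rescale so the average weight is ~1

    # Pool the relevance-weighted sequence down to `out_len` tokens; adaptive
    # average pooling stands in here for the paper's convolution-style compression.
    weighted = visual_tokens * weights.unsqueeze(-1)   # (N, D)
    x = weighted.t().unsqueeze(0)                      # (1, D, N)
    pooled = F.adaptive_avg_pool1d(x, out_len)         # (1, D, out_len)
    return pooled.squeeze(0).t()                       # (out_len, D)

# Example: 16 frames x 576 patch tokens compressed to 1024 tokens.
tokens = torch.randn(16 * 576, 768)
prompt = torch.randn(768)
print(prompt_guided_pool(tokens, prompt).shape)  # torch.Size([1024, 768])
```

Because the pooling target is a free parameter, the same mechanism can compress a short clip only lightly while squeezing an hour-long video much harder, which is what lets one model serve both regimes.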

Why it matters?

This research is significant because it enhances the capabilities of AI in understanding and generating video content. By enabling better processing of both short and long videos, PPLLaVA can improve applications in areas like education, entertainment, and information retrieval, making it easier for users to interact with and gain insights from video data.

Abstract

The past year has witnessed significant advances in video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods customized for long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: the CLIP-based visual-prompt alignment that extracts visual information relevant to the user's instructions, the prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and the CLIP context extension designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at https://github.com/farewellthree/PPLLaVA.
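As a rough back-of-the-envelope illustration of what the fixed 1024-token visual context implies (the frame counts and the 576 patch tokens per frame below are assumptions for illustration, not figures from the paper), the snippet shows how the per-frame token budget shrinks as videos get longer, which is why instruction-aware compression matters most for long inputs.

```python
# Illustrative token-budget arithmetic (assumed numbers, not from the paper).
visual_budget = 1024           # visual context size reported in the abstract
patch_tokens_per_frame = 576   # assumption: a CLIP ViT-L/14-style encoder at 336px

for num_frames in (8, 64, 512):
    raw_tokens = num_frames * patch_tokens_per_frame
    per_frame_budget = visual_budget / num_frames
    print(f"{num_frames:>3} frames: {raw_tokens:>6} raw tokens -> "
          f"{per_frame_budget:6.1f} tokens/frame after compression "
          f"({raw_tokens / visual_budget:.1f}x reduction)")
```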