
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

2024-07-23


Summary

This paper presents SlowFast-LLaVA (SF-LLaVA), a video large language model that works without any additional training. It captures both fine-grained spatial detail from individual frames and the flow of action over time, making it effective across a range of video tasks.

What's the problem?

Many existing video models are either very large or require extensive fine-tuning on video datasets, which is time-consuming and resource-intensive and makes them less practical for real-world applications that need quick, efficient processing. In addition, combining detailed spatial information with motion cues from many frames is difficult without exceeding the token budget (context length) of typical large language models.

What's the solution?

The authors developed SF-LLaVA, which feeds video frames to the language model through two complementary streams. The 'Slow' pathway samples a small number of frames and keeps their full spatial detail, while the 'Fast' pathway covers many more frames but pools each one heavily, capturing motion at a low per-frame token cost. Together, the two streams gather both spatial and temporal information without any extra training and without exceeding the language model's token budget (see the sketch below). Experimental results show that SF-LLaVA outperforms other training-free methods and achieves results comparable to advanced models that have been fine-tuned on large video datasets.
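A minimal sketch of this two-stream aggregation is shown below, assuming frame features have already been extracted by an image encoder as a tensor of shape (frames, 24, 24, channels). The frame counts, pooling size, and function name are illustrative assumptions based on the abstract, not the paper's exact configuration.

# Minimal sketch of the SlowFast two-stream input design (assumed setup, not
# the authors' released implementation). Frame counts and pooling sizes are
# illustrative, following the examples given in the abstract.
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features, slow_frames=8, fast_pool=6):
    """Aggregate video tokens via a Slow pathway (few frames, full detail)
    and a Fast pathway (all frames, spatially pooled)."""
    T, H, W, D = frame_features.shape            # e.g., (48, 24, 24, D)

    # Slow pathway: uniformly sample a few frames, keep all spatial tokens.
    slow_idx = torch.linspace(0, T - 1, slow_frames).long()
    slow = frame_features[slow_idx]               # (slow_frames, 24, 24, D)
    slow_tokens = slow.reshape(-1, D)             # slow_frames * 576 tokens

    # Fast pathway: keep every frame, but pool spatially (6x downsampling
    # turns a 24x24 grid into 4x4), so motion is covered cheaply per frame.
    fast = frame_features.permute(0, 3, 1, 2)     # (T, D, 24, 24)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)  # (T, D, 4, 4)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, D)  # T * 16 tokens

    # Concatenate both streams into one visual token sequence for the LLM.
    return torch.cat([slow_tokens, fast_tokens], dim=0)

Concatenating the two streams keeps the visual token sequence short enough for a standard LLM context window while still covering both appearance and motion.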

Why it matters?

This research is important because it provides a more efficient way to analyze videos, making it easier to apply in various fields such as video question-answering, content creation, and surveillance. By improving how machines understand video content without requiring extensive training, SF-LLaVA can help streamline processes in industries that rely on video analysis.

Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.
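To make the token-budget claim concrete, here is a rough worked example; only the 24x24 grid and 6x downsampling figures come from the abstract, while the frame counts (8 Slow, 48 Fast) are assumptions for illustration:

$$\underbrace{8 \times 24 \times 24}_{\text{Slow}} + \underbrace{48 \times 4 \times 4}_{\text{Fast}} = 4608 + 768 = 5376 \ \text{tokens}$$

compared with $48 \times 576 = 27{,}648$ tokens if every frame kept its full 24x24 grid.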