
StreamingVLM: Real-Time Understanding for Infinite Video Streams

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

2025-10-13


Summary

This paper introduces a new vision-language model called StreamingVLM, designed to understand continuous video streams in real-time without running into performance issues.

What's the problem?

Current vision-language models struggle with long videos. Processing an entire video with full attention has a cost that grows quadratically with length, so latency and memory balloon as the stream continues. Simply processing short clips in isolation doesn't work well either: the model loses track of what happened earlier, and naive sliding windows waste time recomputing attention over the content that overlaps between windows.

What's the solution?

StreamingVLM solves this by keeping its memory compact. During inference, it maintains a small KV cache made up of a few "attention sink" tokens, a short window of recent vision tokens, and a long window of recent text tokens, reusing those cached states instead of recomputing attention from scratch at each step. This streaming behavior is instilled through a simple supervised fine-tuning (SFT) strategy that applies full attention to short, overlapping video chunks, which mimics the inference-time attention pattern without requiring training on prohibitively long videos.
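The cache layout described above can be sketched as a toy eviction policy. This is a minimal illustration, not the paper's implementation: the class name, the constants (`NUM_SINKS`, `VISION_WINDOW`, `TEXT_WINDOW`), and the per-token bookkeeping are all assumptions chosen for clarity.

```python
from collections import deque

# Illustrative sizes only; the real model's windows differ.
NUM_SINKS = 4          # attention-sink tokens kept for the whole stream
VISION_WINDOW = 256    # short window of recent vision tokens
TEXT_WINDOW = 1024     # long window of recent text tokens

class StreamingKVCache:
    """Toy sketch of a bounded KV cache: sinks + recent vision + recent text."""

    def __init__(self):
        self.sinks = []                           # first tokens, never evicted
        self.vision = deque(maxlen=VISION_WINDOW) # deque drops oldest on overflow
        self.text = deque(maxlen=TEXT_WINDOW)

    def append(self, token_id, modality):
        if len(self.sinks) < NUM_SINKS:
            self.sinks.append(token_id)           # reuse states of attention sinks
        elif modality == "vision":
            self.vision.append(token_id)
        else:
            self.text.append(token_id)

    def context(self):
        # Tokens whose cached states are reused at the next decoding step.
        # The total is bounded, so per-step latency stays flat as the
        # stream grows instead of increasing with video length.
        return self.sinks + list(self.vision) + list(self.text)
```

Because every component has a fixed capacity, the context never exceeds `NUM_SINKS + VISION_WINDOW + TEXT_WINDOW` tokens no matter how long the stream runs, which is what makes stable real-time decoding possible.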

Why it matters?

This research is important because it makes it possible to build AI assistants and robots that understand the world around them in real-time from continuous video input. The new model outperforms strong baselines such as GPT-4o mini on streaming benchmarks, and its training recipe even improves general video question answering, not just tasks requiring continuous stream understanding.

Abstract

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
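The SFT recipe in the abstract — full attention over short, overlapped video chunks — can be sketched as a chunk scheduler. The function name and the chunk/overlap sizes below are illustrative assumptions, not values from the paper; the point is only that overlapping windows let the model learn to carry context across chunk boundaries without ever training on a full two-hour context.

```python
def overlapped_chunks(num_frames, chunk_len, overlap):
    """Yield (start, end) frame-index pairs for short overlapping chunks.

    During SFT, full attention would be applied within each chunk; the
    overlap region is what teaches the model to bridge chunk boundaries,
    mimicking the streaming inference pattern on short contexts.
    """
    assert 0 <= overlap < chunk_len
    stride = chunk_len - overlap
    start = 0
    while start < num_frames:
        yield (start, min(start + chunk_len, num_frames))
        if start + chunk_len >= num_frames:
            break           # last chunk reached the end of the video
        start += stride
```

For example, a 10-frame clip with chunks of 4 frames and an overlap of 2 yields the windows (0, 4), (2, 6), (4, 8), (6, 10): each consecutive pair shares two frames, so no training sample ever exceeds the short chunk length.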