Streaming Video Instruction Tuning

Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou

2025-12-25

Summary

This paper introduces Streamo, a new artificial intelligence system that can understand and interact with live video streams like a helpful assistant.

What's the problem?

Current AI models that work with videos usually only do one specific thing, like answering questions about the video or writing captions. They aren't very good at handling the continuous flow of information in a live video and can't perform a variety of tasks at the same time, like narrating what's happening, understanding actions, and answering questions as things unfold.

What's the solution?

The researchers built Streamo-Instruct-465K, a large instruction-following dataset designed specifically for streaming video understanding. It covers diverse temporal contexts and supervises many tasks at once, so a single model can be trained on all of them together. Training Streamo end-to-end on this dataset lets it track the timing of events and respond in real time.

Why does it matter?

Streamo is a step toward AI that can follow a video as it unfolds rather than only analyzing it afterward. It bridges the gap between models that analyze pre-recorded videos and assistants that respond while watching a live stream, opening the door to more intelligent and interactive video experiences.

Abstract

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
