StreamChat: Chatting with Streaming Video

Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez

2024-12-12

Summary

This paper introduces StreamChat, a new method that improves how large multimodal models (LMMs) interact with streaming video, allowing for real-time responses based on the latest video content.

What's the problem?

Existing methods for interacting with streaming video rely only on the visual information available at the moment a question is asked. Because the model never sees the changes that happen in the video after that point, its responses are delayed and less accurate.

What's the solution?

StreamChat solves this problem by updating the visual context at every step of generating a response. It uses a flexible cross-attention-based architecture that efficiently processes streaming video inputs while keeping the visual information up to date throughout the interaction. Additionally, it introduces a new dense instruction dataset to train models for these kinds of interactions, along with a mechanism that encodes the relative timing of visual and text tokens. Together, these let the model respond quickly and accurately to changes in the video, as the sketch below illustrates.
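To make the core idea concrete, here is a minimal PyTorch sketch, not the paper's actual architecture: a decoder block whose cross-attention keys and values are refreshed with the newest frame features before each generation step, so the answer can reflect frames that arrived after the question was asked. The class name, dimensions, and random tensors are illustrative placeholders.

```python
import torch
import torch.nn as nn

class StreamingCrossAttentionDecoder(nn.Module):
    """Hypothetical decoder block: text tokens cross-attend to the
    most recent visual features before every generation step."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Causal self-attention over the text generated so far.
        length = text.size(1)
        causal = torch.triu(torch.ones(length, length, dtype=torch.bool), diagonal=1)
        x, _ = self.self_attn(text, text, text, attn_mask=causal)
        x = self.norm1(text + x)
        # Cross-attend into the *latest* visual context, so each decoding
        # step sees frames that streamed in after the question was posed.
        y, _ = self.cross_attn(x, visual, visual)
        return self.norm2(x + y)

decoder = StreamingCrossAttentionDecoder()
visual_buffer = []                      # grows as new frames stream in
tokens = torch.randn(1, 1, 512)         # stand-in for embedded prompt tokens

for step in range(4):
    frame_feats = torch.randn(1, 16, 512)     # hypothetical per-frame features
    visual_buffer.append(frame_feats)
    context = torch.cat(visual_buffer, dim=1)  # refresh context every step
    next_token = decoder(tokens, context)[:, -1:, :]
    tokens = torch.cat([tokens, next_token], dim=1)
```

The key point of the loop is that `context` is rebuilt before each decoding step rather than frozen when the question arrives, which is the behavior StreamChat is designed to achieve.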

Why it matters?

This research is important because it enhances the capabilities of AI models in understanding and interacting with dynamic video content. By improving how these models respond to real-time changes, StreamChat can be used in various applications such as live streaming, virtual meetings, and interactive media, making interactions smoother and more engaging.

Abstract

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
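The abstract does not spell out the parallel 3D-RoPE mechanism, but the general idea of encoding relative temporal order with rotary position embeddings can be sketched with standard 1D RoPE. In the sketch below, visual tokens and the text tokens decoded at the same timestep share a temporal index; that shared-index scheme is an assumption about what "parallel" means here, not the paper's exact formulation.

```python
import torch

def rotary_embed(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """Standard 1D rotary position embedding (RoPE).

    x: (seq_len, dim) with even dim; positions: (seq_len,) integer indices.
    """
    dim = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each consecutive channel pair by a position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Assumed indexing: text tokens carry the timestep at which they were
# decoded, and each visual token carries the timestep of its frame, so
# attention scores depend on relative time offsets, not absolute order.
text_pos = torch.tensor([0, 0, 1, 1, 2])   # text tokens, by decoding timestep
vis_pos = torch.tensor([0, 1, 2])          # one visual token per frame
q = rotary_embed(torch.randn(5, 64), text_pos)
k = rotary_embed(torch.randn(3, 64), vis_pos)
scores = q @ k.T   # relative temporal offsets are baked into the dot products
```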