VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao
2024-11-28

Summary
This paper presents MMDuet, a new system that improves how video large language models (VideoLLMs) understand and respond to videos in real time by using a video-text duet interaction format.
What's the problem?
Current VideoLLMs require the entire video plus a query as input before they generate a response, which does not work for live settings such as streaming, where the video has no fixed end. This interaction format limits their ability to give timely answers and hurts performance on time-sensitive tasks that require localizing specific video segments.
What's the solution?
The authors introduce a duet interaction format in which the video plays continuously while both the user and the model can insert text messages at any point, allowing the model to respond in real time as the video plays. They created a new dataset, MMDuetIT, to train the model on this interaction style, and proposed a new task, MAGQA, to benchmark its ability to produce timely, grounded answers during playback. This approach significantly improves performance on time-sensitive tasks without extensive retraining.
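To make the interaction format concrete, the sketch below shows how a duet-style inference loop could be wired up: frames arrive one at a time, the user may interject at any frame, and after each frame the model decides whether to speak. This is a minimal illustration only; `DuetModel`, `score_should_respond`, `generate_reply`, and the fixed threshold are hypothetical placeholders, not the actual MMDuet API or its decision mechanism.

```python
# Minimal sketch of a video-text duet interaction loop (illustrative only).
# DuetModel, score_should_respond, and generate_reply are hypothetical names,
# not the MMDuet implementation; the real model and thresholds differ.

from dataclasses import dataclass, field

@dataclass
class DuetModel:
    respond_threshold: float = 0.5                 # assumed decision threshold
    history: list = field(default_factory=list)    # interleaved frames and text turns

    def observe_frame(self, frame) -> None:
        """Append one video frame to the interleaved context."""
        self.history.append(("frame", frame))

    def observe_user_text(self, text: str) -> None:
        """The user may insert a message at any point during playback."""
        self.history.append(("user", text))

    def score_should_respond(self) -> float:
        """Placeholder: a real model scores whether now is a good time to speak."""
        return 0.0

    def generate_reply(self) -> str:
        """Placeholder: a real model decodes a response from the current context."""
        return ""

def duet_playback(model: DuetModel, frames, user_turns: dict):
    """Play frames one by one; the model may insert a reply after any frame."""
    for t, frame in enumerate(frames):
        if t in user_turns:                        # user interjects mid-video
            model.observe_user_text(user_turns[t])
        model.observe_frame(frame)
        if model.score_should_respond() > model.respond_threshold:
            reply = model.generate_reply()         # answer while the video keeps playing
            model.history.append(("assistant", reply))
            yield t, reply

# Example usage: ask a question at frame 0 and collect time-stamped replies.
# for t, reply in duet_playback(DuetModel(), frames, {0: "When is the butter added?"}):
#     print(f"[frame {t}] {reply}")
```

The key difference from whole-video question answering is that replies are emitted at specific frame indices during playback, which is what time-sensitive tasks and the MAGQA benchmark evaluate.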
Why it matters?
This research is important because it enhances how AI can interact with videos, making it more useful for applications like live streaming, video analysis, and education. By allowing real-time communication between users and the model, MMDuet can provide better insights and responses during video playback, leading to a more engaging experience.
Abstract
Recent research on video large language models (VideoLLMs) has predominantly focused on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users typically interact with VideoLLMs by providing the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension, where videos do not end and responses are required in real time, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements on various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training effort, and also enables VideoLLMs to reply in real time as the video plays. Code, data and demo are available at: https://github.com/yellow-binary-tree/MMDuet.