OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng

2025-04-02

Summary

This paper introduces a new benchmark for testing AI models that must understand and respond to videos as they are being streamed, like in a live video call.

What's the problem?

Existing video benchmarks mostly evaluate models on pre-recorded clips, so they don't measure how well a model can follow a video as it streams in real time or decide on its own when to respond.

What's the solution?

The researchers created OmniMMI, a benchmark of over 1,121 videos and 2,290 questions spanning six subtasks that specifically test a model's ability to understand streaming video and to respond proactively, speaking up at the right moment rather than only when prompted. They also propose M4 (Multi-modal Multiplexing Modeling), a framework for building efficient streaming models that can keep watching and listening while they generate a response. A rough sketch of how such an evaluation could be run is shown below.
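To make the "proactive" part concrete, here is a minimal, hypothetical sketch of a streaming evaluation loop: frames are fed to the model one at a time, and the harness records whether the model chooses to answer and whether it does so at an acceptable moment. This is not the official OmniMMI evaluation code; names like StreamingSample, StreamingModel, and trigger_time are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamingSample:
    frames: List[bytes]        # video frames that arrive one at a time
    question: str              # query posed before or during the stream
    trigger_time: int          # earliest frame index at which answering is valid
    reference_answer: str


class StreamingModel:
    """Placeholder interface: a real streaming OmniLLM ingests frames incrementally."""

    def observe(self, frame: bytes) -> None:
        ...  # update internal state with the newest frame

    def maybe_respond(self, question: str) -> Optional[str]:
        ...  # return an answer only when the model decides it should speak


def evaluate_proactive(model: StreamingModel, sample: StreamingSample) -> dict:
    """Stream frames to the model and record when (and what) it answers."""
    answer, answer_time = None, None
    for t, frame in enumerate(sample.frames):
        model.observe(frame)
        if answer is None:
            answer = model.maybe_respond(sample.question)
            if answer is not None:
                answer_time = t
    return {
        "answered": answer is not None,
        "on_time": answer_time is not None and answer_time >= sample.trigger_time,
        "answer": answer,
    }
```

The key difference from ordinary video QA is the timing signal: a proactive model is judged not only on what it says but on whether it volunteers the answer once enough of the stream has been seen.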

Why does it matter?

This work matters because it gives researchers a standard way to measure and compare how well models handle live video, which can drive the development of AI assistants that interact with streaming video in a genuinely useful way.

Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see, listen while generating.
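The phrase "see, listen while generating" points to the multiplexing idea behind M4: perception of the incoming stream should not stop while the model is producing tokens. The sketch below illustrates that idea only at the level of concurrency; it is an assumption-laden toy, not the paper's implementation, and the placeholder token string stands in for a real hypothetical decode step.

```python
import asyncio
from typing import AsyncIterator, List


async def perceive(stream: AsyncIterator[bytes], memory: List[bytes]) -> None:
    """Continuously fold incoming frames into a shared context buffer."""
    async for frame in stream:
        memory.append(frame)  # in practice: encode the frame into features

async def generate(memory: List[bytes], max_tokens: int = 32) -> List[str]:
    """Emit tokens one at a time, consulting the latest perceived context."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        await asyncio.sleep(0)  # yield so perception can run between tokens
        tokens.append(f"<token|context_frames={len(memory)}>")  # placeholder decode step
    return tokens

async def respond_while_watching(stream: AsyncIterator[bytes]) -> List[str]:
    """Run perception and generation concurrently over the same context."""
    memory: List[bytes] = []
    perceiver = asyncio.create_task(perceive(stream, memory))
    tokens = await generate(memory)
    perceiver.cancel()
    return tokens
```

In this toy version the generator keeps reading an ever-growing context buffer while new frames arrive, which is the behavior the benchmark's streaming and proactive subtasks are designed to probe.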