Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
2025-06-18
Summary
This paper introduces Stream-Omni, a large AI model that can understand and work with text, images, and speech at the same time by explicitly aligning these different kinds of information inside the model.
What's the problem?
The problem is that it is very challenging to build AI models that handle language, vision, and speech together: these kinds of data are represented very differently, and models usually need large amounts of paired multimodal training data to learn how to align them.
What's the solution?
The researchers designed Stream-Omni to join text, images, and speech by aligning each modality to text along a different dimension of the model. Visual information is brought in by concatenating vision tokens with the text tokens along the sequence dimension, while speech is aligned to text through a mapping across the model's layers. Because these alignments reuse the language model's existing abilities, Stream-Omni can learn effectively from relatively little multimodal data; a conceptual sketch is given below.
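To make the two alignment ideas more concrete, here is a minimal PyTorch sketch, assuming that projected vision features are concatenated with text tokens along the sequence dimension and that speech positions are decoded toward text tokens at an intermediate layer. Every class name, dimension, and the choice of which layer carries the speech-to-text mapping are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of sequence-dimension vision concatenation and
# layer-dimension speech-to-text mapping. All names and sizes are hypothetical.
import torch
import torch.nn as nn


class ToyOmniBackbone(nn.Module):
    def __init__(self, d_model=512, n_layers=6, vocab_size=32000, n_speech_units=1024):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Hypothetical projector mapping vision encoder features into the text space.
        self.vision_proj = nn.Linear(768, d_model)
        # Hypothetical embedding for discretized speech units.
        self.speech_embed = nn.Embedding(n_speech_units, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        # Head used for the layer-dimension speech-to-text mapping (assumption:
        # speech is aligned to text tokens at an intermediate layer).
        self.speech_to_text_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_feats, speech_unit_ids):
        text = self.text_embed(text_ids)              # (B, T_text, D)
        vision = self.vision_proj(vision_feats)       # (B, T_img, D)
        speech = self.speech_embed(speech_unit_ids)   # (B, T_sp, D)

        # Sequence-dimension concatenation: vision tokens join the token sequence
        # directly, so the language model attends over them like text.
        hidden = torch.cat([vision, text, speech], dim=1)

        speech_text_logits = None
        for i, layer in enumerate(self.layers):
            hidden = layer(hidden)
            # Layer-dimension mapping: at an intermediate layer, the speech
            # positions are decoded toward text tokens, aligning the two
            # modalities inside the stack rather than only at the input.
            if i == len(self.layers) // 2:
                speech_hidden = hidden[:, -speech.size(1):, :]
                speech_text_logits = self.speech_to_text_head(speech_hidden)
        return hidden, speech_text_logits


# Usage with random stand-ins for token ids, image features, and speech units.
model = ToyOmniBackbone()
text_ids = torch.randint(0, 32000, (2, 16))
vision_feats = torch.randn(2, 49, 768)
speech_ids = torch.randint(0, 1024, (2, 40))
hidden, speech_text_logits = model(text_ids, vision_feats, speech_ids)
```

The design intuition the sketch tries to capture is that vision is naturally complementary to text, so it can simply share the sequence, whereas speech is semantically parallel to text, so it benefits from being mapped onto text representations across the layer dimension.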
Why it matters?
This matters because it helps build AI that can understand conversations, images, and sounds all together, making it more useful for real-world applications like virtual assistants, smart devices, and more interactive technology.
Abstract
Stream-Omni, a large language-vision-speech model, integrates text, vision, and speech by efficiently aligning modalities, using sequence-dimension concatenation for vision and layer-dimension mapping for speech, and achieves strong performance with less training data.