
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Zhifei Xie, Changqiao Wu

2024-10-21

Summary

This paper introduces Mini-Omni2, an open-source multi-modal model that combines vision, speech, and text capabilities, allowing it to answer visual and spoken queries with real-time voice responses and to interact with users in a more natural, flexible way.

What's the problem?

While models like GPT-4o can handle multiple types of data (such as images and audio), building a single model that effectively combines all of these abilities is challenging. The difficulty comes from the complexity of multi-modal data, the intricacy of the model architecture, and the training process needed to make the modalities work together. Many open-source models reproduce individual pieces of this, such as visual understanding or voice chat, but few integrate all modalities into one end-to-end system.

What's the solution?

To address this, the authors built Mini-Omni2 on top of pretrained visual and auditory encoders and aligned them with a language model through a three-stage training process: first expanding the model's capabilities with the new encoders, then aligning the modalities so the model can answer visual and spoken queries in text, and finally adding audio output so it can respond with real-time speech. Mini-Omni2 also features a command-based interruption mechanism, letting users cut off the model's spoken output with a voice command and interact more flexibly.
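To make the staged schedule more concrete, here is a minimal PyTorch-style sketch of how such a three-stage alignment could be organized. The module names (vision/audio adapters, audio decoder) and the `run_stage` helper are illustrative assumptions, not the authors' released training code.

```python
import torch

def set_trainable(modules, flag):
    # Freeze (flag=False) or unfreeze (flag=True) every parameter in `modules`.
    for module in modules:
        for p in module.parameters():
            p.requires_grad = flag

def run_stage(trainable, frozen, loss_fn, batches, lr=1e-4):
    # Train only the `trainable` modules for one stage of the schedule.
    set_trainable(frozen, False)
    set_trainable(trainable, True)
    params = [p for m in trainable for p in m.parameters()]
    optimizer = torch.optim.AdamW(params, lr=lr)
    for batch in batches:
        optimizer.zero_grad()
        loss_fn(batch).backward()
        optimizer.step()

# Stage 1 -- capability expansion: train only the small adapters that map the
# frozen vision/audio encoder features into the language model's input space.
#   run_stage([vision_adapter, audio_adapter], [lm, vision_enc, audio_enc], loss_fn, stage1_data)
# Stage 2 -- modality alignment: also update the language model on image/speech
# questions paired with text answers.
#   run_stage([lm, vision_adapter, audio_adapter], [vision_enc, audio_enc], loss_fn, stage2_data)
# Stage 3 -- audio output: train the speech-generation head so replies can be
# streamed as audio rather than text only.
#   run_stage([lm, audio_decoder], [vision_enc, audio_enc], loss_fn, stage3_data)
```

Keeping the pretrained encoders frozen throughout is what would let the model retain their single-modality performance while only a limited amount of multi-modal data is needed for alignment, which is the property the authors emphasize.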

Why it matters?

This research is important because it enhances how AI models can interact with people by allowing them to understand and respond to multiple forms of input simultaneously. By making Mini-Omni2 open-source, the authors provide a valuable resource for developers and researchers, encouraging further advancements in multi-modal AI technologies that can improve applications like virtual assistants, customer service bots, and more.

Abstract

GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, with a similar form of functionality, and we hope it can offer valuable insights for subsequent research.
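For the duplex-interaction side, the command-based interruption mechanism can be pictured roughly as the loop below: while the assistant streams its spoken reply, it keeps checking the incoming audio and stops as soon as an explicit interruption command is detected. The function names here (`generate_audio_chunks`, `detect_stop_command`) are hypothetical placeholders, not the paper's actual interface.

```python
from typing import Callable, Iterable, Iterator

def stream_with_interruption(
    generate_audio_chunks: Callable[[], Iterator[bytes]],
    incoming_audio_frames: Iterable[bytes],
    detect_stop_command: Callable[[bytes], bool],
) -> Iterator[bytes]:
    """Yield output audio chunks until a stop command is heard on the input side."""
    mic = iter(incoming_audio_frames)
    for chunk in generate_audio_chunks():
        # Interleave speaking with listening: inspect the latest input frame.
        frame = next(mic, None)
        if frame is not None and detect_stop_command(frame):
            break  # user interrupted; stop speaking immediately
        yield chunk
```

In the actual system the interruption check would presumably be performed by the model itself (or a lightweight classifier) over streaming input; the sketch only illustrates that output generation and input monitoring are interleaved, which is what makes the interaction feel duplex.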