MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang

2025-01-14

Summary

This paper introduces MinMo, a new AI system that can both understand and generate speech well. It's designed to hold natural, real-time conversations with people, almost like talking to another human.

What's the problem?

Current AI systems that work with both speech and text each have drawbacks. Models that fuse speech and text in a single framework tend to lose some of their text abilities, while models that attach speech to an existing text model are usually trained on small speech datasets and only a narrow set of speech tasks. As a result, many of these systems can't handle a wide range of speech tasks or don't sound natural when they talk.

What's the solution?

The researchers created MinMo, a large AI model with about 8 billion parameters. They trained it on a huge amount of speech data, roughly 1.4 million hours of audio, using a multi-stage training approach that teaches the model to handle speech in different ways: turning speech into text, turning text into speech, holding speech-to-speech conversations, and even carrying on full-duplex conversations, where both sides can talk and listen at the same time. They also made MinMo able to follow instructions about how to speak, such as using different emotions, dialects, or speaking speeds, and even mimicking a specific voice.
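To make the staged training idea more concrete, here is a minimal, hypothetical sketch of how such an alignment curriculum could be organized. The stage names, task lists, and the train_stage helper are illustrative assumptions based on the stages named in the abstract, not MinMo's actual training code.

```python
# Hypothetical sketch of a multi-stage alignment curriculum in the spirit of MinMo:
# speech-to-text, text-to-speech, speech-to-speech, then duplex interaction.
# Stage names, task lists, and train_stage() are assumptions for illustration.

STAGES = [
    {"name": "speech_to_text_alignment",     "tasks": ["asr", "speech_translation"]},
    {"name": "text_to_speech_alignment",     "tasks": ["tts", "voice_style_control"]},
    {"name": "speech_to_speech_alignment",   "tasks": ["spoken_qa", "voice_chat"]},
    {"name": "duplex_interaction_alignment", "tasks": ["turn_taking", "barge_in"]},
]

def train_stage(model, stage):
    """Placeholder: fine-tune the multimodal model on one alignment stage."""
    print(f"Training stage: {stage['name']} on tasks {stage['tasks']}")
    return model

def run_curriculum(model):
    # Each stage builds on the previous one, so the underlying text LLM's
    # abilities are preserved while new speech capabilities are layered on.
    for stage in STAGES:
        model = train_stage(model, stage)
    return model

if __name__ == "__main__":
    run_curriculum(model=object())
```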

Why it matters?

This matters because it could make talking to computers feel much more natural and human-like. MinMo can understand speech quickly and respond almost instantly, which could make voice assistants way more useful. It could also help in areas like customer service, where you might want a computer that can have a real conversation. Plus, because MinMo can change how it speaks based on instructions, it could be used in all sorts of creative ways, like in games or entertainment.

Abstract

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
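As a rough illustration of the instruction-controlled speech generation the abstract describes (emotions, dialects, speaking rates, and voice mimicry), here is a small hypothetical sketch. The SpeechStyle dataclass and synthesize function are assumptions made for illustration; the actual MinMo interface has not been released yet.

```python
# Hypothetical interface for instruction-controlled speech generation.
# Nothing here is MinMo's real API; it only illustrates how style instructions
# (emotion, dialect, speaking rate, reference voice) might be expressed.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechStyle:
    emotion: str = "neutral"               # e.g. "cheerful", "angry"
    dialect: str = "standard"              # e.g. a regional accent
    speaking_rate: float = 1.0             # 1.0 = normal speed
    reference_voice: Optional[str] = None  # path to an audio clip to mimic

def synthesize(text: str, style: SpeechStyle) -> bytes:
    """Placeholder for an instruction-following text-to-speech call."""
    # Turn the structured style into a natural-language instruction, since the
    # abstract says speech generation is controlled via user instructions.
    instruction = (f"Speak with a {style.emotion} tone, in a {style.dialect} dialect, "
                   f"at {style.speaking_rate}x speed.")
    if style.reference_voice:
        instruction += f" Mimic the voice in {style.reference_voice}."
    print(f"[synthesize] text={text!r} | instruction={instruction!r}")
    return b""  # a real system would return generated audio bytes

if __name__ == "__main__":
    synthesize("Welcome back! How can I help you today?",
               SpeechStyle(emotion="cheerful", speaking_rate=1.1))
```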