OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang
2025-01-08
Summary
This paper introduces OpenOmni, a new AI system that can understand and create content across different forms like text, images, and speech, with a special focus on generating emotional speech in real time.
What's the problem?
Current AI systems that work with multiple types of content (like text, images, and speech) are mostly owned by big companies and not available to everyone, and there is very little training data that combines all three types at once. It's also really hard to make AI that can create emotional speech quickly and naturally. Together, these problems have made it difficult for researchers to build strong open-source AI systems that can work with all of these different types of content.
What's the solution?
The researchers created OpenOmni, which uses a two-step process to solve these problems. First, they trained an AI to understand connections between text, images, and speech, even when it hasn't seen examples of all three together, by using language as a bridge between the other forms of content. Then, they added a lightweight speech decoder that can quickly create emotional speech, training it on speech tasks and on examples of which spoken responses people prefer.
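To make the two steps more concrete, here is a minimal, hypothetical sketch of what such a pipeline could look like in PyTorch. The module names, sizes, and losses are illustrative assumptions rather than the authors' released code: stage one only trains a small projection that maps image features into a language model already aligned with speech, and stage two trains a lightweight decoder that turns the model's hidden states into discrete speech units.

```python
# Hypothetical sketch of a two-stage recipe like OpenOmni's (names and sizes
# are illustrative assumptions, not the authors' actual code).
import torch
import torch.nn as nn

hidden = 256  # toy size for the sketch

llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2)
image_proj = nn.Linear(512, hidden)       # maps image features into the LLM space
speech_decoder = nn.GRU(hidden, hidden, batch_first=True)  # lightweight decoder
unit_head = nn.Linear(hidden, 1024)       # predicts discrete speech units

# ---- Stage 1: omnimodal alignment (train image_proj, keep the LLM frozen) ----
for p in llm.parameters():
    p.requires_grad = False
opt1 = torch.optim.AdamW(image_proj.parameters(), lr=1e-4)

img_feats = torch.randn(2, 16, 512)       # placeholder image features
text_embeds = torch.randn(2, 32, hidden)  # placeholder text embeddings
tokens = llm(torch.cat([image_proj(img_feats), text_embeds], dim=1))
loss1 = tokens.pow(2).mean()              # stand-in for the language-modeling loss
loss1.backward(); opt1.step(); opt1.zero_grad()

# ---- Stage 2: train the lightweight speech decoder for emotional speech ----
opt2 = torch.optim.AdamW(
    list(speech_decoder.parameters()) + list(unit_head.parameters()), lr=1e-4)
states, _ = speech_decoder(tokens.detach())   # decode from LLM hidden states
unit_logits = unit_head(states)               # logits over discrete speech units
target_units = torch.randint(0, 1024, (2, tokens.size(1)))
loss2 = nn.functional.cross_entropy(unit_logits.transpose(1, 2), target_units)
loss2.backward(); opt2.step(); opt2.zero_grad()
```

The stand-in losses here are placeholders for whatever objectives the paper actually uses; the point of the sketch is only that the first stage never needs paired image-speech data, which is why the alignment can work in a (near) zero-shot way.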
Why it matters?
This matters because it could lead to more advanced AI assistants that communicate more naturally and emotionally, just like humans do. It could improve things like virtual assistants and educational tools, and even help create more realistic characters in games or movies. By making this system open-source, the researchers are allowing other scientists to build on their work, which could speed up progress in this field of AI.
Abstract
Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose OpenOmni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.
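The abstract only states that the speech decoder is refined with "preference learning"; one common way to implement that is a DPO-style objective over pairs of preferred and rejected speech outputs. The sketch below is an assumption about the form such a loss could take, not the paper's exact formulation.

```python
# Minimal DPO-style preference loss sketch (an assumed instantiation of the
# "preference learning" mentioned in the abstract). Inputs are log-probabilities
# of a preferred (more emotionally appropriate) and a rejected speech response
# under the trained policy and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of policy vs. reference for both responses.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Standard DPO objective: -log sigmoid(beta * (chosen - rejected)).
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of four preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -4.2, -6.1, -3.9]),
                torch.tensor([-5.5, -4.9, -6.0, -4.4]),
                torch.tensor([-5.2, -4.5, -6.3, -4.0]),
                torch.tensor([-5.3, -4.6, -6.2, -4.1]))
print(loss)
```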