Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou

2024-07-17

Summary

This paper introduces Qwen2-Audio, a large-scale audio-language model that can understand and respond to audio inputs, allowing for both voice chat and audio analysis.

What's the problem?

Many existing audio-language models struggle to process audio instructions and respond accurately. They often rely on complex hierarchical tags during training and handle only narrow sets of tasks, making them less versatile and harder to use.

What's the solution?

Qwen2-Audio simplifies the interaction process by using natural language prompts instead of complicated tags. It has two main modes: in voice chat mode, users can talk to the model without needing to type anything; in audio analysis mode, users can give both audio and text instructions for the model to analyze sounds. The model is designed to understand various audio inputs, including conversations and commands, enabling it to respond appropriately without needing any special prompts to switch between modes.
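The two modes described above can be sketched as plain chat messages. This is a minimal illustration, assuming the Hugging Face `transformers` chat-message format shown on the Qwen2-Audio model card; the helper function names and the `clip.wav` path are hypothetical, chosen only for this example.

```python
# Sketch of Qwen2-Audio's two interaction modes as chat messages.
# The message schema (role/content with "audio" and "text" entries)
# follows the Hugging Face model-card convention; treat it as an
# assumption, not the paper's official API.

def make_audio_analysis_turn(audio_url: str, question: str) -> dict:
    """Audio analysis mode: the user supplies an audio clip together
    with a text instruction about it."""
    return {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": question},
        ],
    }

def make_voice_chat_turn(audio_url: str) -> dict:
    """Voice chat mode: the user supplies only audio. No text input and
    no system prompt are needed to switch modes."""
    return {
        "role": "user",
        "content": [{"type": "audio", "audio_url": audio_url}],
    }

# Example conversation for audio analysis mode (file name is hypothetical).
conversation = [
    make_audio_analysis_turn("clip.wav", "What sounds are in this recording?"),
]
```

Note that the same message list works for both modes: the model infers from the content whether to chat or to analyze, rather than relying on a mode-switching system prompt.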

Why it matters?

This research is important because it enhances how we interact with AI models through voice, making them more accessible and practical for everyday use. By improving the ability of models like Qwen2-Audio to understand and analyze audio, it opens up new possibilities for applications in customer service, education, and entertainment, where natural communication is key.

Abstract

We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users can provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO (Direct Preference Optimization) has been applied to improve the model's factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
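The abstract mentions that DPO is used to improve factuality and adherence to desired behavior. The report does not restate the objective here, but the standard DPO loss (Rafailov et al., 2023), which such training typically minimizes over preference pairs, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here $x$ is the prompt, $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls the strength of the implicit KL penalty. Whether Qwen2-Audio uses exactly this formulation is not specified in this summary.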