Audio-FLAN: A Preliminary Release
Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue
2025-02-25
Summary
This paper introduces Audio-FLAN, a new dataset designed to help AI models both understand and create audio in a unified way, covering tasks such as speech recognition, music generation, and sound analysis.
What's the problem?
Current AI models are good at either understanding audio or generating it, but not both. This split makes it hard to create AI that can work with audio as flexibly as humans do. There's also a lack of large, diverse datasets that combine both audio understanding and generation tasks, which are needed to train more versatile AI models.
What's the solution?
The researchers created Audio-FLAN, a large-scale dataset with over 100 million examples spanning 80 different audio-related tasks. The dataset includes instructions for both understanding audio (like transcribing speech or identifying music) and generating audio (like synthesizing speech or composing music). By combining these tasks in one dataset, Audio-FLAN aims to help train AI models that can handle a wide range of audio tasks without needing task-specific training for each one.
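As a rough illustration of what "instructions for both understanding and generation in one dataset" means, the sketch below shows two hypothetical instruction-tuning instances. The field names ("task", "instruction", "audio_path", "output") are illustrative assumptions, not the actual Audio-FLAN schema:

```python
# Hypothetical sketch of paired understanding/generation instances in an
# instruction-tuning audio dataset. Field names are assumptions for
# illustration, not the real Audio-FLAN format.

understanding_example = {
    "task": "speech_recognition",               # an understanding task
    "instruction": "Transcribe the speech in the audio clip.",
    "audio_path": "clips/utterance_001.wav",    # model input is audio
    "output": "hello world",                    # model output is text
}

generation_example = {
    "task": "text_to_speech",                   # a generation task
    "instruction": "Synthesize speech for the text: hello world",
    "audio_path": None,                         # no audio input here
    "output": "generated/utterance_001.wav",    # model output is audio
}

# A unified model is trained on both kinds of instances with a single
# instruction-following objective, instead of separate task-specific models.
for example in (understanding_example, generation_example):
    assert {"task", "instruction", "output"} <= example.keys()
```

The key point the sketch captures is that both directions, audio-to-text and text-to-audio, share one instruction-based format, which is what lets a single model be trained across all 80 tasks.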
Why it matters?
This matters because it could lead to AI that's much better at working with audio in all its forms. Imagine having an AI assistant that can not only understand what you say but also create music, identify sounds in your environment, and even mimic voices or sound effects. This kind of versatile audio AI could be incredibly useful in fields like music production, voice assistants, accessibility technology, and even in creating more immersive virtual reality experiences.
Abstract
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.