Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
2026-04-14
Summary
This paper introduces Audio Flamingo Next, a new and more capable AI model designed to understand and reason about audio, including speech, environmental sounds, and music, much like a human would.
What's the problem?
Existing audio-understanding AI models struggled to handle long audio clips, to accurately interpret complex sounds, and to explain *why* they reached a particular conclusion about what they heard. They also lacked the large-scale training data needed to truly master these skills, and they adapted poorly to new, unseen audio tasks.
What's the solution?
The researchers built Audio Flamingo Next by first identifying the weaknesses of its predecessor, Audio Flamingo 3. They then created a massive amount of new audio data (over a million hours' worth) and used a curriculum-based training process that built up the model's abilities in stages. A key innovation is 'Temporal Audio Chain-of-Thought,' which lets the model tie each step of its reasoning to *when* specific events happen in a long audio clip, making that reasoning more transparent and accurate. They also released three versions of the model for others to use.
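The summary doesn't show what these timestamped reasoning traces actually look like, but a minimal sketch can make the idea concrete. In the hypothetical Python below, the bracketed `[MM:SS-MM:SS]` trace format and the `ReasoningStep` structure are illustrative assumptions, not the model's real output format; the point is simply that each reasoning step carries the audio span it relies on, so its conclusions can be checked against the recording.

```python
# A minimal sketch of timestamp-grounded reasoning. The trace format below
# is an illustrative assumption, not AF-Next's actual output format.
import re
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    start_s: float   # start of the grounded audio span, in seconds
    end_s: float     # end of the grounded audio span, in seconds
    thought: str     # the reasoning text tied to that span

# Hypothetical model output: each step cites the audio span it relies on.
trace = """
[00:15-00:42] A dog barks repeatedly over street noise.
[03:10-03:25] A siren grows louder, suggesting an approaching vehicle.
[03:25-03:40] The barking stops right after the siren peaks, so the dog
likely reacted to the emergency vehicle passing by.
"""

STEP_RE = re.compile(r"\[(\d+):(\d+)-(\d+):(\d+)\]\s*(.+?)(?=\n\[|\Z)", re.S)

steps = [
    ReasoningStep(
        start_s=int(m1) * 60 + int(s1),
        end_s=int(m2) * 60 + int(s2),
        thought=" ".join(text.split()),  # normalize line wraps inside a step
    )
    for m1, s1, m2, s2, text in STEP_RE.findall(trace)
]

for step in steps:
    print(f"{step.start_s:6.1f}s-{step.end_s:6.1f}s  {step.thought}")
```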
Why it matters?
This research is important because it significantly advances the field of audio AI. Audio Flamingo Next understands audio better than many other models, including much larger ones. This has real-world implications for improved voice assistants, better audio editing tools, and more accurate sound event detection, ultimately making technology more responsive to the sounds around us.
Abstract
We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source three variants of AF-Next: AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
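For readers curious how one of the released checkpoints might be used, here is a minimal, hypothetical inference sketch. The repository ID, the processor keyword arguments, and the `trust_remote_code` loading path are all assumptions for illustration (following a common Hugging Face transformers pattern for audio-language models); the actual API is whatever the official AF-Next release documents.

```python
# Hypothetical inference sketch for AF-Next-Instruct. The repo ID and the
# processor/generate interface are assumptions; consult the official release.
import librosa
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "nvidia/af-next-instruct"  # hypothetical repository ID

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# AF-Next supports long inputs (up to 30 minutes), so a full recording can
# be passed in one call rather than split into chunks.
audio, sr = librosa.load("long_meeting.wav", sr=16000)

inputs = processor(
    text="At what point does the discussion turn to the budget?",
    audios=audio,               # keyword name is an assumption
    sampling_rate=sr,
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```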