Music Flamingo: Scaling Music Understanding in Audio Language Models
Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro
2025-11-14
Summary
This paper introduces Music Flamingo, a new audio-language model designed to truly *understand* music, not just recognize sounds. It's a big step forward in getting computers to process music in a way that's closer to how humans do.
What's the problem?
Existing audio-language models struggle with music in particular. Music is complex: it has many layers happening at once, changes over time, and packs a lot of information into a single piece. On top of that, there wasn't much high-quality data available to train these models to understand music well, so they could only do simple things like give basic descriptions or answer easy questions, and they didn't generalize well to different styles of music from around the world.
What's the solution?
The researchers created a large dataset called MF-Skills, filled with detailed descriptions and question-answer pairs about music, covering things like harmony, song structure, instruments, lyrics, and cultural background. They then used this dataset to fine-tune an existing model, Audio Flamingo 3, making it much better at understanding music. To help the model *think* through musical concepts, they added a post-training stage: first a warm-up on a chain-of-thought dataset grounded in music theory (MF-Think), then reinforcement learning with custom rewards that encourage good reasoning.
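To make the reward-driven post-training idea concrete, here is a minimal Python sketch of how a GRPO-style training signal can be computed: each sampled answer gets a scalar reward, and advantages are normalized within the group of samples rather than against a learned critic. The specific reward terms (a format check plus a simple answer match) and the example question are illustrative assumptions, not the actual rewards used in the paper.

```python
# Minimal sketch of a GRPO-style signal with a custom reward.
# The reward components below are hypothetical illustrations,
# not the paper's actual reward design.
from statistics import mean, stdev

def custom_reward(response: str, reference: str) -> float:
    """Score one sampled response: reward well-formed reasoning and a correct answer."""
    has_think_block = "<think>" in response and "</think>" in response   # format reward (assumed)
    answer = response.split("</think>")[-1].strip().lower()
    correct = reference.strip().lower() in answer                        # accuracy reward (assumed)
    return 0.2 * has_think_block + 0.8 * correct

def group_relative_advantages(responses: list[str], reference: str) -> list[float]:
    """GRPO normalizes each reward against its own sampling group,
    so no separate value network is needed."""
    rewards = [custom_reward(r, reference) for r in responses]
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Example: a group of sampled answers to a (hypothetical) music-theory question,
# "What is the time signature of this waltz?"
group = [
    "<think>Strong-weak-weak pulse, three beats per bar.</think> 3/4",
    "It is in 4/4.",
    "<think>Feels like a march.</think> 2/4",
]
print(group_relative_advantages(group, reference="3/4"))
```

In this toy group, the fully correct, well-formatted answer gets a positive advantage and the others get negative ones, which is the signal that pushes the model toward careful, theory-grounded reasoning.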
Why does it matter?
Music Flamingo is a significant improvement because it can handle more complex music tasks and demonstrates a deeper understanding of music than previous models. It sets a new standard for how computers can perceive and interact with music, potentially leading to future models that can truly appreciate and engage with music like humans do. It also provides a benchmark for other researchers to build upon.
Abstract
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.