Foundation Models for Music: A Survey
Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elio Quinton, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam
2024-08-27

Summary
This paper provides a comprehensive review of foundation models in music, exploring their impact, development, and potential applications in various fields.
What's the problem?
While foundation models like large language models (LLMs) and latent diffusion models (LDMs) have made significant advancements in many areas, their application to music is still underdeveloped. Many existing methods do not fully exploit the complexity of music data, which limits both music generation and music understanding.
What's the solution?
The authors analyze the different types of foundation models used in music and highlight the need for better training methods and architectures. They discuss key topics such as how to improve models for music generation, music understanding, and even medical applications. They also emphasize the importance of ethical considerations in AI music research, such as copyright and transparency.
Why it matters?
This research matters because it aims to shape the future of human-AI collaboration in music. By identifying gaps in current technology and suggesting improvements, the paper helps pave the way for more advanced and responsible use of AI in creating and understanding music.
Abstract
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations are underexplored in FM development. We then highlight the lack of versatility of previous methods across diverse music applications, along with the potential of FMs for music understanding, generation, and medical applications. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, fine-tuning methodologies, and controllability, we emphasise important topics that deserve further exploration, such as instruction tuning and in-context learning, scaling laws and emergent abilities, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of the datasets and evaluations essential for pre-training and downstream tasks. Finally, underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends for FMs in music, aiming to shape the trajectory of human-AI collaboration in the music realm.