HeartMuLa: A Family of Open Sourced Music Foundation Models
Dongchao Yang, Yuxin Xie, Yuguo Yin, Zheyu Wang, Xiaoyu Yi, Gongxi Zhu, Xiaolong Weng, Zihan Xiong, Yingzhe Ma, Dading Cong, Jingliang Liu, Zihang Huang, Jinghan Ru, Rongjie Huang, Haoran Wan, Peixu Wang, Kuoxi Yu, Helin Wang, Liming Liang, Xianwei Zhuang, Yuanyuan Wang, Haohan Guo
2026-01-16
Summary
This paper introduces a family of open-source music AI models, collectively called 'Heart,' designed to understand and create music in several ways: matching music with text descriptions, recognizing the words sung in songs, compressing music efficiently, and generating entirely new songs.
What's the problem?
Creating AI that can truly understand and generate high-quality music is really hard. Existing systems often require massive amounts of data and computing power, making it difficult for researchers and smaller companies to participate. There was a need for a powerful, yet accessible, music AI toolkit that could be built upon and improved by the wider community.
What's the solution?
The researchers built four main models. First, HeartCLAP aligns audio with text descriptions. Second, HeartTranscriptor accurately transcribes the lyrics of a song. Third, HeartCodec compresses music into a compact sequence of tokens at a low frame rate, preserving the important acoustic details while keeping sequences short enough for an AI to learn long-range musical patterns. Finally, HeartMuLa is the song generator, using a large language model to create music from text descriptions, lyrics, or reference audio. They showed that a system comparable to commercial music generators could be built with resources available to academic institutions, and that scaling HeartMuLa to 7 billion parameters significantly improved its performance.
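As a rough illustration of the audio-text alignment idea behind HeartCLAP, CLAP-style models are typically trained with a symmetric contrastive loss over paired audio and text embeddings. The paper does not publish this code; the sketch below is a minimal, hypothetical version of that standard recipe, with all names and the temperature value chosen for illustration.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders.
    Matching pairs share a row index; all other rows act as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the audio-to-text and text-to-audio cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Training with this loss pulls each clip's embedding toward its own caption and away from the other captions in the batch, which is what lets the model score how well a piece of music matches a text prompt.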
Why it matters?
These models are important because they provide a free and open starting point for anyone wanting to do research in music AI. They lower the barrier to entry, allowing more people to contribute to the field. They also demonstrate that high-quality music AI can be built without enormous resources, and they have practical uses such as generating background music for short videos.
Abstract
We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, HeartMuLa provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural language prompts; and (ii) short, engaging music generation, suitable as background music for short videos. Lastly, HeartMuLa improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.
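To see why HeartCodec's 12.5 Hz frame rate helps autoregressive modeling, a quick back-of-the-envelope comparison is useful. Only the 12.5 Hz figure comes from the paper; the 50 Hz baseline below is an assumption for illustration (many neural audio codecs run at 50-75 Hz), and the function name is hypothetical.

```python
def ar_sequence_length(duration_s: float, frame_rate_hz: float) -> int:
    """Number of codec frames an autoregressive LM must predict."""
    return int(duration_s * frame_rate_hz)

song_seconds = 180  # a typical 3-minute song
for rate in (50.0, 12.5):  # 50 Hz is a hypothetical higher-rate codec
    print(f"{rate:>5} Hz -> {ar_sequence_length(song_seconds, rate):>5} frames")
# 50.0 Hz -> 9000 frames; 12.5 Hz -> 2250 frames.
# A 4x shorter sequence per song means shorter context windows,
# cheaper attention, and faster decoding for the song-generation LM.
```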