JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria
2025-07-29
Summary
This paper introduces JAM, a small, efficient flow-based model that generates songs with precise control over when each word in the lyrics is sung and how long it lasts. It also improves the audio quality of the generated music.
What's the problem?
Many song-generation models offer no control over fine-grained details, such as exactly when a word is sung or how long it is held, and their audio quality is often not good enough for real use.
What's the solution?
JAM addresses this with a flow-matching method that lets it control the timing and duration of individual words in a song. It also uses Direct Preference Optimization (DPO) to improve audio quality, resulting in music that sounds more natural and expressive.
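To make the two techniques named above concrete, here is a minimal sketch of a flow-matching training target and of the standard DPO loss. It is not JAM's implementation: the linear interpolation path, the function names, and the default `beta` are all assumptions chosen for illustration, and a flow-based model typically needs some proxy for the log-probabilities that DPO compares, which this summary does not describe.

```python
import math

# Illustrative sketch only; all names and defaults here are hypothetical.

def interpolate(noise, data, t):
    """Point on a straight path from a noise sample (t=0) to a data sample (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity of the straight path above; it is constant in t."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predicted_velocity, noise, data):
    """Mean squared error between a predicted velocity and the path velocity."""
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair.

    The margin measures how much more the trained model favors the preferred
    sample over the rejected one, relative to a frozen reference model.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch, a model that predicts the path velocity exactly gets a flow-matching loss of zero, and the DPO loss shrinks as the trained model assigns relatively more probability to the preferred sample than the reference model does.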
Why it matters?
This matters because word-level control and better audio quality bring AI-generated music closer to real songs, giving musicians and creators new tools to experiment with and to produce high-quality music more easily.
Abstract
JAM, a flow-matching-based model, introduces word-level timing and duration control in song generation and uses Direct Preference Optimization to enhance audio quality, outperforming existing models in music-specific attributes.