
Aligning Generative Music AI with Human Preferences: Methods and Challenges

Dorien Herremans, Abhinaba Roy

2025-11-20


Summary

This paper discusses how recent AI music generators, despite sounding realistic and covering many styles, often don't quite 'get' what people actually *want* to hear. It argues for techniques that explicitly align the AI's output with human musical tastes.

What's the problem?

Current AI music systems are built to optimize technical objectives, like sounding realistic or matching a certain genre. Musical enjoyment, however, is subjective and complex. There's a disconnect between what a computer's loss function rates as 'good' music and what a person actually likes, especially for things like how a piece flows over time (temporal coherence), whether the chords fit together (harmonic consistency), and overall quality, which is hard to define mathematically.

What's the solution?

The paper surveys several recent advances that try to close this gap: systems that learn directly from people's preferences, frameworks that balance multiple preferences at once, and methods that refine the music even as it's being generated. It points to MusicRL, DiffRhythm+, and Text2midi-InferAlign as examples of these techniques applied to music, tackling issues like keeping a longer piece coherent and pleasant-sounding throughout. (A minimal sketch of the preference-learning idea follows below.)
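To make the preference-learning idea concrete, here is a minimal sketch of a pairwise preference loss in the style of Direct Preference Optimization (DPO), one family of techniques in this space. Everything here is illustrative: the function name, the use of summed token log-probabilities over a symbolic music sequence, and the default `beta` are assumptions for the sketch, not the implementation used by the paper or by MusicRL (which is based on reinforcement learning from human feedback rather than this exact loss).

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise preference loss (illustrative sketch).

    Each logp_* is the summed log-probability of a generated piece
    (e.g. a sequence of MIDI tokens) under the model being trained;
    each ref_logp_* is the same quantity under a frozen reference
    model. `beta` controls how far the model may drift from the
    reference while chasing human preferences.
    """
    # How much more (or less) the trained model likes each sample
    # than the frozen reference model does.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    # Push the human-preferred sample above the dispreferred one.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The appealing design choice here is that the model never needs a numeric 'quality score', only which of two pieces a listener preferred, which sidesteps the problem of defining musical quality mathematically.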

Why it matters?

Getting AI to create music people genuinely enjoy is important because it could lead to amazing new tools for musicians, allowing them to collaborate with AI in creative ways. It could also power personalized music services that create soundtracks perfectly tailored to your taste, and generally make music creation more accessible and enjoyable for everyone. This requires collaboration between computer scientists and music experts.

Abstract

Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences due to the specific loss functions they use. This paper advocates for the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs including MusicRL's large-scale preference learning, multi-preference alignment frameworks like diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques like Text2midi-InferAlign, we discuss how these techniques can address music's unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges, including scalability to long-form compositions and reliability in preference modelling, among others. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research combining advances in machine learning and music theory to create music AI systems that truly serve human creative and experiential needs.
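On the inference-time side mentioned in the abstract, one common pattern is best-of-N sampling: generate several candidates and keep the one a preference or consistency scorer ranks highest, so the base model never needs retraining. The sketch below is a generic illustration of that pattern; `generate` and `score` are hypothetical callables, and this is not a description of Text2midi-InferAlign's actual procedure.

```python
def best_of_n(generate, score, prompt, n=8):
    """Pick the best of n independently sampled candidates.

    generate(prompt) -> one candidate piece (e.g. a MIDI token sequence)
    score(candidate) -> scalar alignment score, such as a learned
                        preference reward or a text-music consistency
                        measure; higher is better.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Larger `n` trades extra compute for better alignment, which connects directly to the scalability challenges the paper raises for long-form compositions.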