Diversity-Rewarded CFG Distillation
Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin, Alexandre Ramé
2024-10-10

Summary
This paper introduces a method called diversity-rewarded CFG distillation, a finetuning procedure that lets generative music models like MusicLM produce diverse, high-quality music without the extra inference cost of Classifier-Free Guidance.
What's the problem?
Generative models, particularly those used for music creation, often rely on a technique called Classifier-Free Guidance (CFG) to produce better outputs. However, CFG requires an extra unconditioned forward pass at each generation step, which roughly doubles the time it takes to generate music, and it tends to limit the originality and variety of the results. This makes it harder to create unique and varied compositions.
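To make the cost issue concrete, here is a minimal sketch of how CFG is typically applied at inference time. The `model(tokens, conditioning=...)` interface is a hypothetical stand-in for a MusicLM-style autoregressive token model, and the exact guidance formula used in the paper may differ; the point is that each step needs two forward passes instead of one.

```python
import torch

def cfg_logits(model, tokens, prompt_embedding, guidance_scale=3.0):
    """One CFG generation step: two forward passes instead of one.

    `model`, `tokens`, and `prompt_embedding` are placeholders for a
    text-to-music token model; the cost doubling comes from running the
    model once with and once without the prompt conditioning.
    """
    cond = model(tokens, conditioning=prompt_embedding)  # conditioned pass
    uncond = model(tokens, conditioning=None)            # unconditioned pass
    # Push the prediction toward the prompt-conditioned distribution.
    return uncond + guidance_scale * (cond - uncond)
```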
What's the solution?
The authors propose a new approach called diversity-rewarded CFG distillation. The method pursues two objectives: first, teaching the model to imitate the CFG-augmented predictions so that CFG is no longer needed at generation time, and second, encouraging the model to produce diverse outputs for a given prompt through reinforcement learning with a diversity reward. Finetuning this way yields high-quality results without the extra inference cost of CFG. Additionally, they introduce a weight-merging technique that interpolates between a quality-focused model and a diversity-focused model, giving control over the quality-diversity trade-off at deployment time.
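The sketch below illustrates the two training signals under stated assumptions: a KL-based distillation loss toward a frozen teacher's CFG-augmented prediction, and a diversity reward computed as negative mean pairwise similarity between embeddings of generations for the same prompt. The function names, the specific guidance formula, and the cosine-similarity reward are illustrative choices, not necessarily the paper's exact formulation; in practice the diversity reward would be maximized with an RL method rather than by direct backpropagation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_cond_logits, teacher_uncond_logits,
                      guidance_scale=3.0):
    """KL divergence between the student's prediction and the frozen
    teacher's CFG-augmented prediction, so the student learns to imitate
    CFG without running two forward passes at inference time."""
    cfg_logits = teacher_uncond_logits + guidance_scale * (
        teacher_cond_logits - teacher_uncond_logits)
    target = F.softmax(cfg_logits, dim=-1)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")

def diversity_reward(embeddings):
    """Negative mean pairwise cosine similarity between embeddings of
    several generations for the same prompt: higher when samples differ more."""
    normed = F.normalize(embeddings, dim=-1)          # (n_samples, dim)
    sim = normed @ normed.T                           # pairwise similarities
    n = sim.shape[0]
    off_diag = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return -off_diag
```

How the two terms are weighted against each other is a training hyperparameter; the paper finetunes separate checkpoints with different emphases, which is what makes the weight merging described next useful.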
Why it matters?
This research is significant because it enhances how generative models can create music, making them more efficient and capable of producing a wider variety of sounds. By improving both quality and diversity in music generation, this method can benefit artists, game developers, and anyone else who relies on AI for creative tasks, ultimately leading to richer musical experiences.
Abstract
Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the first focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the MusicLM (Agostinelli et al., 2023) text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations at https://google-research.github.io/seanet/musiclm/diverse_music/.
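The model merging mentioned in the abstract amounts to linear interpolation in weight space. Below is a minimal sketch assuming two PyTorch state dicts from checkpoints that share an architecture (one quality-focused, one diversity-focused); the interpolation coefficient `lam` is the deployment-time knob, and the paper's exact merging recipe may involve additional details.

```python
def merge_weights(quality_state_dict, diversity_state_dict, lam=0.5):
    """Linearly interpolate two finetuned checkpoints with identical keys.

    lam=1.0 keeps the quality-focused model, lam=0.0 the diversity-focused
    one; intermediate values trade off quality against diversity.
    """
    return {
        name: lam * quality_state_dict[name] + (1.0 - lam) * diversity_state_dict[name]
        for name in quality_state_dict
    }
```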