Presto! Distilling Steps and Layers for Accelerating Music Generation
Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
2024-10-08

Summary
This paper introduces Presto!, a new method that speeds up generating music from text descriptions while maintaining high quality.
What's the problem?
Generating music from text with current diffusion-based methods is slow and compute-hungry: producing high-quality results takes many sampling steps, and each step is itself expensive. That makes fast turnaround difficult, which matters in creative work where iteration speed is valuable.
What's the solution?
To solve this problem, the authors developed Presto!, which attacks both sources of cost: the number of sampling steps and the price of each step. To cut steps, they introduce a score-based distribution matching distillation (DMD) method for the EDM family of diffusion models used in music generation, the first GAN-based distillation method for text-to-music. To cut per-step cost, they improve an existing layer distillation method so that it better preserves the variance of hidden states, which helps the distilled model learn. Combining the two techniques, Presto! generates music 10 to 18 times faster than its base model while still producing high-quality outputs.
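To make the layer-distillation side concrete, here is a minimal PyTorch sketch of the underlying idea. This is not the authors' code: the class name, the `keep` mask, and the per-depth rescaling rule are illustrative assumptions standing in for the paper's variance-preservation scheme. What it demonstrates is that when a transformer block is skipped to save compute, the hidden state can be rescaled so its variance stays close to what downstream blocks were trained to see.

```python
import torch
import torch.nn as nn

class VariancePreservingLayerSkip(nn.Module):
    """Illustrative sketch only (not the paper's implementation).

    Wraps a stack of transformer blocks. When a block is skipped under a
    reduced compute budget, the hidden state is rescaled toward a per-depth
    reference standard deviation, so downstream blocks see activations with
    roughly the statistics they were trained on.
    """

    def __init__(self, blocks: nn.ModuleList, ref_std: torch.Tensor):
        super().__init__()
        self.blocks = blocks
        # Hypothetical per-depth reference stds, e.g. recorded from the
        # full model's hidden states on a small calibration set.
        self.register_buffer("ref_std", ref_std)

    def forward(self, x: torch.Tensor, keep: list[bool]) -> torch.Tensor:
        # keep[i] is True if block i runs under the current compute budget.
        for i, block in enumerate(self.blocks):
            if keep[i]:
                x = block(x)
            else:
                # Skipped block: rescale the hidden state toward the
                # depth-i reference std to preserve its variance.
                cur_std = x.std(dim=-1, keepdim=True).clamp(min=1e-6)
                x = x * (self.ref_std[i] / cur_std)
        return x
```

In use, one would record `ref_std` from the full model on a calibration batch, then run the wrapped stack with a `keep` mask chosen for the desired speed/quality trade-off.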
Why it matters?
This research is important because it makes text-to-music generation dramatically more efficient; the authors report it is the fastest high-quality text-to-music method they know of. By shrinking the time between prompt and audio, Presto! opens up new possibilities for artists and developers in the music industry, allowing faster iteration and more room for creativity in how music is produced.
Abstract
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32-second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
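For readers curious how distribution matching distillation works mechanically, the sketch below shows one generator update in the general DMD style. The function signatures, the re-noising rule, and the surrogate loss are assumptions for illustration; the paper's score-based variant for EDM models differs in parameterization and weighting. The core idea: samples from the few-step generator are re-noised, and the generator is nudged so that the score of its own output distribution matches the frozen teacher's score.

```python
import torch

def dmd_generator_step(generator, real_score, fake_score, z, sigma):
    """One DMD-style generator update (illustrative sketch, hypothetical
    signatures). `real_score` is the frozen teacher diffusion model;
    `fake_score` is an auxiliary score model trained on generator outputs.
    The caller is assumed to zero gradients and step the optimizer."""
    x = generator(z)                            # few-step sample from noise z
    x_noisy = x + sigma * torch.randn_like(x)   # re-noise to level sigma

    with torch.no_grad():
        s_real = real_score(x_noisy, sigma)     # score of the data distribution
        s_fake = fake_score(x_noisy, sigma)     # score of the generator distribution

    # Distribution-matching direction: (s_fake - s_real) vanishes when the
    # generator's distribution matches the teacher's. The surrogate loss
    # below has exactly this gradient with respect to x_noisy, which flows
    # back through x into the generator's parameters.
    grad = (s_fake - s_real).detach()
    loss = (grad * x_noisy).sum()
    loss.backward()
    return loss
```

In the full method, this update is interleaved with training `fake_score` on fresh generator samples, and Presto! additionally incorporates a GAN-based objective; both are omitted here for brevity.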