PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

Yungang Yi, Weihua Li, Matthew Kuo, Quan Bai

2024-11-14

Summary

This paper presents PerceiverS, a new model that generates long, expressive symbolic music by jointly modeling a piece's overall structure and its fine-grained expressive nuances.

What's the problem?

Generating symbolic music that is both long and expressive is challenging. Many existing models struggle to maintain coherence over longer pieces while also capturing the subtle details that make music feel alive and engaging.

What's the solution?

The authors developed PerceiverS, which combines two key techniques: Effective Segmentation and Multi-Scale attention. Together, these let the model learn the long-term structure of a piece while also attending to short-term expressive details. By combining cross-attention and self-attention at multiple scales, PerceiverS produces music that is both coherent and varied. Evaluated on datasets such as Maestro, the model shows clear improvements in structural consistency and expressive variation.
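To make the attention pattern concrete, here is a minimal, hypothetical PyTorch sketch of a Perceiver-style block in which a small learned latent array first cross-attends to a long token context (long-range structure) and then self-attends (local refinement). All names, dimensions, and the module layout are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    """Illustrative Perceiver-style block: cross-attention, then self-attention."""

    def __init__(self, dim=512, num_latents=256, heads=8):
        super().__init__()
        # Learned latent array acting as a compact summary of the long input.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Cross-attention: latents (queries) attend to the full token context,
        # which is how long-range structure can be captured cheaply.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention among the latents refines shorter-range detail.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, context):
        # context: (batch, seq_len, dim) embeddings of a long symbolic piece.
        lat = self.latents.unsqueeze(0).expand(context.size(0), -1, -1)
        attended, _ = self.cross_attn(lat, context, context)   # long-range
        lat = self.norm1(lat + attended)
        refined, _ = self.self_attn(lat, lat, lat)             # local
        return self.norm2(lat + refined)

# Usage: compress a 4096-token sequence into 256 latent vectors.
block = PerceiverBlock()
out = block(torch.randn(2, 4096, 512))
print(out.shape)  # torch.Size([2, 256, 512])
```

Because the latent array is small and fixed-size, the attention cost scales linearly with context length rather than quadratically, which is the usual motivation for Perceiver-style designs on long sequences.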

Why it matters?

This research is important because it advances the field of music generation, allowing for the creation of more complex and emotionally engaging compositions. By improving how machines generate music, this work could enhance applications in entertainment, education, and art, making AI-generated music more appealing to listeners.

Abstract

Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a significant challenge. In this paper, we propose PerceiverS (Segmentation and Scale), a novel architecture designed to address this issue by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. By combining cross-attention and self-attention in a Multi-Scale setting, PerceiverS captures long-range musical structure while preserving performance nuances. The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music with both structural consistency and expressive variation. The project demos and the generated music samples can be accessed through the link: https://perceivers.github.io.
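The segmentation side of the design is easier to see with a toy example. The sketch below assumes that segmentation here means carving a long token stream into overlapping fixed-size training windows; the window and stride values, and the function itself, are illustrative assumptions, and the paper's Effective Segmentation scheme may choose boundaries differently.

```python
# Hypothetical segmentation helper: split a tokenized piece into overlapping
# fixed-size windows so that training segments do not sever musical phrases.
def segment(tokens, window=1024, stride=512):
    for start in range(0, max(len(tokens) - window + 1, 1), stride):
        yield tokens[start:start + window]

piece = list(range(3000))               # stand-in for a tokenized MIDI piece
segments = list(segment(piece))
print(len(segments), len(segments[0]))  # 4 1024
```

Overlapping windows are one common way to let a model see every transition between adjacent segments during training, at the cost of some duplicated computation.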