Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu
2024-12-16

Summary
This paper introduces a new method for generating music from multiple types of input, such as text, images, and videos, using a system called Visuals Music Bridge (VMB).
What's the problem?
Generating music from diverse sources is challenging because existing methods often suffer from a lack of training data, weak alignment across input modalities, and limited user control over the generated music. These shortcomings can lead to low-quality music that does not match the user's expectations.
What's the solution?
The authors introduce VMB, which uses explicit connections (or 'bridges') of text and music to tie the different inputs together. A Multimodal Music Description Model turns visual inputs into detailed text descriptions, while a Dual-track Music Retrieval module combines broad and targeted retrieval to find reference music matching both the input and the user's preferences. An Explicitly Conditioned Music Generation framework then produces music from the two bridges, yielding better alignment between the input and the generated music and higher-quality, more customizable outputs.
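The three-stage flow can be summarized in a short sketch. This is only an illustration of the pipeline described above, assuming nothing about the released code: the function names (describe_visuals, dual_track_retrieval, generate) and their behavior are hypothetical placeholders, not the authors' actual API.

```python
# Illustrative sketch of the VMB flow: text bridge -> music bridge ->
# explicitly conditioned generation. All names are hypothetical stand-ins.

def describe_visuals(visual_input: str) -> str:
    """Text bridge: stand-in for the Multimodal Music Description Model,
    which converts a video or image into a detailed music description."""
    return f"uplifting orchestral score to accompany {visual_input}"

def dual_track_retrieval(description: str, user_query: str | None = None) -> list[str]:
    """Music bridge: stand-in for Dual-track Music Retrieval, combining
    broad retrieval over the full description with targeted retrieval
    driven by user-specified attributes."""
    broad = [f"track broadly matching: {description}"]
    targeted = [f"track targeted to: {user_query}"] if user_query else []
    return broad + targeted

def generate(description: str, references: list[str]) -> str:
    """Stand-in for Explicitly Conditioned Music Generation: the generator
    is conditioned on both explicit bridges rather than a shared embedding."""
    return f"music conditioned on '{description}' and {len(references)} reference track(s)"

if __name__ == "__main__":
    desc = describe_visuals("a slow-motion sunrise over mountains")
    refs = dual_track_retrieval(desc, user_query="gentle solo piano")
    print(generate(desc, refs))
```

The key design point mirrored here is that the two bridges are explicit, inspectable intermediates (a text description and retrieved reference music), which is what gives users a handle for control.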
Why it matters?
This research is important because it enhances the ability of AI to create music that is more aligned with user intentions and diverse inputs. By improving how different types of data are combined to generate music, VMB can be used in various fields such as film scoring, video game soundtracks, and personalized music creation, making it easier for artists and creators to produce unique soundtracks.
Abstract
Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application to multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks. The results demonstrate that VMB significantly enhances music quality, modality alignment, and customization compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.