CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages

Shangda Wu, Zhancheng Guo, Ruibin Yuan, Junyan Jiang, Seungheon Doh, Gus Xia, Juhan Nam, Xiaobing Li, Feng Yu, Maosong Sun

2025-02-17

Summary

This paper introduces CLaMP 3, an AI system that connects different types of music information, such as sheet music, audio recordings, and text descriptions in many languages. It is designed to make music search easier and more accurate across formats and languages.

What's the problem?

Current music search systems struggle to work with different types of music information at the same time, like connecting a written description to the right piece of sheet music or audio recording. They also have trouble understanding music descriptions in many different languages, which limits how useful they can be for people around the world.

What's the solution?

The researchers created CLaMP 3, which uses contrastive learning to place sheet music, performance signals, audio recordings, and multilingual text in one shared representation space, so that text can act as a bridge between music formats that were never directly paired. They also built M4-RAG, a dataset of 2.31 million music-text pairs in 27 languages, and WikiMT-X, a benchmark of 1,000 triplets of sheet music, audio, and text descriptions, to check how well the system works. CLaMP 3's text encoder can also adapt to languages it was not specifically trained on, which makes the system flexible.

Why does it matter?

This matters because it could make finding and understanding music much easier for people all over the world. It could help musicians, researchers, and music lovers find the exact piece of music they're looking for, even if they only have a description in their own language or a snippet of the song. This technology could also help preserve and share musical traditions from different cultures by making them more accessible to a global audience.

Abstract

CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities--including sheet music, performance signals, and audio recordings--with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
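To make the idea of contrastive alignment in a shared space more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the abstract does not specify CLaMP 3's actual loss or hyperparameters. The sketch shows a generic symmetric contrastive (InfoNCE-style) loss over paired music and text embeddings, plus a cosine-similarity ranking of the kind a shared space makes possible.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity_logits(music_embs, text_embs, temperature=0.07):
    """Cosine similarity of every music item with every text, divided by a
    temperature (0.07 is an assumed value, not taken from the paper)."""
    music = [l2_normalize(v) for v in music_embs]
    text = [l2_normalize(v) for v in text_embs]
    return [[dot(m, t) / temperature for t in text] for m in music]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct match for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(music_embs, text_embs, temperature=0.07):
    """Average of the music-to-text and text-to-music losses for a batch in
    which music_embs[i] and text_embs[i] describe the same piece."""
    logits = similarity_logits(music_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar to the query."""
    query = l2_normalize(query_emb)
    scores = [dot(query, l2_normalize(c)) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy embeddings (hypothetical): two pieces and their matching descriptions.
    music = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[0.9, 0.1], [0.1, 0.9]]
    swapped_text = [[0.1, 0.9], [0.9, 0.1]]
    print("loss, matched pairs:", symmetric_contrastive_loss(music, matched_text))
    print("loss, swapped pairs:", symmetric_contrastive_loss(music, swapped_text))
    print("ranking for first description:", rank_candidates(matched_text[0], music))
```

The loss is low when each piece sits closest to its own description and high when the pairs are swapped, which is the training signal that pulls matching items together. Once several modalities are each aligned with text in this way, their embeddings become comparable with one another through the same ranking step, which is the sense in which text can serve as a bridge.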
