WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu
2025-09-08
Summary
This paper introduces a benchmark for testing how well artificial intelligence models that understand both images and text can reason about music. It focuses on their ability to read images of actual music scores, rather than audio, and answer questions about them.
What's the problem?
Current AI models perform well on many tasks that combine images and language, but their ability to understand written music (notes, rhythms, harmony, and musical structure) has gone largely untested. There was no good benchmark, or standard test, built from real-world scores and the kinds of questions musicians actually ask.
What's the solution?
The researchers created a dataset called WildScore, which pairs images of real music scores with questions and discussions that people actually posted about the music. They also built a taxonomy of musical knowledge, spanning broad categories and fine-grained topics, to organize the evaluation. Finally, they framed each question as multiple choice, which keeps the testing controlled and consistent, and measured several state-of-the-art AI models against the benchmark; a sketch of what one test instance and scoring loop might look like appears below.
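To make the multiple-choice framing concrete, here is a minimal sketch of a WildScore-style instance and an accuracy-scoring loop. Everything in it, the field names, the example question, the RandomGuesser baseline, and the single-letter answer extraction, is a hypothetical illustration under assumed conventions; it is not the benchmark's actual schema or released evaluation code.

import random

# One hypothetical benchmark instance: a rendered score page, a
# musicological question, lettered choices, the gold answer, and
# taxonomy labels (all field names are illustrative assumptions).
instance = {
    "score_image": "scores/example_page.png",
    "question": "What harmonic device drives the opening phrase?",
    "choices": {
        "A": "A circle-of-fifths sequence",
        "B": "Chromatic descending bass with suspensions",
        "C": "A pedal point on the dominant",
        "D": "Parallel major-minor alternation",
    },
    "answer": "B",
    "taxonomy": {"high_level": "Harmony", "fine_grained": "Non-chord tones"},
}

def evaluate(model, dataset):
    """Score a model by exact match on the predicted choice letter."""
    correct = 0
    for ex in dataset:
        prompt = (
            f"{ex['question']}\n"
            + "\n".join(f"{k}. {v}" for k, v in ex["choices"].items())
            + "\nAnswer with a single letter."
        )
        # model.ask is a stand-in for whatever MLLM API is used; here it
        # takes an image path and a text prompt and returns reply text.
        reply = model.ask(image=ex["score_image"], prompt=prompt)
        predicted = reply.strip()[:1].upper()  # naive letter extraction
        correct += predicted == ex["answer"]
    return correct / len(dataset)

class RandomGuesser:
    """Chance baseline: picks one of the four letters at random."""
    def ask(self, image, prompt):
        return random.choice("ABCD")

print(evaluate(RandomGuesser(), [instance]))  # 0.25 expected over many instances

A real harness would likely add more robust answer parsing and report accuracy broken down by the taxonomy's categories, but the overall multiple-choice loop is this simple, which is what makes the evaluation controlled and scalable.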
Why it matters?
This work matters because it highlights both the promise and the limits of AI for music understanding. It gives researchers a tool for building models that can assist musicians with tasks like analysis and composition, and it pinpoints where current models still fall short of grasping the intricacies of music theory and practice.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.