MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor

2025-07-01

Summary

This paper introduces MARBLE, a deliberately hard benchmark designed to measure how well multimodal language models handle complex reasoning tasks that combine images and text, requiring them to plan step by step under physical and spatial constraints.

What's the problem?

Current AI models struggle to combine and reason over different kinds of information, such as images and text, across multiple steps, and they often fail on problems that require an understanding of spatial and physical constraints.

What's the solution?

MARBLE provides two difficult tasks, inspired by puzzle games, in which models must carefully plan multiple steps while interpreting both visual and textual input. It evaluates not just the final answer but the reasoning process itself, exposing where models struggle most.

Why it matters?

This matters because it exposes the limits of today's AI on complex, realistic problems that combine vision and language, and it pushes researchers to build models that can reason and plan more reliably across multiple steps in multimodal settings.

Abstract

MARBLE is a challenging multimodal reasoning benchmark that tests the step-by-step reasoning capabilities of multimodal language models on complex tasks, highlighting their limitations in both perception and reasoning.