
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang

2024-10-10


Summary

This paper introduces ING-VP, a new benchmark designed to evaluate how well multimodal large language models (MLLMs) handle spatial reasoning and multi-step planning in interactive, vision-based games.

What's the problem?

While MLLMs have shown strong performance on many tasks, they struggle to understand and plan in scenarios that require tracking spatial relationships over multiple steps, especially in games. Existing multimodal benchmarks do not offer a focused test of multi-step planning grounded in the spatial layout of an image, making it hard to assess how capable these models really are in such interactive environments.

What's the solution?

To address this gap, the authors created ING-VP, an interactive game-based vision planning benchmark that includes six different games spanning 300 levels, each with six unique configurations. Across the full benchmark, a single model engages in more than 60,000 rounds of interaction, testing its ability to plan and reason spatially. The framework supports several comparison settings, such as image-plus-text versus text-only inputs, single-step versus multi-step reasoning, and with-history versus without-history conditions. This comprehensive approach helps identify the strengths and weaknesses of various models; a simplified sketch of such an evaluation loop is shown below.
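To make the interaction protocol more concrete, here is a minimal, self-contained sketch of what an evaluation loop over games, levels, and comparison settings could look like. Every name in it (`load_level`, `build_prompt`, `query_mllm`, `apply_move`, the toy one-dimensional game) is an illustrative assumption for this sketch, not the actual ING-VP implementation, which lives in the project's repository.

```python
import random
from itertools import product

# Toy stand-ins for the environment and the model. The real ING-VP framework
# renders game boards as images and queries actual MLLMs; none of that is
# reproduced here -- this only illustrates how the comparison settings combine.

def load_level(game: str, level_id: int) -> dict:
    """Hypothetical: return a trivial puzzle state (goal = reach position 3)."""
    return {"pos": 0, "goal": 3}

def build_prompt(state: dict, input_mode: str, history) -> str:
    """Hypothetical: serialize the state (and optionally the move history) as text."""
    base = f"pos={state['pos']}, goal={state['goal']}, mode={input_mode}"
    return base + (f", history={history}" if history is not None else "")

def query_mllm(model, prompt: str, step_mode: str) -> str:
    """Stand-in for a real MLLM call; here we simply pick a random legal move."""
    return random.choice(["left", "right"])

def apply_move(state: dict, move: str) -> dict:
    delta = 1 if move == "right" else -1
    return {"pos": state["pos"] + delta, "goal": state["goal"]}

INPUT_MODES = ["image-text", "text-only"]
STEP_MODES = ["single-step", "multi-step"]
HISTORY_MODES = ["with-history", "without-history"]

def evaluate(model, games, levels_per_game=5, max_rounds=20):
    """Iterate over every game/level/setting combination and report average accuracy."""
    solved, total = 0, 0
    for game, input_mode, step_mode, history_mode in product(
        games, INPUT_MODES, STEP_MODES, HISTORY_MODES
    ):
        for level_id in range(levels_per_game):
            state = load_level(game, level_id)
            history = [] if history_mode == "with-history" else None
            for _ in range(max_rounds):
                prompt = build_prompt(state, input_mode, history)
                move = query_mllm(model, prompt, step_mode)
                state = apply_move(state, move)
                if history is not None:
                    history.append(move)
                if state["pos"] == state["goal"]:
                    solved += 1
                    break
            total += 1
    return solved / total

if __name__ == "__main__":
    accuracy = evaluate(model=None, games=["toy-maze"])
    print(f"average accuracy: {accuracy:.2%}")
```

The point of the sketch is the structure, not the toy game: separating the environment step (`apply_move`) from the model query makes it easy to toggle single-step versus multi-step prompting and to include or withhold the interaction history, which is exactly the kind of controlled comparison the benchmark is built around.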

Why it matters?

This research is important because it exposes how limited current MLLMs are at complex spatial reasoning and planning: even the best-performing model tested, Claude-3.5 Sonnet, reached an average accuracy of only 3.37%. By establishing a focused benchmark like ING-VP, researchers can better understand where these models fall short and drive improvements in their ability to plan and reason in dynamic environments. This could lead to advances in AI applications that require a sophisticated understanding of space and movement, such as robotics and interactive gaming.

Abstract

As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, offering valuable insights into the model's capabilities. We evaluated numerous state-of-the-art MLLMs, with the highest-performing model, Claude-3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs' capacity for complex spatial reasoning and planning. The code is publicly available at https://github.com/Thisisus7/ING-VP.git.