JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa

2025-12-17

Summary

This paper introduces a new, challenging test called JMMMU-Pro for evaluating how well AI models understand both images and Japanese text together, and also presents a new way to create these kinds of tests efficiently.

What's the problem?

Current AI models, specifically those called Large Multimodal Models (LMMs), aren't very good at understanding questions that require them to process both an image *and* Japanese text at the same time. Existing tests weren't difficult enough to really show how well these models could handle this combined understanding, and creating good tests is usually expensive and time-consuming.

What's the solution?

The researchers created JMMMU-Pro, which combines the image and the question text into a single image, forcing the AI to truly integrate visual and textual information. To build the benchmark, they used a powerful image generator called Nano Banana Pro to create candidate questions as images, then had people verify each output and, when necessary, regenerate it with an adjusted prompt to ensure quality. This method, called Vibe Benchmark Construction, makes creating these tests much cheaper and faster.
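The generate-verify-regenerate loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_image` and `passes_review` are hypothetical stand-ins for the actual image model (Nano Banana Pro) and the human verification step, neither of which has a public API specified in the paper.

```python
def generate_image(prompt: str, attempt: int) -> str:
    """Stand-in for the image generator: returns a fake 'image' identifier."""
    return f"image({prompt!r}, attempt={attempt})"

def passes_review(image: str, attempt: int) -> bool:
    """Stand-in for human verification; here we pretend the second try passes."""
    return attempt >= 2

def vibe_construct(prompt: str, max_attempts: int = 3):
    """Generate a candidate visual question, have it verified, and
    regenerate with an adjusted prompt until it passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        image = generate_image(prompt, attempt)
        if passes_review(image, attempt):
            return image           # accepted into the benchmark
        prompt += " (adjusted)"    # a human tweaks the prompt and retries
    return None                    # discard if quality cannot be reached
```

The key design point is that the expensive part, authoring polished question images, is delegated to the generator, while humans only act as cheap-per-item reviewers who accept, reject, or nudge the prompt.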

Why it matters?

JMMMU-Pro is important because it’s a much harder test than previous ones, and it shows that even the best open-source AI models struggle with it. This highlights a key area where AI needs to improve, specifically in understanding Japanese language and visual information together. The new method for building these tests also provides a useful guide for others who want to create similar evaluations in the future.

Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.