
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang

2025-04-15


Summary

This paper introduces MDK12-Bench, a new benchmark for testing how well large language models that can handle both text and images (called multimodal large language models) can reason through problems across different school subjects.

What's the problem?

The problem is that most current tests for these advanced AI models focus on only one subject or one type of question, so it's hard to know whether the models can actually understand and reason across the wide range of real-world topics that students face in school.

What's the solution?

To solve this, the researchers created MDK12-Bench, a benchmark built from real K-12 educational exams spanning many different subjects. They use it to check how well these multimodal language models can reason through and answer questions that combine words and pictures, just as students do on school exams.

Why it matters?

This matters because it shows whether these AI models are truly ready to support real educational tasks, not just simple or single-topic problems. By testing them in a way that matches real school challenges, we can better understand their strengths and weaknesses and make them more useful for education.

Abstract

MDK12-Bench evaluates multimodal reasoning capabilities of MLLMs using diverse real-world educational tests across multiple disciplines.