SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models
Yuhang Su, Mei Wang, Yaoyao Zhong, Guozhang Li, Shixing Li, Yihan Feng, Hua Huang
2026-01-13
Summary
This paper introduces a new way to test how well artificial intelligence understands and grades hand-drawn diagrams, like those students create in science and math classes.
What's the problem?
Current AI models, specifically those that combine image and text understanding (multimodal large language models), are very good at recognizing clean images, but they struggle with messy, hand-drawn sketches. This is a real problem because grading a sketch requires understanding not just *what* is drawn but *why* a student might have made a mistake: the grader must follow the student's thought process as well as the structure of the diagram. Existing benchmarks don't really push AI to demonstrate this kind of diagnostic understanding.
What's the solution?
The researchers created a new benchmark called SketchJudge. It includes 1,015 student-drawn diagrams across four subjects: geometry, physics, charts, and flowcharts. The diagrams vary in drawing style and contain a range of common error types. The researchers then evaluated several advanced AI models on SketchJudge and found that they graded significantly worse than human graders, demonstrating that the benchmark is genuinely challenging.
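The evaluation compares each model's diagnosis of a hand-drawn diagram against a human-annotated error label. A minimal sketch of that loop is below; the per-item fields (`subject`, `image_path`, `gold_error`) and the `predict` callable standing in for an MLLM are hypothetical illustrations, not the actual schema from the SketchJudge repository.

```python
from dataclasses import dataclass

# Hypothetical record layout for one SketchJudge item; the real dataset's
# schema is defined in the linked repository and may differ.
@dataclass
class SketchItem:
    subject: str      # e.g. "geometry", "physics", "charts", "flowcharts"
    image_path: str   # the student's hand-drawn diagram
    gold_error: str   # human-annotated error label (or "correct")

def grading_accuracy(items, predict):
    """Fraction of items where the model's diagnosed error matches the label."""
    if not items:
        return 0.0
    hits = sum(1 for item in items if predict(item) == item.gold_error)
    return hits / len(items)

# Toy usage with a stand-in "model" that always answers "correct".
items = [
    SketchItem("geometry", "g001.png", "wrong_angle"),
    SketchItem("physics", "p001.png", "correct"),
]
print(grading_accuracy(items, lambda item: "correct"))  # 0.5
```

In the paper's setting, `predict` would wrap an MLLM call that receives the diagram image and the problem statement; the gap between this score for models and for human graders is what the benchmark measures.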
Why it matters?
This work is important because it highlights a weakness in current AI systems. If we want AI to be helpful in education, it needs to be able to understand and provide feedback on student work, even when that work isn't perfect or neatly presented. SketchJudge provides a valuable tool for researchers to develop AI that can better understand and assess student thinking through their diagrams, ultimately leading to better educational tools.
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.