GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang
2026-03-13
Summary
This paper introduces a new way to test how well AI models can understand and edit images, specifically when an edit requires knowledge from academic fields such as science or history.
What's the problem?
Current tests for image-editing AI focus on simple, everyday images and don't really challenge the AI to apply knowledge from specific academic subjects. They don't test whether the AI can make edits that make sense within a particular field of study, like correctly modifying a science diagram or a historical scene. In short, existing benchmarks aren't good at measuring whether AI can *reason* while editing images.
What's the solution?
The researchers created a new benchmark called GRADE, with 520 carefully curated samples across 10 academic subjects, ranging from the natural sciences to the social sciences. They also developed a way to score the AI's edits along three dimensions: whether the edit applies the right subject knowledge (Discipline Reasoning), whether the edited image still looks realistic and coherent (Visual Consistency), and whether the intended change is clearly legible in the result (Logical Readability). They then tested 20 different AI models on this benchmark. A rough sketch of how such a protocol could work is given below.
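To make the three-part scoring concrete, here is a minimal sketch of how a multi-dimensional protocol like this might aggregate per-sample judgments into per-dimension and overall scores. The dimension names follow the paper; everything else (the 0–10 scale, equal weighting, and the example values) is a hypothetical illustration, not the authors' actual implementation.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names follow the GRADE protocol; the scale, weighting,
# and aggregation scheme below are assumptions for illustration.
DIMENSIONS = ("discipline_reasoning", "visual_consistency", "logical_readability")

@dataclass
class EditScore:
    discipline_reasoning: float  # does the edit apply correct domain knowledge?
    visual_consistency: float    # does the edited image stay realistic and coherent?
    logical_readability: float   # is the intended change clearly legible in the result?

    def overall(self, weights=(1.0, 1.0, 1.0)) -> float:
        """Weighted mean across the three dimensions (equal weights assumed)."""
        vals = (self.discipline_reasoning, self.visual_consistency, self.logical_readability)
        return sum(w * v for w, v in zip(weights, vals)) / sum(weights)

def benchmark_average(scores: list[EditScore]) -> dict[str, float]:
    """Aggregate per-sample scores into per-dimension and overall averages."""
    report = {dim: mean(getattr(s, dim) for s in scores) for dim in DIMENSIONS}
    report["overall"] = mean(s.overall() for s in scores)
    return report

# Example: two hypothetical samples scored on a 0-10 scale.
scores = [
    EditScore(discipline_reasoning=7.0, visual_consistency=8.5, logical_readability=6.0),
    EditScore(discipline_reasoning=4.5, visual_consistency=9.0, logical_readability=5.5),
]
print(benchmark_average(scores))
```

Reporting per-dimension averages alongside an overall score is what lets a benchmark like this separate models that edit realistically from models that edit knowledgeably, which is the paper's central distinction.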
Why does it matter?
This work shows that current AI models struggle to edit images when the task requires specialized knowledge. It highlights where AI needs to improve to be truly useful in fields like education or research, and it gives researchers a tool for developing and testing models that can reason about the world while editing, the way a human expert would.
Abstract
Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning the natural and social sciences. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.