GenExam: A Multidisciplinary Text-to-Image Exam
Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
2025-09-18
Summary
This paper introduces GenExam, a new way to test how well AI models can truly 'think' and create: it gives them exam-style questions, similar to school exams, that require both subject knowledge and the ability to generate images from text.
What's the problem?
Current AI benchmarks are good at checking whether a model *understands* information or can *reason* with it, and others check whether a model knows facts about the world when creating images. However, there was no good way to test whether an AI could combine all of these skills (understanding, reasoning, *and* creating) the way a challenging exam requires. Existing tests did not really push AI to demonstrate deep, integrated understanding through image creation.
What's the solution?
The researchers created 'GenExam,' a set of 1,000 exam questions spanning 10 subjects and organized under a four-level topic taxonomy. Each question asks the AI to generate an image from a text prompt, much like drawing an answer on an exam, and comes with a ground-truth answer image plus fine-grained scoring points that spell out how semantically accurate and visually realistic the generated image must be. The researchers then tested several advanced AI models, such as GPT-Image-1 and Gemini-2.5-Flash-Image, on these exams.
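To make the setup concrete, here is a minimal sketch of what a single GenExam item could look like, assuming a simple schema with a prompt, subject, four-level taxonomy, ground-truth image, and scoring points. The field names and the example question are invented for illustration and are not the benchmark's actual format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one GenExam item, based on the description above
# (exam-style prompt, subject, four-level taxonomy, ground-truth image,
# fine-grained scoring points). Field names are illustrative assumptions,
# not the benchmark's published schema.
@dataclass
class GenExamItem:
    prompt: str                           # exam-style text-to-image question
    subject: str                          # one of the 10 subjects, e.g. "Physics"
    taxonomy: tuple[str, str, str, str]   # four-level topic classification
    ground_truth_image: str               # path to the reference answer image
    scoring_points: list[str] = field(default_factory=list)  # criteria covering
                                          # semantic correctness and visual plausibility

# Example item (invented for illustration only):
example = GenExamItem(
    prompt="Draw the electric field lines between two equal and opposite point charges.",
    subject="Physics",
    taxonomy=("Physics", "Electromagnetism", "Electrostatics", "Field lines"),
    ground_truth_image="images/physics_0001.png",
    scoring_points=[
        "Field lines start on the positive charge and end on the negative charge",
        "Field lines never cross each other",
        "Line density is highest near the charges",
    ],
)
```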
Why it matters?
This is important because it provides a much harder test for AI than existing benchmarks. Even the best models struggled: the strongest scored under 15% on GenExam's strict metric, and most scored close to 0%. This suggests that while AI is getting better, it still has a long way to go before it can truly demonstrate 'general intelligence', that is, the ability to learn, understand, and create across many different areas much as a human can.
Abstract
Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate knowledge, reasoning, and generation, providing insights on the path to general AGI.
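The abstract refers to "strict scores" derived from fine-grained scoring points, but the exact rubric is not reproduced here. The sketch below assumes one common convention, in which an image earns full credit only when every scoring point is satisfied; that assumption also illustrates why strict averages can sit near 0% even when models get many individual points right.

```python
# Hypothetical scoring sketch: assumes each scoring point has already been
# judged pass/fail (e.g. by a grader or a judge model). This is an assumed
# convention for illustration, not GenExam's published evaluation protocol.
def relaxed_score(point_results: list[bool]) -> float:
    """Fraction of scoring points satisfied (partial credit)."""
    return sum(point_results) / len(point_results) if point_results else 0.0

def strict_score(point_results: list[bool]) -> float:
    """Full credit only if every scoring point is satisfied, otherwise zero."""
    return 1.0 if point_results and all(point_results) else 0.0

# Example: an image satisfying 2 of 3 points gets partial credit but a
# strict score of 0, which is how strict benchmark averages end up near 0%.
results = [True, True, False]
print(relaxed_score(results))  # 0.666...
print(strict_score(results))   # 0.0
```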