AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu

2026-04-03

Summary

This paper investigates how well current AI image generators can create illustrations suitable for academic papers, something that hasn't been thoroughly tested before.

What's the problem?

Evaluating whether an AI-generated illustration accurately represents the information in a research paper is difficult, because it demands a near-perfect understanding of both the text and the image. Existing methods rely on other AI models to make this judgment, but those models aren't reliable when the texts and illustrations are long and complex.

What's the solution?

The researchers created a new benchmark called AIBench. It uses Visual Question Answering (VQA) – essentially, asking the AI targeted questions about the image – to check whether the illustration logically matches the paper's methods. They designed questions at four levels of detail, derived from a logic diagram summarized from each paper's method section, to assess the illustration's correctness; a separate judge evaluates how visually appealing it is. This approach focuses on logical consistency and places weaker demands on the judging AI's comprehension.
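To make the idea concrete, here is a minimal sketch of how such a VQA-based scoring loop might look. All names, levels, and weights are hypothetical, and the VLM judge is replaced by a stub; a real setup would send the generated illustration and each yes/no question to a vision-language model.

```python
# Hypothetical sketch of AIBench-style VQA scoring (not the authors' code).
from dataclasses import dataclass

@dataclass
class Question:
    level: int   # 1 = coarse overall structure ... 4 = fine-grained detail
    text: str

def ask_vqa(illustration, question):
    """Stub judge: a real implementation would query a VLM with the
    illustration and return True/False for the yes/no question."""
    # Toy rule: the illustration "answers" questions up to its detail level.
    return question.level <= illustration["detail_level"]

def logic_score(illustration, questions):
    """Fraction of yes/no checks passed, overall and broken down by level."""
    per_level = {}
    for q in questions:
        ok = ask_vqa(illustration, q)
        total, passed = per_level.get(q.level, (0, 0))
        per_level[q.level] = (total + 1, passed + int(ok))
    overall = sum(p for _, p in per_level.values()) / max(1, len(questions))
    return overall, {lvl: p / t for lvl, (t, p) in per_level.items()}

# Example questions, coarse to fine (illustrative only).
questions = [
    Question(1, "Does the figure show the overall pipeline?"),
    Question(2, "Are the encoder and decoder modules both present?"),
    Question(3, "Does an arrow connect the output to the loss?"),
    Question(4, "Are the attention heads labeled as in the paper?"),
]
illustration = {"detail_level": 2}  # toy stand-in for a generated image
overall, by_level = logic_score(illustration, questions)
```

Scoring per level, rather than with a single verdict, is what lets the benchmark report where an illustration fails: in the toy run above, the coarse structural checks pass while the fine-grained ones do not.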

Why it matters?

This work shows that AI image generators struggle far more with accurate academic illustrations than with general image creation. It highlights the difficulty of making an image simultaneously logically correct and aesthetically pleasing, as professionally designed illustrations are. It also suggests that scaling up the AI's reasoning and generation effort at test time, while the image is being created, significantly improves the quality of these illustrations.

Abstract

Although image generation has boosted various applications via its rapid evolution, whether state-of-the-art models can produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating an illustration with a VLM is naive but requires oracle multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark using VQA for evaluating the logical correctness of academic illustrations and VLMs for assessing aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of the paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general tasks, reflecting their varying complex reasoning and high-density generation abilities. Further, logic and aesthetics are hard to optimize simultaneously, as they are in handcrafted illustrations. Additional experiments show that test-time scaling of both abilities significantly boosts performance on this task.