DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan

2026-04-27

Summary

This paper introduces DiagramBank, a new collection of scientific diagrams created to help AI systems automatically generate better scientific papers, specifically focusing on the often-overlooked but crucial 'teaser figure'.

What's the problem?

Current 'AI scientist' systems can write papers and code, but they struggle with creating effective teaser figures – those initial, eye-catching graphics that summarize a paper's main idea. These figures aren't just simple data plots; they require a conceptual understanding of the research and the ability to visually communicate complex ideas in a way that grabs attention and sparks interest. Existing AI systems either skip this step or produce low-quality alternatives.

What's the solution?

The researchers built DiagramBank, a large dataset of 89,422 schematic diagrams taken from papers published at top-tier venues. An automated curation pipeline extracts figures together with the places where the text refers to them, then a CLIP-based filter separates schematic diagrams from standard plots and natural images. Each diagram is paired with context from its paper (the abstract, the caption, and the in-text references to the figure), so a system can retrieve relevant exemplars from queries at different levels of detail. The researchers also release a retrieval-augmented generation codebase that demonstrates how to use retrieved exemplars when synthesizing new teaser figures.
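The filtering step can be illustrated with a minimal sketch of a zero-shot decision rule. It is not the authors' implementation: the label set, the 0.5 threshold, and the function names are assumptions made for illustration, and the similarity scores are assumed to come from an image-text model such as CLIP (one score per label prompt).

```python
import math

# Hypothetical label prompts; the paper does not list the prompts it uses.
LABELS = ["schematic diagram", "data plot", "natural image"]


def softmax(logits):
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_schematic(similarities, threshold=0.5):
    """Keep a figure if the 'schematic diagram' label is probable enough.

    `similarities` holds one image-text similarity score per entry in LABELS,
    in the same order. The threshold is an assumed value, not one from the paper.
    """
    probs = softmax(similarities)
    return probs[LABELS.index("schematic diagram")] >= threshold
```

A figure whose score is highest for the "schematic diagram" prompt by a clear margin passes the filter; one that looks more like a plot or a photograph is dropped.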

Why does it matter?

This work is important because it addresses a key limitation in AI-driven scientific discovery. By providing a dedicated dataset and tools for generating teaser figures, it moves us closer to fully automated paper generation, potentially speeding up the research process and making scientific information more accessible.

Abstract

Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and to write and execute code. However, producing a publication-grade scientific diagram (e.g., a teaser figure) is still a major bottleneck in the "end-to-end" paper generation process. A teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots: it demands conceptual synthesis and planning to translate a complex logical workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline, which extracts figures and their corresponding in-text references and uses a CLIP-based filter to differentiate schematic diagrams from standard plots and natural images. Each instance is paired with rich context, from the abstract and caption to figure-reference pairs, enabling information retrieval at different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml-rpi/DiagramBank.
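Retrieval at different query granularities can be sketched as ranking the same records against different context fields. The example below is a toy illustration under stated assumptions, not the released codebase: the records, the field names (`abstract`, `caption`, `reference`), and the helper functions are hypothetical, and a bag-of-words cosine similarity stands in for a real multimodal encoder.

```python
import math
import re
from collections import Counter

# Toy records with hypothetical field names; not taken from DiagramBank.
RECORDS = [
    {
        "id": "fig-a",
        "abstract": "We propose a transformer model for protein structure prediction.",
        "caption": "Overview of the encoder and decoder pipeline.",
        "reference": "Figure 1 shows the overall architecture of our model.",
    },
    {
        "id": "fig-b",
        "abstract": "We study reinforcement learning for robot navigation.",
        "caption": "Schematic of the agent and environment loop.",
        "reference": "As illustrated in Figure 2, the agent receives a reward.",
    },
]


def embed(text):
    """Stand-in embedding: lowercase token counts (a real system would use a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, records, field, k=1):
    """Return the ids of the k records whose `field` text best matches the query."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(r[field])), reverse=True)
    return [r["id"] for r in ranked[:k]]
```

Choosing `field="abstract"` matches a query against the paper-level topic, while `field="caption"` or `field="reference"` matches against figure-level descriptions, so the same index can serve coarse and fine queries.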
