Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li

2025-10-07

Summary

This research focuses on improving how AI creates and edits structured images like charts and diagrams, which differ from typical photos. Current AI models are good at making realistic pictures but struggle with visuals that demand precise layout, readable text, and factually accurate content.

What's the problem?

Existing AI image generators aren't very good at creating or editing things like graphs, flowcharts, or mathematical figures. These images require more than just looking nice; they need to be logically organized, include correctly rendered text, and accurately represent data. The model needs to 'understand' what a chart *means* to create it properly, and current models lack this ability.

What's the solution?

The researchers built a large-scale dataset of 1.3 million structured image pairs, derived from executable drawing programs and annotated with step-by-step (chain-of-thought) explanations of how each figure was constructed. They then trained a unified model that combines a visual language model (VLM) with an image generation system called FLUX.1 Kontext, joined by a lightweight connector so the model can better ground generation in both images and text. Training proceeded in three stages that progressively align visual features, infuse knowledge, and add reasoning-augmented generation, with an external reasoner providing a further boost at inference time. Finally, they introduced a new benchmark, StructBench, with over 1,700 difficult generation and editing examples, plus a scoring method called StructScore that checks factual correctness by asking multiple rounds of questions about the image.
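To make the evaluation idea concrete, here is a minimal sketch of how a StructScore-style, question-based factual check could be computed. The function name `struct_score`, the `ask_vqa_model` callable, the exact-match scoring, and the sample questions are illustrative assumptions, not the paper's actual protocol, which likely uses a more robust multi-round answer-matching scheme.

```python
# Sketch of a question-answering factuality score for a generated image.
# Assumes a caller supplies ask_vqa_model(image=..., question=...) -> str,
# e.g. a wrapper around any visual question-answering model.

def struct_score(generated_image, qa_pairs, ask_vqa_model):
    """Return the fraction of ground-truth Q&A pairs the image answers
    correctly, asked one question per round (illustrative sketch)."""
    if not qa_pairs:
        return 0.0
    correct = 0
    for question, expected in qa_pairs:
        answer = ask_vqa_model(image=generated_image, question=question)
        # Exact-match scoring is a simplification for illustration only.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(qa_pairs)

# Hypothetical Q&A pairs derived from the chart's underlying data.
qa_pairs = [
    ("What is the value of the 2023 bar?", "42"),
    ("Which series has the highest final value?", "Revenue"),
]
```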

Why it matters?

This work is important because it addresses a significant limitation of current AI image generation. Being able to reliably create and edit structured visuals opens up possibilities for automated report generation, educational materials, data visualization, and more. By releasing the dataset, model, and testing tools, the researchers hope to encourage further development in this area and build AI that can handle a wider range of visual tasks.

Abstract

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.
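The abstract mentions integrating a VLM with FLUX.1 Kontext "via a lightweight connector" but does not spell out the architecture here. Below is a minimal sketch of what such a connector could look like: a small projection module mapping VLM hidden states into the generator's conditioning space. The class name, dimensions, and layer choices are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class VLMConnector(nn.Module):
    """Illustrative 'lightweight connector': projects VLM hidden states
    into the conditioning space expected by an image generator.
    Dimensions and layers are assumptions, not the paper's architecture."""

    def __init__(self, vlm_dim=4096, gen_dim=3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, gen_dim),
            nn.GELU(),
            nn.Linear(gen_dim, gen_dim),
        )
        self.norm = nn.LayerNorm(gen_dim)

    def forward(self, vlm_hidden_states: torch.Tensor) -> torch.Tensor:
        # vlm_hidden_states: (batch, seq_len, vlm_dim), e.g. features over
        # the instruction and chain-of-thought tokens produced by the VLM.
        return self.norm(self.proj(vlm_hidden_states))

# Hypothetical usage: condition the generator on projected VLM features.
connector = VLMConnector()
vlm_features = torch.randn(1, 77, 4096)       # stand-in for real VLM output
conditioning = connector(vlm_features)        # shape: (1, 77, 3072)
```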