Structured Extraction from Business Process Diagrams Using Vision-Language Models

Pritam Deka, Barry Devereux

2025-12-02

Summary

This paper explores a new way for artificial intelligence to understand BPMN (Business Process Model and Notation) diagrams, which are visual maps of how a business operates.

What's the problem?

Currently, if you want a computer to 'read' a BPMN diagram and analyze it, you need the original computer file the diagram was created in (usually an XML file). This is a problem because often you only have a picture of the diagram, like a screenshot, and not the original file. Without the original file, it's hard for computers to understand what the diagram actually *means*.

What's the solution?

The researchers developed a system that uses powerful AI models, called Vision-Language Models (VLMs), to directly 'look' at an image of a BPMN diagram and convert it into a structured format (JSON) that a computer can process. They also added a step that recognizes text in the image using Optical Character Recognition (OCR), feeding the recovered labels to the AI to help it read the diagram more accurately. Finally, they benchmarked several VLMs and compared different OCR enrichment and prompting strategies to see which worked best.
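The OCR-enrichment idea can be sketched in a few lines: attach the OCR-recovered text to the extraction prompt, then pull a JSON object out of the model's free-form reply. This is a minimal illustrative sketch, not the authors' implementation; `build_prompt` and `parse_model_output` are hypothetical helper names, and the actual VLM call is omitted.

```python
import json

def build_prompt(ocr_text):
    """Combine the extraction instruction with OCR-recovered labels.

    The OCR text is auxiliary: it helps the model resolve small or
    low-resolution labels it might otherwise misread in the image.
    """
    return (
        "List every BPMN element in the diagram as JSON, e.g. "
        '{"elements": [{"type": "task", "label": "..."}]}.\n'
        f"Text found in the image by OCR:\n{ocr_text}"
    )

def parse_model_output(raw):
    """Extract the first JSON object from a model reply that may wrap
    it in prose or a markdown code fence; return None on failure."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None

# Example: a typical fenced model reply.
reply = ('Here is the result:\n```json\n'
         '{"elements": [{"type": "startEvent", "label": "Order received"}]}'
         '\n```')
parsed = parse_model_output(reply)
print(parsed["elements"][0]["type"])  # startEvent
```

In practice the prompt from `build_prompt` would be sent to a VLM together with the diagram image; only the robust JSON parsing and the enrichment step are shown here.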

Why it matters?

This is important because it means you can analyze business processes even if you only have a picture of the diagram. This is useful in many real-world situations where the original files are lost or unavailable, allowing for analysis and improvement of workflows without needing the source files.

Abstract

Business Process Model and Notation (BPMN) is a widely adopted standard for representing complex business workflows. While BPMN diagrams are often exchanged as visual images, existing methods primarily rely on XML representations for computational analysis. In this work, we present a pipeline that leverages Vision-Language Models (VLMs) to extract structured JSON representations of BPMN diagrams directly from images, without requiring source model files or textual annotations. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists against ground truth data derived from the source XML files. Our approach enables robust component extraction in scenarios where original source files are unavailable. We benchmark multiple VLMs and observe performance improvements in several models when OCR is used for text enrichment. In addition, we conduct extensive statistical analyses of OCR-based enrichment methods and prompt ablation studies, providing a clearer understanding of their impact on model performance.
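The element-list evaluation described in the abstract can be sketched as a set comparison between predicted and ground-truth elements. The scheme below (matching elements on (type, label) pairs and reporting precision/recall/F1) is an illustrative scoring convention, not necessarily the paper's exact metric.

```python
def element_scores(predicted, ground_truth):
    """Precision, recall, and F1 over BPMN elements matched as (type, label) pairs."""
    pred = {(e["type"], e["label"]) for e in predicted}
    gold = {(e["type"], e["label"]) for e in ground_truth}
    tp = len(pred & gold)  # elements found in both lists
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Ground truth derived from the BPMN XML; predictions from the VLM output.
gold = [{"type": "startEvent", "label": "Order received"},
        {"type": "task", "label": "Check stock"}]
pred = [{"type": "startEvent", "label": "Order received"},
        {"type": "task", "label": "Ship order"}]
p, r, f1 = element_scores(pred, gold)
print(round(p, 2), round(r, 2))  # 0.5 0.5
```

Matching on exact labels is deliberately strict; a fuzzier label comparison (e.g. normalized edit distance) would be a natural extension when OCR output is noisy.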