PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

2025-10-17


Summary

This paper introduces PaddleOCR-VL, a compact vision-language model for document parsing: it reads document images, such as scanned pages or photos of text, and extracts their content.

What's the problem?

Currently, many document understanding systems are either very large and require a lot of computing power, or they aren't very accurate at recognizing all the different parts of a document – things like text, tables, and even charts. It's hard to find a system that's both powerful *and* efficient, especially one that works well with many different languages.

What's the solution?

The researchers created PaddleOCR-VL, which is built around a small, efficient 'brain' called PaddleOCR-VL-0.9B, a vision-language model with 0.9 billion parameters. This 'brain' combines two key parts: a visual encoder that looks at the document image at a flexible resolution (a NaViT-style design) and a language model (ERNIE-4.5-0.3B) that turns what it sees into text. It supports 109 languages and recognizes the different elements within a document, including text, tables, formulas, and charts, while using few computing resources.

Why does it matter?

PaddleOCR-VL is important because it offers a strong balance between accuracy and efficiency. This means it can be used in real-world applications where computing power is limited or speed is crucial, like automatically processing invoices or extracting data from legal documents. On public and in-house benchmarks it outperforms existing document parsing solutions and is competitive with much larger vision-language models, making it a practical choice for many document processing tasks.

Abstract

In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
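To make the idea of a dynamic-resolution visual encoder more concrete, the following minimal Python sketch shows how the number of visual tokens can follow an image's own size and aspect ratio instead of being fixed by a resize to one square resolution. It is an illustration only, not the PaddleOCR-VL implementation: the function name `patch_grid`, the patch size of 14 pixels, and the token budget of 1024 are assumptions chosen for the example and do not come from the report.

```python
import math


def patch_grid(width, height, patch=14, max_tokens=1024):
    """Return (cols, rows) of a patch grid that follows the image's shape.

    Hypothetical helper: `patch` and `max_tokens` are illustrative values,
    not settings taken from the PaddleOCR-VL report.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # Cover the image with patches at its native resolution.
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)

    # If the grid exceeds the token budget, shrink both sides by the same
    # factor so the aspect ratio is roughly preserved.
    tokens = cols * rows
    if tokens > max_tokens:
        scale = math.sqrt(max_tokens / tokens)
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return cols, rows


if __name__ == "__main__":
    # A small square crop and a wide page produce different token counts.
    for size in [(224, 224), (1400, 700)]:
        cols, rows = patch_grid(*size)
        print(size, "->", (cols, rows), "tokens:", cols * rows)
```

Under these assumptions, a small crop costs only a few hundred tokens, while a large, wide page is capped at the budget and keeps its roughly 2:1 shape. This is the general trade-off a dynamic-resolution encoder is meant to exploit.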
