Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen
2026-03-18
Summary
This paper introduces Qianfan-OCR, a new artificial intelligence model that processes document images containing text, tables, and charts in a single pass, converting them directly into a readable format such as Markdown.
What's the problem?
Traditional Optical Character Recognition (OCR) systems often handle text recognition and document layout analysis as separate steps, which can compound errors, especially on complex documents. Newer 'end-to-end' OCR models do everything at once, but because they no longer perform explicit layout analysis, they sometimes struggle to accurately determine where elements sit on the page and in what order they should be read.
What's the solution?
The researchers developed a technique called 'Layout-as-Thought'. When prompted with special 'think tokens', the model first 'thinks' about the document's layout, producing bounding boxes around elements, each element's type (such as a table or heading), and the order in which they should be read. Only then does it generate the final output. This initial layout analysis step helps the model ground its understanding of the page, improving accuracy on complex layouts.
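To make the idea concrete, here is a minimal sketch of what consuming a Layout-as-Thought response might look like. The `<think>` token names, the JSON schema, and the field names (`bbox`, `type`, `order`) are illustrative assumptions for this sketch, not the paper's actual token vocabulary or output format:

```python
import json

# Assumed delimiters for the optional "thinking" phase (hypothetical names).
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

# A mocked-up model response: a layout plan inside the think tokens,
# followed by the final Markdown answer.
raw_output = (
    "<think>"
    '[{"bbox": [40, 30, 560, 80], "type": "heading", "order": 0},'
    ' {"bbox": [40, 100, 560, 400], "type": "table", "order": 2},'
    ' {"bbox": [40, 90, 560, 98], "type": "paragraph", "order": 1}]'
    "</think>"
    "# Quarterly Report\n\nRevenue grew 12%.\n"
)

def split_layout_and_markdown(text: str):
    """Separate the layout 'thought' from the final Markdown answer."""
    start = text.index(THINK_OPEN) + len(THINK_OPEN)
    end = text.index(THINK_CLOSE)
    layout = json.loads(text[start:end])
    markdown = text[end + len(THINK_CLOSE):]
    # Sort elements into reading order so downstream code can walk the page.
    layout.sort(key=lambda el: el["order"])
    return layout, markdown

layout, markdown = split_layout_and_markdown(raw_output)
print([el["type"] for el in layout])  # → ['heading', 'paragraph', 'table']
```

The point of the structured intermediate step is that the layout plan is both machine-checkable (boxes and types can be validated against the image) and useful on its own, recovering the layout-grounding ability that purely end-to-end decoding gives up.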
Why it matters?
Qianfan-OCR is a significant step forward because it achieves state-of-the-art results on several document understanding benchmarks, even outperforming larger and more complex models like Gemini-3.1-Pro in some areas. This makes it a powerful tool for tasks such as extracting information from invoices, understanding charts, and answering questions about documents, and it is readily available through the Baidu AI Cloud Qianfan platform.
Abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.