Multimodal OCR: Parse Anything from Documents
Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei Ma, Yu Chen, Zhaohui Yu, Guang Yang, Colin Zhang
2026-03-16
Summary
This paper introduces Multimodal OCR (MOCR), a new approach to Optical Character Recognition that doesn't just read text from documents but also understands and reconstructs visual elements like charts and diagrams as usable code.
What's the problem?
Traditional OCR systems mainly focus on recognizing text and treat images as just pictures, essentially throwing away important information about the document's structure and meaning. Because they don't model the relationships between text and visuals, computers can't truly 'understand' a document, which makes it hard to automatically recreate or edit one accurately.
What's the solution?
The researchers developed a system called dots.mocr that treats both text and graphics as equally important parts of a document. It's trained to understand how these elements relate to each other and can rebuild them as structured data, almost like code. They created a large dataset of documents and trained a relatively small model to do this effectively, using a process of pre-training and fine-tuning.
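To make the idea concrete, here is a minimal, illustrative sketch (not the actual dots.mocr API or output format — the function names, page structure, and SVG layout are all invented for illustration). It contrasts a conventional OCR pass, which keeps a chart only as cropped pixels, with an MOCR-style pass that emits the same chart as structured, code-level markup:

```python
# Hypothetical example: how a page with a title and a bar chart might be
# represented under the two paradigms. All names and formats are assumptions.

def conventional_ocr(page):
    """Text is recognized; graphics survive only as opaque image crops."""
    return [
        {"type": "text", "content": page["title"]},
        {"type": "image", "content": "crop_0.png"},  # chart = pixels, no semantics
    ]

def multimodal_ocr(page):
    """Text AND graphics are parsed into unified textual representations."""
    # Rebuild the chart as SVG code: one <rect> per bar, baseline at y=100.
    bars = "".join(
        f'<rect x="{i * 30}" y="{100 - v}" width="20" height="{v}"/>'
        for i, v in enumerate(page["chart_values"])
    )
    return [
        {"type": "text", "content": page["title"]},
        {"type": "svg", "content": f'<svg viewBox="0 0 120 100">{bars}</svg>'},
    ]

page = {"title": "Quarterly revenue", "chart_values": [40, 70, 55, 90]}
elements = multimodal_ocr(page)
print(elements[1]["content"])
```

The key difference is that the SVG output is editable, renderable, and usable as image-to-code training supervision, whereas the cropped pixels from the conventional pass carry no recoverable structure.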
Why it matters?
This work is important because it moves beyond simply recognizing text to actually understanding a document's content, including its visual elements. That opens the door to more faithful document reconstruction, automated editing, and the creation of large image-to-code datasets for training even more capable AI models that work with images and text seamlessly.
Abstract
We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.