MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He

2026-03-25

Summary

This paper introduces a new approach to Optical Character Recognition (OCR), the process of converting images of text into machine-readable text. It focuses on structured documents, where the output must capture layout, tables, and formulas as well as running text.

What's the problem?

Most current OCR systems, including those built on advanced vision-language models, generate their output autoregressively: one token at a time, from left to right. Decoding time therefore grows with document length, and a mistake made early on can propagate into later tokens. The paper argues that this left-to-right order is an artifact of how the output is serialized, not something the task itself requires.

What's the solution?

The researchers propose a new system called MinerU-Diffusion, which treats OCR as inverse rendering: recovering the text from the page image it was rendered into. Instead of generating tokens sequentially, it uses diffusion decoding, which denoises many token positions in parallel while conditioned on the image, like filling in several missing pieces of a puzzle at once. A block-wise diffusion decoder keeps inference efficient on long sequences, and an uncertainty-driven curriculum learning strategy keeps training stable.
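
The idea of filling in pieces in parallel can be illustrated with a small toy sketch. This is not MinerU-Diffusion's decoder: the names `toy_denoiser` and `blockwise_parallel_decode`, the `block_size` and `per_step` parameters, and the rule of committing the most confident positions first are all assumptions made for illustration. In particular, the stand-in denoiser simply reads the answer from the target and assigns random confidences, where a real model would predict tokens from the page image.

```python
import random

MASK = None  # marks a position that has not been decoded yet


def toy_denoiser(target, seq, start, end, rng):
    """Hypothetical stand-in for an image-conditioned denoiser.

    Proposes (confidence, position, token) for each masked position in
    seq[start:end]. This toy reads the token from `target` and draws a
    random confidence; a real model would predict both from the page image.
    """
    return [(rng.random(), i, target[i]) for i in range(start, end) if seq[i] is MASK]


def blockwise_parallel_decode(target, block_size=8, per_step=4, seed=0):
    """Fill one block at a time, committing up to `per_step` tokens per step."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    steps = 0
    for start in range(0, len(target), block_size):
        end = min(start + block_size, len(target))
        while any(tok is MASK for tok in seq[start:end]):
            proposals = toy_denoiser(target, seq, start, end, rng)
            # Commit the most confident proposals; the rest stay masked.
            proposals.sort(key=lambda p: p[0], reverse=True)
            for _, i, tok in proposals[:per_step]:
                seq[i] = tok
            steps += 1
    return seq, steps


if __name__ == "__main__":
    text = list("a + b = c")
    decoded, steps = blockwise_parallel_decode(text)
    print("".join(decoded), "| parallel steps:", steps,
          "| one-token-at-a-time steps:", len(text))
```

Setting `per_step=1` makes the toy commit a single token per step, which matches the step count of sequential decoding. Larger values trade fewer steps for more positions committed at once. The step counts here are properties of the toy only and say nothing about the speedups reported in the paper.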

Why does it matter?

This research matters because it speeds up OCR decoding by up to 3.2x compared to autoregressive baselines while consistently improving robustness, especially on long and complex documents. On the Semantic Shuffle benchmark proposed in the paper, the system also depends less on linguistic priors and more on the visual content of the page. This could lead to better document digitization and analysis in many fields.

Abstract

Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.
