MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao
2026-04-07
Summary
This paper focuses on improving how computers understand documents, but instead of building bigger and more complex models, it concentrates on making the training data better.
What's the problem?
Currently, a lot of effort in document understanding goes into designing new and improved model architectures. However, the authors observed that state-of-the-art models of different architectures and sizes consistently fail on the same set of hard examples. This suggests the bottleneck isn't the model itself, but the quality and variety of the data used to train it. Essentially, models are only as good as the information they learn from.
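The "consistent failure patterns" observation can be made concrete by measuring how much the failure sets of different models overlap. Below is a minimal illustrative sketch (not from the paper; model names, document IDs, and the use of Jaccard similarity are all assumptions) of that kind of analysis:

```python
# Hypothetical sketch: quantify overlap between two models' failure sets.
# High overlap suggests a shared deficiency in the training data rather
# than a weakness of any one architecture.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy failure sets: IDs of documents each model parsed incorrectly.
failures = {
    "model_small": {"doc_03", "doc_17", "doc_42", "doc_88"},
    "model_large": {"doc_03", "doc_17", "doc_42", "doc_91"},
}

overlap = jaccard(failures["model_small"], failures["model_large"])
print(f"failure-set overlap: {overlap:.2f}")  # prints 0.60 for this toy data
```

With real benchmark results, a persistently high overlap across heterogeneous models would motivate exactly the data-centric approach the paper takes.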
What's the solution?
The researchers developed a system called MinerU2.5-Pro that doesn't change the underlying model's architecture at all. Instead, they focused on three areas of data engineering: first, they significantly scaled up the training data while keeping it diverse and representative of real-world documents. Second, they used multiple models to check each other's outputs, using agreement between them to identify difficult or ambiguous examples. Third, they built a render-then-verify pipeline to iteratively refine the annotations for these challenging cases. They also used a progressive training strategy: broad large-scale pre-training first, then fine-tuning on the hardest examples, and finally reinforcement-learning alignment.
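The cross-model checking step can be sketched as a simple consensus rule: if enough models produce the same output for a sample, that output is adopted as a reliable label; otherwise the sample is flagged as hard and routed to the refinement pipeline. This is an illustrative sketch only, and the function name, agreement threshold, and return format are assumptions, not the paper's implementation:

```python
from collections import Counter

def consistency_verify(outputs: list[str], agree_threshold: float = 0.75) -> dict:
    """Cross-model consistency check (illustrative).

    outputs: parsed results for one sample from several heterogeneous models.
    If at least `agree_threshold` of the models agree, the majority output is
    adopted as a pseudo-label; otherwise the sample is marked 'hard' so its
    annotation can be refined separately.
    """
    majority, count = Counter(outputs).most_common(1)[0]
    if count / len(outputs) >= agree_threshold:
        return {"difficulty": "easy", "label": majority}
    return {"difficulty": "hard", "label": None}

# Four hypothetical parser outputs for the same page:
print(consistency_verify(["A", "A", "A", "B"]))  # 3/4 agree -> easy, label "A"
print(consistency_verify(["A", "B", "C", "A"]))  # 2/4 agree -> hard, no label
```

The appeal of this rule is that it needs no ground truth: agreement among independent models serves both as a difficulty signal and as a source of annotations for the easy cases.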
Why it matters?
This work is important because it demonstrates that substantial improvements in document understanding can be achieved simply by improving the training data, without needing to build increasingly large and expensive models. This could make advanced document processing more accessible and efficient, as it’s often cheaper and easier to gather and refine data than to design entirely new model architectures. It also introduces a more reliable way to evaluate document understanding systems by fixing biases in existing benchmarks and creating a more challenging test set.
Abstract
Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy -- large-scale pre-training, hard sample fine-tuning, and GRPO alignment -- sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including models with over 200x more parameters.