MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He

2024-09-30

Summary

This paper introduces MinerU, an open-source tool for extracting content accurately and consistently from many types of documents.

What's the problem?

Existing open-source tools for document content extraction build on techniques such as Optical Character Recognition (OCR), layout detection, and formula recognition, but they often fail to deliver consistent, high-quality results. The main cause is the wide variety of document types and contents, which makes it hard for any one tool to work well on all of them.

What's the solution?

MinerU uses the models from PDF-Extract-Kit to extract content and wraps them in finely tuned preprocessing and postprocessing rules that help ensure the accuracy of the final results. The tool handles different document types, such as academic papers and financial reports, and the paper's experiments report consistently high performance across them.
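
The three-stage shape described above (preprocess, run models, postprocess) can be illustrated with a minimal Python sketch. This is not MinerU's code or API: `Block`, `preprocess`, `fake_model`, `postprocess`, and `extract` are hypothetical names, and `fake_model` is a toy stand-in for the PDF-Extract-Kit models that simply treats each line of a page string as one block.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    kind: str   # hypothetical label, e.g. "title" or "text"
    text: str
    y: float    # vertical position on the page, used for reading order


def preprocess(pages: List[str]) -> List[str]:
    """Assumed preprocessing rule: skip pages with no visible content."""
    return [p for p in pages if p.strip()]


def fake_model(page: str) -> List[Block]:
    """Toy stand-in for layout detection and OCR: one block per line."""
    blocks = []
    for i, line in enumerate(page.splitlines()):
        kind = "title" if line.isupper() else "text"
        blocks.append(Block(kind, line, float(i)))
    return blocks


def postprocess(blocks: List[Block]) -> List[Block]:
    """Assumed postprocessing rules: drop empty blocks, sort top to bottom."""
    kept = [b for b in blocks if b.text.strip()]
    return sorted(kept, key=lambda b: b.y)


def extract(pages: List[str],
            model: Callable[[str], List[Block]] = fake_model) -> List[List[Block]]:
    """Run preprocess, then the model, then postprocess on every kept page."""
    return [postprocess(model(page)) for page in preprocess(pages)]


if __name__ == "__main__":
    for page_blocks in extract(["INTRO\n\nhello", "   "]):
        for block in page_blocks:
            print(block.kind, repr(block.text))
```

Keeping the rules in separate functions around a swappable `model` argument mirrors the idea that the models do the recognition while hand-tuned rules clean up their input and output; the real rules in MinerU are more involved than the two shown here.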

Why does it matter?

MinerU is significant because it provides a reliable way to convert complex documents into structured data formats like Markdown and JSON. This capability is essential for researchers and developers who need accurate document analysis for applications in fields like education, finance, and artificial intelligence.
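
To make the idea of structured output concrete, here is a small Python sketch that serializes a list of extracted blocks to both Markdown and JSON. The block schema (dictionaries with `type` and `text` keys) and the function names `to_markdown` and `to_json` are assumptions chosen for illustration; they are not MinerU's actual output format.

```python
import json
from typing import Dict, List


def to_markdown(blocks: List[Dict[str, str]]) -> str:
    """Render blocks as Markdown: titles become headings, formulas display math."""
    lines = []
    for block in blocks:
        if block["type"] == "title":
            lines.append("# " + block["text"])
        elif block["type"] == "formula":
            lines.append("$$" + block["text"] + "$$")
        else:
            lines.append(block["text"])
    return "\n\n".join(lines)


def to_json(blocks: List[Dict[str, str]]) -> str:
    """Serialize the same blocks as JSON, keeping non-ASCII text readable."""
    return json.dumps(blocks, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    sample = [
        {"type": "title", "text": "Intro"},
        {"type": "text", "text": "Hello"},
    ]
    print(to_markdown(sample))
    print(to_json(sample))
```

Markdown suits human reading and language-model input, while JSON preserves block types for downstream programs.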

Abstract

Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
