OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
2024-12-11
Summary
This paper introduces OmniDocBench, a new benchmark designed to improve how computers read and understand diverse types of documents, especially PDFs.
What's the problem?
As AI technology advances, there is a growing need for effective methods to extract information from various document formats. However, existing methods often struggle with the diversity of documents and lack comprehensive evaluation standards, making it difficult to assess their performance accurately.
What's the solution?
The authors introduce OmniDocBench, which includes a carefully curated dataset of 1,000 diverse documents spanning nine types, such as academic papers and textbooks. Each document is annotated with detailed layout and content information, allowing for thorough evaluation. OmniDocBench provides a flexible framework that can assess different aspects of document parsing, including text extraction and table recognition, enabling researchers to compare AI models fairly and effectively.
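Benchmarks like this typically score text extraction by comparing a model's output against the annotated ground truth with an edit-distance-style metric. The benchmark's actual evaluation code lives in the repository; the sketch below is only an illustration of the general idea, using a hand-rolled normalized Levenshtein distance (the function names and the normalization by the longer string are illustrative choices, not OmniDocBench's API):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """0.0 means a perfect match; 1.0 means completely different."""
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))
```

For example, `normalized_edit_distance("kitten", "sitting")` is 3/7, since three single-character edits separate the strings and the longer one has seven characters. Averaging such per-document scores over a specific document type or layout category is what enables the multi-level, per-attribute comparisons the benchmark is designed for.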
Why it matters?
This research is important because it sets a new standard for evaluating how well AI systems can understand and process documents. By providing a comprehensive benchmark, OmniDocBench helps identify strengths and weaknesses in current technologies, guiding future improvements in document parsing. This advancement is crucial for applications in fields like education, business, and legal services, where accurate document processing is essential.
Abstract
Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.