mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

2024-09-06

Summary

This paper talks about mPLUG-DocOwl2, a new system that understands multi-page documents without needing Optical Character Recognition (OCR), by compressing each high-resolution document image into a small, fixed number of visual tokens.

What's the problem?

Understanding multi-page documents with multimodal large language models is slow and uses a lot of memory, because these models generate thousands of visual tokens for every document page. Processing many pages at once quickly becomes too expensive.

What's the solution?

To solve this issue, the authors developed a High-resolution DocCompressor module that compresses each high-resolution document image into just 324 tokens, guided by low-resolution global features of the page. With this module, they built the DocOwl2 model using a three-stage training framework: single-image pretraining, multi-image continue-pretraining, and multi-task finetuning. This lets the model understand documents well while using far fewer resources and responding faster; a sketch of the compression idea is shown below.
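To make the compression step concrete, here is a minimal sketch of the general idea in PyTorch: low-resolution global features of a page act as cross-attention queries over its thousands of high-resolution patch features, so every page collapses to a fixed token budget. This is not the authors' implementation; the module name, dimensions, and shapes are illustrative assumptions, and only the 324-token budget comes from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionCompressor(nn.Module):
    """Hypothetical sketch of a DocCompressor-style module (shapes assumed)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Low-res global features become queries; high-res features are keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, global_feats: torch.Tensor, highres_feats: torch.Tensor) -> torch.Tensor:
        # global_feats:  (B, 324, D)  low-resolution overview of the page
        # highres_feats: (B, N, D)    thousands of high-resolution patch tokens
        q = self.norm_q(global_feats)
        kv = self.norm_kv(highres_feats)
        compressed, _ = self.attn(q, kv, kv)   # (B, 324, D): fixed budget per page
        return compressed + global_feats       # residual keeps the global layout signal

compressor = CrossAttentionCompressor()
page_global = torch.randn(1, 324, 1024)    # low-res global features (queries)
page_highres = torch.randn(1, 4096, 1024)  # high-res patch features (keys/values)
tokens = compressor(page_global, page_highres)
print(tokens.shape)  # torch.Size([1, 324, 1024]) -> 324 tokens per page
```

However the real module is wired internally, the key design choice is the same: the output size depends only on the number of queries, not on the image resolution, so higher-resolution pages cost no extra tokens downstream.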

Why it matters?

This research is important because it makes analyzing and answering questions about long, complex documents more efficient. By cutting the number of visual tokens per page while maintaining high performance, mPLUG-DocOwl2 can help in fields like education, law, and business, where quickly understanding large amounts of information is crucial.

Abstract

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
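As a rough illustration of the token savings the abstract reports, the arithmetic below uses only the two figures stated there: 324 tokens per page, and under 20% of a comparable single-image MLLM's budget. The 10-page document is hypothetical, and the baseline per-page count is an inferred lower bound, not a number reported in the paper.

```python
# Back-of-the-envelope token budgets derived from the abstract's figures.
docowl2_tokens_per_page = 324
# "less than 20% of the visual tokens" implies the baseline uses at least
# 324 / 0.20 = 1620 tokens per page (an inferred lower bound).
baseline_tokens_per_page = int(docowl2_tokens_per_page / 0.20)

pages = 10  # hypothetical multi-page document
print(f"DocOwl2:  {docowl2_tokens_per_page * pages} visual tokens for {pages} pages")
print(f"Baseline: >= {baseline_tokens_per_page * pages} visual tokens for {pages} pages")
# DocOwl2:  3240 visual tokens for 10 pages
# Baseline: >= 16200 visual tokens for 10 pages
```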