POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou
2025-09-03
Summary
This paper focuses on creating better computer programs that can accurately convert images of documents, like scanned papers or PDFs, into a digital text format that computers can understand, even when those documents have complicated layouts like tables and formulas.
What's the problem?
Training these document conversion programs usually requires a huge amount of accurately labeled data, meaning someone has to manually identify and mark the important content in each document. This is expensive and time-consuming. Automatic labeling is an alternative, but existing models handle complex document formats poorly, which limits how well the resulting conversion program performs in the real world.
What's the solution?
The researchers developed a two-stage process that does not rely on a pre-trained 'teacher' model for guidance. First, they generated a large collection of synthetic but realistic document images and used them to train a model that can recognize key document elements with strong initial performance. Then, they used that model to automatically label real document images, applied quality checks to filter out bad labels, and retrained the model on the verified data. By repeating this cycle, they progressively refined both the model and the labeled data, producing a more accurate conversion model called POINTS-Reader.
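The annotate-filter-retrain cycle described above can be sketched as a simple loop. This is an illustrative toy, not the paper's implementation: `annotate`, `passes_filters`, and `retrain` are hypothetical stand-ins, and model quality is reduced to a single score for demonstration.

```python
def annotate(model_quality, documents):
    """Label real documents with the current model; in this toy,
    label confidence scales with model quality."""
    return [(doc, model_quality * difficulty) for doc, difficulty in documents]

def passes_filters(label_score, threshold=0.5):
    """Stand-in for the quality checks that verify each annotation."""
    return label_score >= threshold

def retrain(model_quality, verified):
    """Retraining on verified data nudges quality up, capped at 1.0."""
    gain = 0.1 * len(verified) / 10
    return min(1.0, model_quality + gain)

def self_improve(model_quality, documents, rounds=3):
    """Iterate annotate -> filter -> retrain, as in stage two."""
    for _ in range(rounds):
        labeled = annotate(model_quality, documents)
        verified = [item for item in labeled if passes_filters(item[1])]
        model_quality = retrain(model_quality, verified)
    return model_quality

# Toy corpus: (document id, how easily the document is labeled, in [0, 1]).
docs = [(f"doc{i}", 0.5 + 0.05 * i) for i in range(10)]
final = self_improve(model_quality=0.7, documents=docs)
```

Note the feedback effect: as the model improves, more annotations pass the filters, which yields more verified training data for the next round.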
Why it matters?
This work is important because it provides a way to build high-quality document conversion models without the massive cost and effort of manual labeling. A more accurate conversion model can unlock information trapped in documents, making it easier to search, analyze, and reuse that data across applications. Their model, POINTS-Reader, also outperforms many existing public and proprietary alternatives.
Abstract
High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
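The abstract mentions "a suite of filtering strategies" to verify annotation quality but does not specify them here. Purely as a hypothetical illustration, rule-based checks of this kind might test structural consistency of extracted tables and guard against degenerate repetitive outputs; the function names below are invented for this sketch.

```python
def table_rows_consistent(markdown):
    """Check that every row of a Markdown table has the same number
    of columns -- a structural sanity check on extracted tables."""
    rows = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if not rows:
        return True  # no table present, nothing to check
    column_counts = {ln.strip().strip("|").count("|") for ln in rows}
    return len(column_counts) == 1

def repetition_free(text, max_repeats=3):
    """Reject outputs where the model degenerates into repeating the
    same line, a common failure mode of autoregressive decoding."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return all(lines.count(ln) <= max_repeats for ln in lines)

def verify(annotation):
    """Keep an annotation only if it passes every filter."""
    return table_rows_consistent(annotation) and repetition_free(annotation)

good = verify("| a | b |\n| - | - |\n| 1 | 2 |")   # consistent 2-column table
bad = verify("| a | b |\n| 1 |")                    # ragged column counts
```

Only annotations that pass all filters would be added to the verified dataset used for the next round of retraining.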