A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang
2025-03-05
Summary
This paper introduces TokenOCR, a visual foundation model designed to better understand images containing text, such as documents, by aligning individual words (tokens) with their locations in the image.
What's the problem?
Current AI models struggle with tasks involving small, dense text in images, such as reading documents or answering questions about them. These models often make mistakes because they are trained without fine-grained, token-level supervision that ties each word to its location in the image.
What's the solution?
The researchers created TokenOCR, which is pretrained on a specially designed dataset called TokenIT containing 20 million images and 1.8 billion token-mask pairs (each word paired with a pixel mask marking where it appears). TokenOCR aligns visual features with text at the token level, improving how the AI understands and processes text in images. They also developed TokenVL, a document-level multimodal model that uses TokenOCR as its image encoder to handle complex tasks like answering questions about documents.
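The paper does not spell out the alignment mechanics here, but the core idea of token-level alignment can be sketched as follows: pool the visual features under each word's pixel mask and compare the pooled vector with that word's language embedding. This is a minimal illustrative sketch, not the authors' implementation; the function names, the NumPy formulation, and the cosine-similarity objective are all assumptions.

```python
import numpy as np

def masked_pool(feature_map, mask):
    """Average-pool visual features over one token's mask region.

    feature_map: (H, W, C) array of per-pixel visual features.
    mask: (H, W) array, nonzero where the token appears.
    Returns a single (C,) feature vector for that token region.
    """
    weights = mask / (mask.sum() + 1e-8)
    return np.einsum("hwc,hw->c", feature_map, weights)

def token_alignment_scores(feature_map, masks, token_embeddings):
    """Cosine similarity between each pooled token-region feature
    and the corresponding token's text embedding. Training would
    push these scores toward 1 for matching pairs (hypothetical
    objective, for illustration only)."""
    scores = []
    for mask, emb in zip(masks, token_embeddings):
        v = masked_pool(feature_map, mask)
        v = v / (np.linalg.norm(v) + 1e-8)
        t = emb / (np.linalg.norm(emb) + 1e-8)
        scores.append(float(v @ t))
    return scores

# Toy example: a 4x4 feature map whose top-left patch points along
# the same direction as the token embedding, so alignment is ~1.
fm = np.zeros((4, 4, 3))
fm[:2, :2] = [1.0, 0.0, 0.0]
word_mask = np.zeros((4, 4))
word_mask[:2, :2] = 1.0
scores = token_alignment_scores(fm, [word_mask], [np.array([1.0, 0.0, 0.0])])
```

The masked pooling step is what makes the supervision token-level rather than image-level: each of the 1.8 billion token-mask pairs in TokenIT would contribute its own local alignment signal instead of one global image-caption signal.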
Why it matters?
This matters because it improves how accurately AI can read and understand documents, benefiting tasks like scanning receipts, analyzing reports, or answering questions about forms. This could lead to better tools for businesses, education, and everyday use.
Abstract
In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at https://token-family.github.io/TokenOCR_project.