General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

2024-09-04

Summary

This paper introduces the General OCR Theory and a model called GOT, which aims to recognize a broad range of visual information, from plain text to formulas, charts, and sheet music, with a single end-to-end system.

What's the problem?

Traditional Optical Character Recognition (OCR) systems, which the authors call OCR-1.0, struggle to keep up with the growing demand for intelligent processing of varied visual data such as text, math formulas, tables, and charts. These older systems often cannot handle such diverse formats effectively.

What's the solution?

The authors propose the General OCR Theory and build GOT, a 580M-parameter end-to-end model designed to handle a wide range of visual "characters". GOT pairs a high-compression encoder, which processes the image, with a long-context decoder, which generates the output. It accepts both scene-style and document-style images, either as cropped slices or as whole pages, and produces plain or formatted output (Markdown, TikZ, SMILES, or Kern) depending on the prompt. It also supports interactive recognition of a specific region, selected by coordinates or by color.
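The options described above (an output format plus an optional region given by coordinates or by color) can be pictured as a small request structure. The following is a minimal sketch under stated assumptions, not GOT's actual interface: the names `OcrRequest` and `build_prompt`, the prompt wording, and the list of colors are all hypothetical and chosen only for illustration. Only the format names come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Output formats named in the abstract; "plain" stands for unformatted text.
FORMATS = ("plain", "markdown", "tikz", "smiles", "kern")
# Hypothetical set of marker colors; the summary does not list them.
COLORS = ("red", "green", "blue")


@dataclass(frozen=True)
class OcrRequest:
    """Hypothetical description of one OCR call (not GOT's real API)."""
    image_path: str
    output_format: str = "plain"
    box: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2) in pixels
    color: Optional[str] = None

    def __post_init__(self) -> None:
        if self.output_format not in FORMATS:
            raise ValueError(f"unknown format: {self.output_format}")
        if self.box is not None and self.color is not None:
            raise ValueError("select a region by box or by color, not both")
        if self.box is not None:
            x1, y1, x2, y2 = self.box
            if not (0 <= x1 < x2 and 0 <= y1 < y2):
                raise ValueError(f"invalid box: {self.box}")
        if self.color is not None and self.color not in COLORS:
            raise ValueError(f"unknown color: {self.color}")


def build_prompt(req: OcrRequest) -> str:
    """Turn a request into an illustrative text prompt (wording is assumed)."""
    parts = []
    if req.box is not None:
        parts.append("[%d,%d,%d,%d]" % req.box)  # region given by coordinates
    elif req.color is not None:
        parts.append(f"[{req.color}]")  # region given by a colored frame
    if req.output_format == "plain":
        parts.append("OCR:")
    else:
        parts.append(f"OCR with format {req.output_format}:")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(OcrRequest("page.png")))
    print(build_prompt(OcrRequest("page.png", "markdown", box=(10, 20, 300, 400))))
```

Keeping the region and the format in one validated object makes the two selection modes mutually exclusive by construction, which mirrors the "coordinates or colors" wording of the abstract.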

Why does it matter?

A single unified model that can process many types of visual data reduces the need for a separately customized system for each format. That could benefit applications in fields like education, data analysis, and digital archiving, making it easier for people to access and understand information that is locked inside images.

Abstract

Traditional OCR systems (OCR-1.0) are increasingly unable to meet users' needs due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.
