OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng

2026-01-30

Summary

This paper introduces OCRVerse, a new system designed to read information from images, going beyond just recognizing text to also understand visual elements like charts and graphs.

What's the problem?

Current Optical Character Recognition (OCR) technology reads text from sources such as scanned documents well, but it struggles with images that pack in a lot of visual information, like complex charts, web pages, or scientific plots. These visually dense images are common online and contain valuable data, yet existing OCR systems are not equipped to handle them effectively. Existing approaches treat OCR as either a text problem or a vision problem, not both at once.

What's the solution?

The researchers created OCRVerse, a single system that handles both text-based and visually rich images. They first built a large dataset covering both types: text-centric documents such as newspapers, magazines, and books, and vision-centric images such as charts, web pages, and scientific plots. They then trained the model in two stages: supervised fine-tuning (SFT) on the mixed data to establish initial knowledge of every domain, followed by reinforcement learning (RL) with a reward tailored to each domain. The RL stage is key because it lets the system learn what a 'good' answer looks like for each type of image; a chart needs different information extracted than a book page, for example.
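
To make the idea of a different reward for each type of image concrete, here is a minimal Python sketch. It is not the paper's implementation: the domain names, both scoring functions, and the `compute_reward` helper are assumptions chosen for illustration, since this summary does not specify how the actual rewards are defined.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def text_reward(prediction: str, reference: str) -> float:
    """Hypothetical reward for text-centric pages: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, prediction, reference).ratio()


def chart_reward(prediction: str, reference: str) -> float:
    """Hypothetical reward for charts: fraction of reference data rows found in the prediction."""
    ref_rows = {line.strip() for line in reference.splitlines() if line.strip()}
    if not ref_rows:
        return 0.0
    pred_rows = {line.strip() for line in prediction.splitlines() if line.strip()}
    return len(ref_rows & pred_rows) / len(ref_rows)


# Assumed registry mapping each domain to its own scoring rule.
REWARDS: Dict[str, Callable[[str, str], float]] = {
    "document": text_reward,
    "chart": chart_reward,
}


def compute_reward(domain: str, prediction: str, reference: str) -> float:
    """Score a model output with the reward registered for its domain."""
    if domain not in REWARDS:
        raise ValueError(f"no reward registered for domain {domain!r}")
    return REWARDS[domain](prediction, reference)


if __name__ == "__main__":
    # A book page is scored on how closely the transcription matches.
    print(compute_reward("document", "Hello world", "Hello, world"))
    # A chart is scored on how many data rows were recovered.
    print(compute_reward("chart", "a,1\nb,2", "a,1\nb,2\nc,3"))
```

Keeping the rewards in a registry means a new domain can be added by writing one scoring function, without changing how the other domains are scored.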

Why it matters?

This work is important because it makes it easier to automatically extract useful information from the vast number of images available online. This has applications in areas like understanding data presented in charts, analyzing web pages, and even interpreting scientific research. By creating a system that can handle both text and visuals, OCRVerse opens up possibilities for more powerful and versatile image understanding.

Abstract

The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in an end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we conduct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, since different domains require different output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.
