Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Peng Wang, Zhaohai Li, Jun Tang, Humen Zhong, Fei Huang, Zhibo Yang, Cong Yao
2024-08-28

Summary
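This paper introduces Platypus, a single model for reading text in images, whether natural scenes or documents, across forms such as scene text, handwriting, and mathematical expressions. The authors report that it outperforms previous methods on standard benchmarks.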
What's the problem?
Reading text from images (like photos or scanned documents) is a long-standing challenge. Existing specialist models each handle one sub-task, such as scene text, handwritten text, or mathematical expressions, and do not generalize well to the others. Generalist models such as GPT-4V can read many kinds of text, but with limited accuracy and low efficiency. As a result, no single system has read all types of text both accurately and efficiently.
What's the solution?
The authors propose Platypus, a generalized specialist model that combines the strengths of both specialist and generalist models. It recognizes text of various forms with a single unified architecture while keeping accuracy and efficiency high. They also built a text reading dataset called Worms, whose images are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks show that Platypus performs better than existing methods.
Why does it matter?
This research is important because it enables better reading and understanding of text in various formats, which can be useful in many applications like digital archiving, automatic translation, and improving accessibility for people with disabilities. By advancing how machines read text, we can enhance technology in education, business, and everyday life.
Abstract
Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models have been developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition and mathematical expression recognition). However, such specialist models usually cannot effectively generalize across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on tremendous amounts of data in a unified way, have shown enormous potential in reading text in various scenarios, but with the drawbacks of limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: being able to recognize text of various forms with a single unified architecture, while achieving excellent accuracy and high efficiency. To better exploit the advantage of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/Platypus.