Qalam : A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Gagan Bhatia, El Moatez Billah Nagoudi, Fakhraddin Alwajih, Muhammad Abdul-Mageed

2024-07-22

Qalam : A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Summary

This paper discusses Qalam, a new model designed for recognizing Arabic text and handwriting. It uses advanced technology to improve how machines understand the unique features of the Arabic script, achieving high accuracy in its tasks.

What's the problem?

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) are challenging because the Arabic script is cursive and context-sensitive. This means that letters can change shape depending on their position in a word, and there are many different styles of writing. Additionally, there is a lack of high-quality datasets for training models, which makes it difficult to develop effective recognition systems.

What's the solution?

The authors created Qalam, which combines a SwinV2 encoder and a RoBERTa decoder to effectively recognize Arabic text and handwriting. They trained Qalam on a large and diverse dataset that includes over 4.5 million images from Arabic manuscripts and synthetic data with 60,000 image-text pairs. This extensive training helps Qalam achieve impressive results, with a Word Error Rate (WER) of only 0.80% for handwriting recognition and 1.18% for OCR tasks. The model also performs well with high-resolution inputs and handles Arabic diacritics effectively.

Why it matters?

This research is significant because it addresses the specific challenges of recognizing Arabic text, providing a powerful tool for improving digital access to Arabic documents. By enhancing OCR and HWR capabilities, Qalam can help in various applications such as archiving historical texts, improving educational resources, and facilitating communication in Arabic-speaking regions.

Abstract

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam's potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.

View Paper