KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Ahmed, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan
2025-02-24
Summary
This paper introduces KITAB-Bench, a benchmark for measuring how well AI systems read and understand Arabic text across different types of documents.
What's the problem?
Arabic text is hard for AI to process because of its unique features, like cursive writing, right-to-left direction, and complex fonts. While English OCR (Optical Character Recognition) systems are advanced, Arabic OCR still struggles with accuracy, especially for handwritten documents, tables, and charts. This makes it difficult to extract reliable information from Arabic documents.
What's the solution?
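The researchers created KITAB-Bench, a benchmark with 8,809 samples from 9 major domains and 36 sub-domains, covering diverse document types such as handwritten text, structured tables, and 21 chart types for business intelligence. They compared vision-language models such as GPT-4, Gemini, and Qwen against traditional OCR tools such as EasyOCR, PaddleOCR, and Surya, and found that the vision-language models did better by an average of 60% in Character Error Rate (CER). The models still had clear weaknesses: on PDF-to-Markdown conversion the best model, Gemini-2.0-Flash, reached only 65% accuracy, and errors persisted with complex fonts, numerals, word elongation, and table structure detection.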
Why does it matter?
This matters because it helps identify where Arabic OCR systems need improvement and provides a way to measure progress. By addressing these challenges, KITAB-Bench can help create better tools for processing Arabic documents, which is important for education, business, and preserving cultural heritage.
Abstract
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
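Since the headline comparison is reported in Character Error Rate, a small sketch of how CER is conventionally computed may help. It assumes the standard definition (character-level Levenshtein distance divided by the reference length) and applies no text normalization; the abstract does not describe the benchmark's exact evaluation pipeline, so the function names `levenshtein` and `cer` are illustrative and not taken from the paper.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over reference length.

    Assumption: no normalization (diacritics, elongation marks, or
    numeral forms) is applied before comparison.
    """
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Reference word has 4 characters; the OCR output dropped one of them.
print(cer("كتاب", "كتب"))  # 0.25
```

Lower is better, and CER can exceed 1.0 when the OCR output is much longer than the reference. Python strings are compared by code point here, so an Arabic letter with a combining diacritic counts as two characters; a real evaluation would need to decide how to normalize such cases.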