AIN: The Arabic INclusive Large Multimodal Model
Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan
2025-02-04
Summary
This paper introduces AIN, a new AI model designed to work in both Arabic and English, combining language and visual understanding. It aims to serve Arabic speakers by providing advanced tools for tasks such as analyzing images, understanding documents, and solving problems across many fields.
What's the problem?
While AI models have made big progress in English and Chinese, development of Arabic multimodal models, which combine text and visuals, has lagged behind. Existing Arabic models often focus on a narrow set of tasks and don't perform well across different areas, leaving a gap in AI tools for Arabic speakers.
What's the solution?
The researchers created AIN, a bilingual model trained on 3.6 million high-quality Arabic-English data samples. It uses advanced techniques to handle tasks like document understanding, medical imaging, agriculture, and video analysis. AIN was tested on the CAMEL-Bench benchmark across eight domains and 38 sub-domains and showed strong performance, outperforming larger models like GPT-4o by an absolute 3.4% in average accuracy.
Why it matters?
This research is important because it provides Arabic speakers with powerful AI tools that can handle both language and visual tasks effectively. AIN bridges a major gap in AI development for Arabic and sets new standards for performance in multimodal models. It has applications in education, healthcare, agriculture, and more, making advanced technology accessible to a wider audience.
Abstract
Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, an English-Arabic bilingual LMM designed to excel across diverse domains, leveraging a carefully constructed set of 3.6 million high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark, comprising 38 sub-domains that include multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land-use understanding, AIN demonstrates strong performance, with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.