
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

Sara Ghaboura, Ahmed Heakl, Omkar Thawakar, Ali Alharthi, Ines Riahi, Abduljalil Saif, Jorma Laaksonen, Fahad S. Khan, Salman Khan, Rao M. Anwer

2024-10-25

Summary

This paper introduces CAMEL-Bench, a new evaluation benchmark for large multimodal models (LMMs) in Arabic, designed to measure how well these models handle a wide range of tasks and to guide improvements in their performance.

What's the problem?

Most existing benchmarks for evaluating multimodal models are primarily focused on English, which leaves a significant gap for Arabic speakers. With over 400 million Arabic speakers worldwide, there is a need for comprehensive evaluation tools that can assess how well LMMs understand and process Arabic in different contexts.

What's the solution?

The authors developed CAMEL-Bench, which covers a diverse set of tasks across eight domains and 38 sub-domains, such as multi-image understanding, video understanding, handwritten document understanding, and medical imaging. The benchmark consists of 29,036 questions, filtered from a larger pool and manually verified by native Arabic speakers to ensure quality and relevance. By evaluating both open-source and closed-source models, including the GPT-4 series, the benchmark provides insight into the current capabilities and limitations of LMMs on Arabic tasks.
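To make the setup concrete, here is a minimal sketch of how a model might be scored per domain on a benchmark organized this way. The file layout, field names, and exact-match scoring below are assumptions made for illustration; they are not the benchmark's actual released data format or evaluation scripts.

import json
from collections import defaultdict

def evaluate(model, questions_path="camel_bench_questions.json"):
    """Score a model per domain on a CAMEL-Bench-style question file.

    Assumes a hypothetical JSON layout (one record per question with
    'domain', 'image', 'question', and 'answer' fields) and exact-match
    scoring. The real benchmark's format and metrics may differ.
    """
    with open(questions_path, encoding="utf-8") as f:
        questions = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        # model.answer() stands in for whatever inference call the LMM exposes.
        prediction = model.answer(image=q["image"], question=q["question"])
        total[q["domain"]] += 1
        if prediction.strip() == q["answer"].strip():
            correct[q["domain"]] += 1

    # Per-domain accuracy, e.g. {"video_understanding": 0.70, ...}
    return {domain: correct[domain] / total[domain] for domain in total}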

Why it matters?

This research is important because it fills a critical gap in the evaluation of Arabic language models. By providing a robust benchmark like CAMEL-Bench, researchers can better understand how these models perform on various tasks, leading to improvements in AI systems that serve Arabic-speaking populations. This advancement is essential as AI becomes increasingly integrated into everyday applications and services.

Abstract

Recent years have witnessed a significant interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, where the quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark and evaluation scripts are open-sourced.
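The 62% overall score reported for GPT-4o aggregates performance across the eight domains. As a worked illustration only, assuming an unweighted macro-average over per-domain accuracies (the paper's exact aggregation may differ), the overall score could be computed like this:

def overall_score(per_domain_accuracy):
    """Unweighted macro-average over domains; an assumed aggregation,
    not necessarily the one used in CAMEL-Bench."""
    return sum(per_domain_accuracy.values()) / len(per_domain_accuracy)

# Illustrative, made-up accuracies for two of the eight domains:
print(overall_score({"medical_imaging": 0.55, "video_understanding": 0.69}))  # 0.62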