AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem
2025-11-19
Summary
This paper introduces AraLingBench, a new way to test how well large language models (LLMs) actually *understand* Arabic, not just how well they can process it.
What's the problem?
Currently, it's hard to tell whether LLMs that perform well on Arabic tasks truly understand the language or are just memorizing patterns and information. Existing tests often focus on general knowledge rather than the core building blocks of the language, such as grammar and sentence structure. As a result, models can score well without actually grasping how Arabic works.
What's the solution?
The researchers created AraLingBench, a set of 150 expert-designed multiple-choice questions that test five key areas of Arabic linguistics: grammar, morphology (how words are formed), spelling, reading comprehension, and syntax (how words are arranged in sentences). They then evaluated 35 Arabic and bilingual LLMs on this benchmark to see how well they handle these fundamental skills.
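To make the evaluation setup concrete, here is a minimal, hypothetical sketch of how per-category accuracy on a multiple-choice benchmark like this could be computed. The record format, field names, and the `ask_model` stub are illustrative assumptions, not the authors' actual harness; the real evaluation code is available on GitHub, as noted in the abstract.

```python
# Hypothetical sketch: scoring an LLM on multiple-choice questions grouped by
# linguistic category. The data format and ask_model() are assumptions for
# illustration, not the AraLingBench evaluation harness.
from collections import defaultdict

# Illustrative question records, one per multiple-choice item.
questions = [
    {"category": "grammar", "prompt": "...", "choices": ["A", "B", "C", "D"], "answer": "B"},
    {"category": "syntax", "prompt": "...", "choices": ["A", "B", "C", "D"], "answer": "D"},
]

def ask_model(prompt: str, choices: list[str]) -> str:
    """Placeholder for the LLM under evaluation.

    A real harness would format the prompt with the answer options, query the
    model, and parse the chosen option letter from its reply.
    """
    return "A"  # stub answer

def per_category_accuracy(items: list[dict]) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for q in items:
        prediction = ask_model(q["prompt"], q["choices"])
        total[q["category"]] += 1
        if prediction == q["answer"]:
            correct[q["category"]] += 1
    # Accuracy per category, e.g. {"grammar": 0.5, "syntax": 0.0}
    return {cat: correct[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    print(per_category_accuracy(questions))
```

Reporting accuracy per category rather than a single aggregate score is what lets a benchmark like this expose, for example, strong spelling performance alongside weak grammatical and syntactic reasoning.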
Why does it matter?
AraLingBench is important because it provides a more accurate way to measure the linguistic abilities of LLMs in Arabic. It helps identify weaknesses in current models and gives developers a diagnostic tool for building Arabic language models with genuine linguistic understanding. It moves beyond checking whether a model *knows* things to checking whether it *understands* how the language is constructed.
Abstract
We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.