UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
Omer Nacar
2025-09-02
Summary
This paper evaluates how well the ALLaM-34B large language model handles Arabic, covering not only Modern Standard Arabic but also regional dialects and tasks such as reasoning and safety.
What's the problem?
Most large language models are trained primarily on English text, so they often miss the complexities of Arabic, including its many dialects and cultural context. This creates a need for models designed specifically for Arabic that perform well in real-world applications.
What's the solution?
Researchers tested ALLaM-34B, a capable Arabic language model, with a wide range of prompts covering Modern Standard Arabic, five regional dialects, code-switching (mixing languages within a sentence), factual knowledge, arithmetic, temporal reasoning, creative text generation, and safety. Three frontier language models (GPT-5, Gemini 2.5 Pro, and Claude Sonnet-4) then judged the quality of ALLaM-34B's responses, scoring each on a scale of 1 to 5.
Why it matters?
The results show that ALLaM-34B is a strong and reliable Arabic language model, performing particularly well at text generation, code-switching, and Modern Standard Arabic. This matters because it means there is now a readily available tool that can accurately process Arabic in its many forms, supporting a variety of applications and narrowing the gap in AI technology for Arabic speakers.
Abstract
Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, which developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning Modern Standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
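The category-level aggregation described in the abstract (means with 95% confidence intervals over judge scores) can be sketched in a few lines. This is a minimal illustration, not the paper's actual analysis code: the helper name `mean_ci95`, the normal-approximation critical value of 1.96, and the example scores are all assumptions.

```python
import math
from statistics import mean, stdev

def mean_ci95(scores):
    """Mean and 95% CI half-width over a list of judge scores.

    Uses a normal approximation (z = 1.96); the paper does not state
    which interval construction was used, so this is an assumption.
    """
    m = mean(scores)
    half = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return m, half

# Hypothetical 1-5 judge scores for one category:
# 5 runs x 3 judges = 15 scores per category.
generation_scores = [5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5]
m, h = mean_ci95(generation_scores)
print(f"generation: {m:.2f} +/- {h:.2f}")
```

With 23 prompts and 5 runs each, the same helper would be applied per category (MSA, dialects, reasoning, safety, and so on) to produce the per-category means reported in the abstract.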