A Comprehensive Arabic Multimodal Reasoning Benchmark (ARB)
The paper introduces a benchmark that addresses a significant gap in the evaluation of large multimodal models (LMMs), particularly for Arabic, a language rich in linguistic nuance and cultural context and spoken by over 400 million people worldwide. Unlike most existing benchmarks, which cater predominantly to English, ARB, the Comprehensive Arabic Multimodal Reasoning Benchmark, is designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. The benchmark spans 11 diverse domains, including visual reasoning, document understanding, optical character recognition (OCR), scientific analysis, and cultural interpretation. The dataset consists of 1,356 multimodal samples, each pairing a visual input with an Arabic prompt and a detailed reasoning trace, amounting to 5,119 human-curated reasoning steps in total.
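To make the sample structure concrete, the following is a minimal sketch of how one such sample might be represented; the field names and values are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one ARB sample; field names are illustrative
# assumptions, not the benchmark's published format.
@dataclass
class ARBSample:
    image_path: str      # the visual input (chart, document, photo, ...)
    question_ar: str     # the Arabic prompt
    domain: str          # one of the 11 domains, e.g. "ocr" or "visual_reasoning"
    reasoning_steps: list[str] = field(default_factory=list)  # human-curated steps
    final_answer_ar: str = ""

sample = ARBSample(
    image_path="images/chart_042.png",
    question_ar="ما هو الاتجاه العام في الرسم البياني؟",  # "What is the overall trend in the chart?"
    domain="visual_reasoning",
    reasoning_steps=[
        "الخطوة ١: تحديد محاور الرسم البياني.",  # Step 1: identify the chart axes.
        "الخطوة ٢: مقارنة القيم عبر السنوات.",   # Step 2: compare values across years.
    ],
    final_answer_ar="الاتجاه العام تصاعدي.",  # "The overall trend is upward."
)
print(len(sample.reasoning_steps))  # 2
```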
In constructing ARB, the authors employed a systematic pipeline that combined sourcing from curated datasets, synthetic generation, human-in-the-loop refinement, and validation by native speakers. ARB thus provides a structured method for diagnosing multimodal reasoning in underrepresented languages, an advance for inclusive AI that is reinforced by the public release of the benchmark, rubric, and evaluation suite to support future research and reproducibility.
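Since the benchmark is released publicly, a typical access pattern might look like the sketch below; the dataset identifier and the per-sample "domain" field are placeholder assumptions, since the actual release location and schema are not specified here.

```python
from collections import Counter

from datasets import load_dataset  # pip install datasets

# Placeholder dataset ID; substitute the identifier from the authors' release.
arb = load_dataset("placeholder-org/ARB", split="test")

# Inspect coverage across the 11 domains (assumes a "domain" field per sample).
domain_counts = Counter(example["domain"] for example in arb)
for domain, n in domain_counts.most_common():
    print(f"{domain}: {n} samples")
```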
Evaluations of 12 state-of-the-art LMMs, both open- and closed-source, revealed persistent challenges in the coherence, faithfulness, and cultural grounding of reasoning in Arabic, underscoring the need for ARB. Closed-source models generally outperformed their open-source counterparts, and models such as GPT-4.1 and GPT-4o-mini produced highly coherent reasoning yet reached correct final answers only moderately often. This consistent gap between generating coherent reasoning steps and deriving accurate conclusions underscores the importance of step-level evaluation rather than relying solely on final-answer correctness.
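The distinction between step-level quality and final-answer correctness can be made concrete with a small scoring sketch. The functions below are simplified stand-ins, not the paper's actual rubric: per-step scores stand in for rubric-based judge ratings, and final-answer accuracy uses plain exact match.

```python
def final_answer_accuracy(predictions, references):
    """Fraction of samples whose final answer exactly matches the reference."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

def mean_step_score(step_scores_per_sample):
    """Average per-step score across all samples.

    `step_scores_per_sample` is a list of lists: one list of step-level
    scores (e.g., 0-1 coherence ratings from a rubric-based judge) per sample.
    """
    flat = [score for steps in step_scores_per_sample for score in steps]
    return sum(flat) / len(flat)

# A model can score high on step coherence yet low on final accuracy:
preds = ["الاتجاه تصاعدي", "لا أعرف"]               # model outputs ("upward trend", "I don't know")
refs  = ["الاتجاه تصاعدي", "القيمة القصوى في 2020"]  # references ("upward trend", "peak in 2020")
steps = [[1.0, 1.0, 0.8], [0.9, 0.7]]                # hypothetical judge scores per step

print(final_answer_accuracy(preds, refs))  # 0.5
print(mean_step_score(steps))              # 0.88
```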
The implications of this research are manifold. Practically, ARB offers a robust framework for assessing Arabic multimodal reasoning, providing essential tools for developing culturally aware and interpretable AI systems. Theoretically, it deepens our understanding of how linguistic and cultural context shapes model performance, paving the way for models that can adapt to diverse linguistic environments. More speculatively, this work may inspire reasoning benchmarks for other underrepresented languages, contributing to more inclusive and transparent AI systems globally.
Overall, while Arabic reasoning capabilities remain less developed than their English counterparts, ARB works to close this gap, reaffirming the importance of both linguistic and cultural dimensions in evaluating AI systems. ARB thereby sets a precedent for future benchmarks and models by foregrounding the nuances of reasoning in diverse linguistic contexts, which is crucial for the evolution of AI in multilingual settings.
The authors anticipate that ARB will advance Arabic-centric AI and inspire innovation across the field, contributing to a more equitable and culturally sensitive deployment of the technology. Ultimately, ARB represents a substantial step toward AI systems that not only excel technically but also resonate authentically with varied human experiences and perspectives.