CAMEL-Bench: A Comprehensive Arabic LMM Benchmark (2410.18976v1)
Abstract: Recent years have witnessed significant interest in developing large multimodal models (LMMs) capable of performing a variety of visual reasoning and understanding tasks, which in turn has led to the introduction of multiple LMM benchmarks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language, representing a population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land-use understanding, to evaluate generalizability across a broad range of scenarios. CAMEL-Bench contains 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models; even the closed-source GPT-4o achieves an overall score of only 62%. Our benchmark and evaluation scripts are open-sourced.
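The abstract reports a single overall score per model (e.g., 62% for GPT-4o), aggregated across domains. As a rough illustration of how such a score can be computed for multiple-choice items, the sketch below tallies per-domain and overall accuracy. The JSONL layout, the field names (`question`, `image`, `choices`, `answer`, `domain`), and the `ask_model` wrapper are hypothetical stand-ins for this sketch; the authors' open-sourced evaluation scripts are the authoritative reference.

```python
# Hypothetical sketch of an MCQ evaluation loop in the spirit of CAMEL-Bench.
# File format, field names, and the `ask_model` callable are assumptions,
# not the authors' actual scripts.
import json
from collections import defaultdict

def evaluate(samples_path, ask_model):
    """Return per-domain and overall accuracy for a model on MCQ items.

    `ask_model(question, image_path, choices)` wraps the LMM under test
    and is expected to return one of the given choice labels.
    """
    correct, total = defaultdict(int), defaultdict(int)
    with open(samples_path, encoding="utf-8") as f:
        for line in f:  # one JSON object per line (JSONL, assumed)
            item = json.loads(line)
            pred = ask_model(item["question"], item["image"], item["choices"])
            total[item["domain"]] += 1
            if pred == item["answer"]:
                correct[item["domain"]] += 1
    scores = {d: correct[d] / total[d] for d in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores
```

An open-ended sub-task (e.g., OCR or captioning) would swap the exact-match check for a softer metric such as edit distance, but the per-domain aggregation stays the same.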
- ArabicaQA: A comprehensive dataset for Arabic question answering. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2049–2059, 2024.
- Google AI. Gemini: A family of highly capable multimodal models, 2023.
- Husni A. Al-Muhtaseb. Arabic text recognition of printed manuscripts: Efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing. PhD thesis, University of Bradford, 2010.
- Peacock: A family of Arabic multimodal large language models and benchmarks. arXiv preprint arXiv:2403.01031, 2024.
- Anthropic. Claude, 2024. AI assistant.
- Arar Tawil. Arabic Food 101. https://www.kaggle.com/datasets/araraltawil/arabic-food-101, 2023.
- AgroGPT: Efficient agricultural vision-language model with expert tuning. arXiv preprint, 2024.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- VizWiz: Nearly real-time answers to visual questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, pages 333–342, 2010.
- InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- Unifying vision-and-language tasks via text generation. In Proceedings of the International Conference on Machine Learning (ICML), 2021.
- EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models. arXiv preprint arXiv:2403.10378, 2024.
- ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
- MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- BLINK: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6904–6913, 2017.
- Arabic scene text recognition in the deep learning era: Analysis on a novel dataset. IEEE Access, 2021.
- AceGPT: Localizing large language models in Arabic, 2023.
- GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6700–6709, 2019.
- ArabicMMLU: Assessing massive multitask language understanding in Arabic. arXiv preprint arXiv:2402.12840, 2024.
- GeoChat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- SEED-Bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning (ICML), pages 12888–12900. PMLR, 2022.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning (ICML), pages 19730–19742. PMLR, 2023.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
- Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
- Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
- MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision (ECCV), pages 216–233. Springer, 2025.
- IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. In The 35th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2021.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
- MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- KHATT: An open Arabic offline handwritten text database. Pattern Recognition, 47(3):1096–1112, 2014.
- ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, 2022. Association for Computational Linguistics.
- Mohammad Alfaifi. Arab-celeb-dataset. https://github.com/mohammad-alfaifi/arab-celeb-dataset. GitHub repository, accessed: 2024-10-15.
- Historical Arabic handwritten text recognition dataset, 2024.
- OpenAI. GPT-4o model. https://openai.com, 2024. Accessed: 2024-10-14.
- Teaching CLIP to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3170–3180, 2023.
- PATD: Printed Arabic text database for recognition systems. http://www.inf.u-szeged.hu/patd/.
- Pexels: The best free stock photos, royalty-free images and videos shared by creators. https://www.pexels.com/.
- Pinterest. Pinterest platform. https://www.pinterest.com/.
- Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
- Palo: A large multilingual multimodal language model. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
- Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532, 1998.
- CVQA: Culturally-diverse multilingual visual question answering benchmark. arXiv preprint arXiv:2406.05967, 2024.
- BCE-Arabic-v1 dataset: Towards interpreting Arabic document images for people with visual impairments. In Proceedings of the 9th ACM International Conference on Pervasive Technologies Related to Assistive Environments, pages 1–8, 2016.
- Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149, 2023.
- MTVQA: Benchmarking multilingual text-centric visual question answering, 2024.
- MuirBench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024.
- Wikipedia, the free encyclopedia. https://www.wikipedia.org/.
- Self-organized text detection with minimal post-processing via border learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
- xAI. Grok-1.5 Vision preview. https://x.ai/blog/grok-1.5v, 2024.
- Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
- End-to-end multimodal fact-checking and explanation generation: A challenging dataset and models. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2733–2743, 2023.
- MMT-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI. In Proceedings of the International Conference on Machine Learning (ICML), 2024.
- YouTube. https://www.youtube.com/, 2024. Accessed: 2024-10-01.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Pangea: A fully open multilingual multimodal LLM for 39 languages. arXiv preprint arXiv:2410.16153, 2024.
- MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024.