EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models (2403.10378v1)
Abstract: We introduce EXAMS-V, a new challenging multi-discipline, multimodal, multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies such as religion, fine arts, and business. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from countries with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results show that the dataset is challenging even for advanced vision-text models such as GPT-4V and Gemini, underscoring its inherent complexity and its significance as a future benchmark.
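Because EXAMS-V poses every item as a multiple-choice exam question with visual content, an evaluation run essentially reduces to prompting a vision-language model per item and scoring the extracted option letter, broken down by language. The sketch below illustrates such a loop; the record fields, the `query_model` stub, and the answer-extraction heuristic are illustrative assumptions for this sketch, not the paper's released data schema or evaluation code.

```python
"""Minimal sketch of a multiple-choice evaluation loop for an EXAMS-V-style
benchmark. Field names and query_model() are assumptions, not the paper's
actual schema or API."""

import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ExamItem:
    image_path: str     # image of the question (text plus figures/tables/equations)
    language: str       # e.g. "bg", "zh", "ar"
    choices: list[str]  # option letters, e.g. ["A", "B", "C", "D"]
    answer: str         # gold option letter


def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a vision-language model (e.g. GPT-4V or
    Gemini via their APIs). Returns the model's raw text response."""
    raise NotImplementedError


def extract_choice(response: str, choices: list[str]) -> str | None:
    """Return the first option letter that appears in the model response."""
    match = re.search(rf"\b({'|'.join(map(re.escape, choices))})\b", response)
    return match.group(1) if match else None


def evaluate(items: list[ExamItem]) -> dict[str, float]:
    """Per-language accuracy; a wrong or unparseable answer counts as incorrect."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prompt = ("Answer the multiple-choice question shown in the image. "
                  f"Reply with a single option letter from {item.choices}.")
        predicted = extract_choice(query_model(item.image_path, prompt), item.choices)
        total[item.language] += 1
        correct[item.language] += int(predicted == item.answer)
    return {lang: correct[lang] / total[lang] for lang in total}
```

In practice, multilingual model responses do not always contain a clean option letter, so a more robust answer-extraction step (or constrained decoding) is usually needed before accuracies across languages can be compared fairly.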
- Flamingo: A visual language model for few-shot learning.
- Gemini: A family of highly capable multimodal models.
- VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- VisualGPT: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18030–18040.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311.
- VizWiz grand challenge: Answering visual questions from blind people.
- EXAMS: A multi-subject high school examinations dataset for cross-lingual and multilingual question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5427–5444, Online. Association for Computational Linguistics.
- Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- GQA: A new dataset for real-world visual reasoning and compositional question answering (Hudson and Manning, 2019).
- Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12359–12374, Singapore. Association for Computational Linguistics.
- Bactrian-X: A multilingual replicable instruction-following model with low-rank adaptation. arXiv preprint arXiv:2305.15011.
- CMMLU: Measuring massive multitask language understanding in Chinese.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.
- Visual instruction tuning. In NeurIPS.
- MMBench: Is your multi-modal model an all-around player?
- LLM360: Towards fully transparent open-source LLMs.
- MathVista: Evaluating math reasoning in visual contexts with GPT-4V, Bard, and other large multimodal models. arXiv preprint arXiv:2310.02255.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
- GPT-4 technical report (OpenAI, 2023).
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149.
- Towards VQA models that can read.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- MM-Vet: Evaluating large multimodal models for integrated capabilities.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502.
- GLM-130B: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- M3EXAM: A multilingual, multimodal, multilevel benchmark for examining large language models.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.