EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models (2403.10378v1)

Published 15 Mar 2024 in cs.CL and cs.CV

Abstract: We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies, e.g., religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from various countries, with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision-text models such as GPT-4V and Gemini; this underscores the inherent complexity of the dataset and its significance as a future benchmark.


Summary

  • The paper presents EXAMS-V, a benchmark featuring 20,932 exam questions in 20 subjects and 11 languages to test vision-language models.
  • It details a robust methodology with extensive data collection and multimodal annotation to capture real-world exam complexities.
  • Experimental results reveal that even top models struggle with EXAMS-V, underscoring the need for future improvements in multilingual and multimodal reasoning.

EXAMS-V: Evaluating Vision Language Models Across Multilingual and Multidisciplinary Domains

Introduction

The paper introduces EXAMS-V, a comprehensive and challenging benchmark designed for the evaluation of vision language models (VLMs) across multiple disciplines and languages. The dataset offers a distinctive combination of features: a wide range of school subjects, coverage of multiple languages, and diverse multimodal content. EXAMS-V represents a significant step towards better understanding and evaluating VLMs, especially their multilingual capabilities and their ability to reason over complex, multimodal information.

EXAMS-V Dataset: Composition and Characteristics

EXAMS-V stands out with its rich dataset features, designed to push the capabilities of current VLMs. The dataset includes 20,932 questions across 20 school subjects, such as natural science, social science, and various applied studies. These questions are not only textual but also include a variety of visual elements like images, diagrams, scientific symbols, and tables, demanding advanced perception and reasoning skills from the models. Moreover, the dataset's multilingual aspect, with questions provided in 11 languages from 7 language families, introduces an additional layer of complexity, emphasizing the need for models to have strong multilingual and multimodal understanding.
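To make the dataset's structure concrete, here is a minimal sketch of how one question item might be represented in code. The field names (question_id, language, subject, image_path, answer) are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical schema for a single EXAMS-V item; field names are illustrative.
@dataclass
class ExamsVQuestion:
    question_id: str
    language: str    # one of the 11 languages, e.g. "bg", "hr", "zh"
    subject: str     # one of the 20 school subjects, e.g. "Physics"
    image_path: str  # the question image combining text, options, and any figures/tables
    answer: str      # gold option label, e.g. "A"-"D"

def filter_split(items: list[ExamsVQuestion], language: str, subject: str) -> list[ExamsVQuestion]:
    """Select the evaluation slice for one language/subject pair."""
    return [q for q in items if q.language == language and q.subject == subject]
```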

Data Collection and Preprocessing

The construction of EXAMS-V involved meticulous data collection and preparation, ensuring a wide coverage of subjects and languages. By gathering school exam questions from diverse countries and education systems, the dataset mirrors real-world complexity and variety in question formatting and content. The subsequent preprocessing steps, including PDF to image conversion and detailed annotation, were aimed at preserving the integrity and multimodal nature of the original exam questions.
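The paper describes converting scanned exam PDFs into images before annotation. The sketch below illustrates that conversion step using the pdf2image library as one possible tool; the authors' actual tooling and parameters are not specified here, so this is an assumption-laden illustration rather than their pipeline.

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler to be installed

def pdf_exam_to_page_images(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Render each page of a scanned exam PDF to a PNG image.

    Cropping individual questions out of each page and attaching
    subject/language metadata would follow as separate annotation steps.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi)):
        path = out / f"{Path(pdf_path).stem}_page{i:03d}.png"
        page.save(path, "PNG")
        paths.append(str(path))
    return paths
```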

Dataset Statistics and Comparison

A closer look at the dataset reveals its expansive scope and diversity:

  • 11 Languages: From high-resource languages like English and Chinese to low-resource ones like Bulgarian and Croatian.
  • 20 Subjects: Spanning natural sciences, social sciences, and miscellaneous studies such as religion, fine arts, and business.
  • Multimodal Content: Rich in visual elements requiring intricate reasoning beyond simple text comprehension.

Unlike existing benchmarks, EXAMS-V integrates the text and visual elements of each question into a single image, which makes it a formidable challenge even for the most advanced VLMs.

Experimental Setup and Evaluation

The evaluation of EXAMS-V involved a range of state-of-the-art VLMs, including GPT-4V and Gemini, under a zero-shot setting. This approach aimed to assess the models' abilities to reason over and understand the dataset's complex, multimodal, and multilingual content without prior fine-tuning or specific model adjustments.
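As a rough illustration of this zero-shot protocol, the sketch below queries an OpenAI vision-capable chat model with a single question image. The prompt wording, the model identifier, and the single-letter answer format are illustrative assumptions, not the exact setup reported in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_exam_question(image_path: str, language: str, model: str = "gpt-4o") -> str:
    """Zero-shot query: the model sees only the question image plus a short
    instruction and is asked to reply with a single option letter."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"The image contains a multiple-choice exam question in {language}. "
                         "Answer with the letter of the correct option only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```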

Results and Analysis

The experimental results underscore the challenging nature of EXAMS-V. Even high-performing VLMs like GPT-4V struggled to achieve scores significantly above the baseline, indicating a substantial gap between current model capabilities and the dataset's demands. These findings highlight EXAMS-V's value as a benchmark, emphasizing the need for further research and development in VLMs to improve their performance on complex, real-world tasks.
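Scoring such a benchmark typically reduces to extracting an option letter from each free-form reply and comparing it against the gold answer. The helpers below are a simplified sketch of that step, not the paper's actual evaluation harness; real multilingual outputs need more careful parsing (localized option labels, "the answer is ..." phrasings, refusals).

```python
import re

def extract_choice(model_output: str, options: str = "ABCD") -> str | None:
    """Pull the first standalone option letter out of a free-form reply."""
    match = re.search(rf"\b([{options}])\b", model_output.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str | None], gold: list[str]) -> float:
    """Fraction of questions where the extracted letter matches the gold answer."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold) if gold else 0.0
```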

Conclusions and Future Directions

The introduction of EXAMS-V marks an important milestone in the evaluation of VLMs, particularly in the context of multilingual and multimodal understanding. The dataset's complexity and diversity present a substantial challenge, pointing out clear directions for future research in the field of artificial intelligence. Future work could focus on expanding the dataset further, incorporating more languages, modalities, and subjects to continue pushing the boundaries of what VLMs can achieve.
