EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models (2403.10378v1)

Published 15 Mar 2024 in cs.CL and cs.CV

Abstract: We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies, e.g., religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from various countries, with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision-text models such as GPT-4V and Gemini; this underscores the inherent complexity of the dataset and its significance as a future benchmark.


Summary

  • The paper presents EXAMS-V, a benchmark featuring 20,932 exam questions in 20 subjects and 11 languages to test vision-language models.
  • It details a robust methodology with extensive data collection and multimodal annotation to capture real-world exam complexities.
  • Experimental results reveal that even top models struggle with EXAMS-V, underscoring the need for future improvements in multilingual and multimodal reasoning.

EXAMS-V: Evaluating Vision Language Models Across Multilingual and Multidisciplinary Domains

Introduction

The paper introduces EXAMS-V, a comprehensive and challenging benchmark designed for the evaluation of vision language models (VLMs) across multiple disciplines and languages. The dataset offers a distinctive combination of features: a wide range of school subjects, coverage of multiple languages, and diverse multimodal content. EXAMS-V represents a significant step towards better understanding and evaluating VLMs, especially their multilingual capabilities and their ability to reason over complex, multimodal information.

EXAMS-V Dataset: Composition and Characteristics

EXAMS-V stands out with its rich dataset features, designed to push the capabilities of current VLMs. The dataset includes 20,932 questions across 20 school subjects, such as natural science, social science, and various applied studies. These questions are not only textual but also include a variety of visual elements like images, diagrams, scientific symbols, and tables, demanding advanced perception and reasoning skills from the models. Moreover, the dataset's multilingual aspect, with questions provided in 11 languages from 7 language families, introduces an additional layer of complexity, emphasizing the need for models to have strong multilingual and multimodal understanding.
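To make the dataset's structure concrete, here is a minimal sketch of how one question item might be represented in code. The field names (question_id, language, subject, image_path, answer) are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical schema for a single EXAMS-V item; field names are illustrative.
@dataclass
class ExamsVQuestion:
    question_id: str
    language: str    # one of the 11 languages, e.g. "bg", "hr", "zh"
    subject: str     # one of the 20 school subjects, e.g. "Physics"
    image_path: str  # the question image combining text, options, and any figures/tables
    answer: str      # gold option label, e.g. "A"-"D"

def filter_split(items: list[ExamsVQuestion], language: str, subject: str) -> list[ExamsVQuestion]:
    """Select the evaluation slice for one language/subject pair."""
    return [q for q in items if q.language == language and q.subject == subject]
```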

Data Collection and Preprocessing

The construction of EXAMS-V involved meticulous data collection and preparation, ensuring a wide coverage of subjects and languages. By gathering school exam questions from diverse countries and education systems, the dataset mirrors real-world complexity and variety in question formatting and content. The subsequent preprocessing steps, including PDF to image conversion and detailed annotation, were aimed at preserving the integrity and multimodal nature of the original exam questions.
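The paper describes converting scanned exam PDFs into images before annotation. The sketch below illustrates that conversion step using the pdf2image library as one possible tool; the authors' actual tooling and parameters are not specified here, so this is an assumption-laden illustration rather than their pipeline.

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler to be installed

def pdf_exam_to_page_images(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Render each page of a scanned exam PDF to a PNG image.

    Cropping individual questions out of each page and attaching
    subject/language metadata would follow as separate annotation steps.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi)):
        path = out / f"{Path(pdf_path).stem}_page{i:03d}.png"
        page.save(path, "PNG")
        paths.append(str(path))
    return paths
```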

Dataset Statistics and Comparison

A closer look at the dataset reveals its expansive scope and diversity:

  • 11 Languages: From high-resource languages like English and Chinese to low-resource ones like Bulgarian and Croatian.
  • 20 Subjects: Spanning natural sciences, social sciences, and miscellaneous studies such as religion, fine arts, and business.
  • Multimodal Content: Rich in visual elements requiring intricate reasoning beyond simple text comprehension.

Unlike existing benchmarks, EXAMS-V integrates the text and visual elements of each question into a single image, which makes it a formidable challenge even for the most advanced VLMs.

Experimental Setup and Evaluation

The evaluation of EXAMS-V involved a range of state-of-the-art VLMs, including GPT-4V and Gemini, under a zero-shot setting. This approach aimed to assess the models' abilities to reason over and understand the dataset's complex, multimodal, and multilingual content without prior fine-tuning or specific model adjustments.
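As a rough illustration of this zero-shot protocol, the sketch below queries an OpenAI vision-capable chat model with a single question image. The prompt wording, the model identifier, and the single-letter answer format are illustrative assumptions, not the exact setup reported in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_exam_question(image_path: str, language: str, model: str = "gpt-4o") -> str:
    """Zero-shot query: the model sees only the question image plus a short
    instruction and is asked to reply with a single option letter."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"The image contains a multiple-choice exam question in {language}. "
                         "Answer with the letter of the correct option only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```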

Results and Analysis

The experimental results underscore the challenging nature of EXAMS-V. Even high-performing VLMs like GPT-4V struggled to achieve scores significantly above the baseline, indicating a substantial gap between current model capabilities and the dataset's demands. These findings highlight EXAMS-V's value as a benchmark, emphasizing the need for further research and development in VLMs to improve their performance on complex, real-world tasks.
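Scoring such a benchmark typically reduces to extracting an option letter from each free-form reply and comparing it against the gold answer. The helpers below are a simplified sketch of that step, not the paper's actual evaluation harness; real multilingual outputs need more careful parsing (localized option labels, "the answer is ..." phrasings, refusals).

```python
import re

def extract_choice(model_output: str, options: str = "ABCD") -> str | None:
    """Pull the first standalone option letter out of a free-form reply."""
    match = re.search(rf"\b([{options}])\b", model_output.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str | None], gold: list[str]) -> float:
    """Fraction of questions where the extracted letter matches the gold answer."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold) if gold else 0.0
```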

Conclusions and Future Directions

The introduction of EXAMS-V marks an important milestone in the evaluation of VLMs, particularly in the context of multilingual and multimodal understanding. The dataset's complexity and diversity present a substantial challenge, pointing out clear directions for future research in the field of artificial intelligence. Future work could focus on expanding the dataset further, incorporating more languages, modalities, and subjects to continue pushing the boundaries of what VLMs can achieve.
