Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation (2504.07072v2)

Published 9 Apr 2025 in cs.CL and cs.CV

Abstract: The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in size and language coverage, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope as the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.

Summary

  • The paper introduces Kaleidoscope, a comprehensive in-language benchmark with 20,911 questions in 18 languages across 14 subjects to evaluate vision-language models beyond English.
  • Model performance varies significantly across languages and subjects (stronger in humanities than in STEM), and models struggle with complex visual inputs and low-resource languages, highlighting current VLM limitations.
  • Kaleidoscope identifies key areas for improving VLM training, particularly for multimodal reasoning and low-resource language handling, to foster more inclusive and globally capable AI.

Essay on "Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation"

The paper "Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation" introduces a novel benchmark designed to evaluate vision-LLMs (VLMs) across multiple languages and subjects. Unlike existing benchmarks, which are predominantly confined to English and often rely on translations that fail to capture cultural nuances, Kaleidoscope is lauded for its inclusivity and cultural authenticity. The benchmark comprises 20,911 multiple-choice questions spanning 18 languages and 14 subjects, making it one of the most comprehensive in-language multimodal benchmarks available.

The paper begins by addressing the central issue in current VLM evaluation: an English-centric approach that reveals little about how well models handle diverse linguistic and cultural inputs. By developing Kaleidoscope, the authors aim to bridge this gap, providing a benchmark that evaluates not only linguistic capability but also cultural comprehension and multimodal reasoning.

The methodology section outlines a rigorous data collection process built on an open science collaboration model. This approach involves contributors from around the globe, ensuring both linguistic diversity and cultural relevance in the questions. Over 50% of the questions require image interpretation, posing a substantial challenge to current VLMs, which often show uneven performance across modalities. In particular, models exhibit a marked decline in accuracy on questions that require reasoning over the image, pointing to a key area for future research and model refinement.
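
As a rough illustration of what such in-language, image-grounded items involve, the sketch below defines a minimal record type for one multiple-choice question; the field names are assumptions made for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExamQuestion:
    """One Kaleidoscope-style multiple-choice item (illustrative schema, not the official one)."""
    language: str              # e.g. "es" for Spanish
    subject: str               # e.g. "physics" or "history"
    question: str              # question text, written natively in `language`
    options: list[str]         # candidate answers
    answer_index: int          # index of the correct option
    image_path: Optional[str]  # accompanying image, if any
    requires_image: bool       # True for items that cannot be answered from the text alone
```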

In terms of model performance, the paper evaluates a range of cutting-edge models, including open-weight models such as Aya-Vision, Molmo, and the Qwen family, as well as closed models like GPT-4o, Claude, and Gemini. A consistent finding is that models perform well on high-resource languages and simpler text-centric questions but substantially worse on low-resource languages and on questions where visual information is pivotal. This underscores a clear gap in current VLMs and points to the need for improved visual context handling and reasoning.
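
A minimal scoring loop for this kind of evaluation could look like the following sketch, which builds on the ExamQuestion record above; the prompt format and the ask_model callable are assumptions for illustration, not the paper's actual evaluation harness.

```python
from collections import defaultdict
from typing import Callable, Optional

LETTERS = "ABCD"

def evaluate(questions, ask_model: Callable[[str, Optional[str]], str]) -> dict[str, float]:
    """Per-language accuracy on multiple-choice items; `ask_model` wraps whatever VLM is under test."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:  # each q is an ExamQuestion from the sketch above
        option_block = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(q.options))
        prompt = f"{q.question}\n{option_block}\nAnswer with a single letter."
        reply = ask_model(prompt, q.image_path)
        # Take the first A-D letter in the reply as the model's choice.
        predicted = next((c for c in reply.strip().upper() if c in LETTERS), None)
        total[q.language] += 1
        correct[q.language] += int(predicted == LETTERS[q.answer_index])
    return {lang: correct[lang] / total[lang] for lang in total}
```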

Kaleidoscope also highlights domain-specific disparities, with models performing significantly better on humanities and social science questions compared to STEM subjects. This finding suggests a potential limitation in models' abilities to apply complex reasoning and problem-solving skills to scientific and technical domains, which often require precise interpretation of diagrams, graphs, and formulas.

One of the paper’s key technical insights is the identification of performance variance across image types. Models generally perform better on photographs and simple text-based images than on more complex visual data. Such insights are valuable for guiding future training strategies that expose models to a broader and more challenging range of visual data.
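
Breakdowns like these are straightforward to compute once per-question results are recorded; the snippet below shows one hypothetical way to slice accuracy by subject and image type, with made-up records whose field names are illustrative only.

```python
import pandas as pd

# Hypothetical per-question results collected during evaluation; the field names
# (subject, image_type, is_correct) and the values are made up for illustration.
results = [
    {"language": "es", "subject": "history", "image_type": "photograph", "is_correct": True},
    {"language": "es", "subject": "biology", "image_type": "diagram",    "is_correct": False},
    {"language": "hi", "subject": "physics", "image_type": "graph",      "is_correct": False},
]

df = pd.DataFrame(results)
# Accuracy by subject (humanities vs. STEM) and by image type (photographs vs. complex visuals).
print(df.groupby("subject")["is_correct"].mean())
print(df.groupby("image_type")["is_correct"].mean())
```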

In terms of practical applications, the authors propose Kaleidoscope not only as a tool for diagnosing current model limitations but also as a means of guiding future VLM design and training. By balancing coverage across high- and low-resource languages, the benchmark can help drive more inclusive AI systems that serve a global audience and account for linguistic and cultural diversity.

The paper further presents potential future directions, such as automated generation of linguistically and culturally diverse datasets and improved training algorithms that can handle complex multimodal inputs effectively. It points towards a broader vision of AI that transcends narrow linguistic confines and engages meaningfully with global cultural contexts.

In conclusion, "Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation" represents a pivotal step towards culturally and linguistically inclusive benchmarks that can more effectively measure VLM capabilities. It highlights the need for more robust models and suggests avenues for refining AI to be more representative and effective across varied global contexts. This work lays the groundwork for AI systems that are equitable and capable across a broad spectrum of human languages and cultural nuances.
