WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts (2506.15594v1)

Published 18 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-LLMs, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

Summary

An Examination of WikiMixQA: Advancing Multimodal Document Understanding

The paper introduces WikiMixQA, a benchmark designed to evaluate question answering (QA) over documents that incorporate complex modalities such as tables and charts. The effort responds to a growing need in document understanding (DU) to integrate information from diverse sources. Traditional NLP models have struggled with the intricate layouts of multimodal documents, particularly from sources like Wikipedia, and vision-language models (VLMs), while promising, still face difficulties with long-context visual inputs.

Benchmark Design and Dataset Characteristics

WikiMixQA comprises 1,000 multiple-choice questions derived from tables and charts extracted from approximately 4,000 Wikipedia pages. The collection spans seven domains: Economy, Geography, History, Politics, Science, Sport, and Wikimedia, offering broad contextual coverage. The benchmark's distinguishing characteristic is its focus on cross-modal reasoning: questions require synthesizing information across modalities, a task underrepresented in existing benchmarks.
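To make the setup concrete, a single benchmark item can be pictured as a record that pairs a question and its answer options with the table and chart evidence it draws on. The sketch below is purely illustrative; the field names and example values are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WikiMixQAItem:
    """Hypothetical record for one multiple-choice question.

    Field names are illustrative assumptions; the released dataset may
    use a different schema.
    """
    question: str                  # natural-language question
    options: List[str]             # candidate answers, one of which is correct
    answer_index: int              # index of the correct option
    domain: str                    # one of the seven topics, e.g. "Economy"
    source_page: str               # Wikipedia page the evidence was taken from
    evidence: List[str] = field(default_factory=list)  # table/chart image files

# Invented example, for illustration only
item = WikiMixQAItem(
    question="Which year shows the largest gap between the two reported values?",
    options=["1998", "2004", "2010", "2016"],
    answer_index=2,
    domain="Economy",
    source_page="https://en.wikipedia.org/wiki/Example",
    evidence=["table_0421.png", "chart_0421.png"],
)
```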

Dataset construction followed a rigorous pipeline: collection of multimodal Wikipedia articles, identification of semantically similar modality pairs, MCQ generation with GPT-4-turbo, and human annotation for quality assurance. The result is a dataset that emphasizes complex, multimodal reasoning.
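A minimal sketch of such a pipeline is given below, assuming precomputed embeddings (e.g., of table and chart captions) for the pairing step; the function names, the cosine-similarity heuristic and its threshold, and the stubbed stages are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pair_similar_modalities(table_embs: np.ndarray, chart_embs: np.ndarray,
                            threshold: float = 0.8):
    """Pair each table with its most similar chart by cosine similarity,
    keeping only pairs above a threshold (the 0.8 value is an assumption)."""
    t = table_embs / np.linalg.norm(table_embs, axis=1, keepdims=True)
    c = chart_embs / np.linalg.norm(chart_embs, axis=1, keepdims=True)
    sims = t @ c.T                                    # (n_tables, n_charts)
    best = sims.argmax(axis=1)                        # best chart per table
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]

def generate_mcqs(pairs, model="gpt-4-turbo"):
    """Placeholder: prompt an LLM to write questions that require combining
    information from both elements of each table-chart pair."""
    ...

def human_filter(candidate_mcqs):
    """Placeholder: keep only questions annotators judge answerable and correct."""
    ...
```

In this sketch, only pairs that clear the similarity threshold are passed to question generation and then to human review, mirroring the filter-then-annotate structure described above.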

Evaluation and Findings

The authors evaluated 12 state-of-the-art vision-language models on WikiMixQA under three configurations: no context, explicit context (oracle setting), and full long documents (wikidoc setting). Proprietary models such as GPT-4o showed marked differences across settings, reaching roughly 71% accuracy in the oracle setting but dropping sharply when the relevant evidence had to be located within long documents. Open-source models performed considerably worse, peaking at 27% accuracy.
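The three configurations differ only in what accompanies the question at inference time. The sketch below illustrates this under assumptions: the prompt format, the `vlm_answer` callable, and the page-loading and retrieval helpers are placeholders rather than the paper's evaluation harness, and the item record follows the hypothetical schema sketched earlier.

```python
def build_prompt(question, options):
    """Format a multiple-choice prompt (the format is an assumption)."""
    letters = "ABCDEFGH"
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
    return f"{question}\n{opts}\nAnswer with a single letter."

def load_page_images(url):
    """Placeholder: return rendered pages / extracted figures for a Wikipedia URL."""
    ...

def evaluate(item, setting, vlm_answer, retrieve=None):
    """Run one item in one of the three settings.

    - "no_context": question and options only.
    - "oracle":     the gold table and chart are supplied directly.
    - "wikidoc":    the full Wikipedia page(s) are supplied; short-context
                    models would need a retrieval step to locate the evidence.
    vlm_answer(prompt, images) stands in for any vision-language model API.
    """
    prompt = build_prompt(item.question, item.options)
    if setting == "no_context":
        images = []
    elif setting == "oracle":
        images = item.evidence
    elif setting == "wikidoc":
        pages = load_page_images(item.source_page)
        images = retrieve(prompt, pages) if retrieve else pages
    else:
        raise ValueError(f"unknown setting: {setting}")
    return vlm_answer(prompt, images)
```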

These findings underscore the difficulty of DU tasks involving long contexts and multiple modalities. While models perform well when the relevant evidence is provided directly, retrieving and contextualizing it from lengthy documents remains a significant challenge. This gap highlights the need for more sophisticated model architectures or training paradigms to strengthen DU capabilities.

Implications and Future Directions

WikiMixQA positions itself as a valuable resource for advancing document understanding research. By focusing on multimodal reasoning over long documents, it challenges existing models and exposes their current deficiencies. Practically, these insights could guide the development of more adaptive DU systems that generalize to diverse real-world documents.

Theoretically, WikiMixQA could steer research toward better handling of visual-textual synergy, informing architectures that parse and reason over integrated multimodal data more effectively. Further work could focus on strengthening multimodal LLMs' capabilities in this regime, potentially narrowing the performance gap between human and machine document interpretation.

Overall, the benchmark is a robust tool for both evaluating and driving advances in automatic document understanding, supporting the community's effort to move beyond the limitations of current models.
