SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark (2402.05138v1)

Published 6 Feb 2024 in cs.AI and cs.CL

Abstract: The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level. It addresses a critical educational phase often overlooked in existing benchmarks, spanning high school to pre-college levels. SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology. It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities. Additionally, our benchmark provides specific knowledge points for each problem and detailed explanations for each answer. SceMQA also uniquely presents problems with identical contexts but varied questions to facilitate a more thorough and accurate assessment of reasoning capabilities. In the experiment, we evaluate both open-source and closed-source state-of-the-art Multimodal LLMs (MLLMs) across various experimental settings. The results show that further research and development are needed in developing more capable MLLMs, as highlighted by only 50% to 60% accuracy achieved by the strongest models. Our benchmark and analysis will be available at https://scemqa.github.io/

Summary

  • The paper introduces SceMQA, a comprehensive benchmark challenging AI with college entrance level science questions using both multiple-choice and free-response formats.
  • It evaluates state-of-the-art multimodal models, revealing performance gaps, particularly in mathematics and physics compared to chemistry and biology.
  • Error analysis uncovers reasoning flaws, image interpretation issues, and domain-specific knowledge limitations, guiding future improvements in AI research.

Introduction to SceMQA

The development of benchmarks in the field of AI, particularly for Multimodal LLMs (MLLMs), is an essential frontier for evaluating and pushing the boundaries of current technological capabilities. This blog post explores a novel benchmark, Scientific College Entrance Level Multimodal Question Answering (SceMQA), designed to assess AI models' aptitude at answering scientific questions across the core subjects of Mathematics, Physics, Chemistry, and Biology at the critical educational phase spanning high school to college entrance.

Dissecting SceMQA

SceMQA stands out for its comprehensive approach to assessing AI models. The benchmark combines multiple-choice and free-response formats, which together probe models across a spectrum of computational and reasoning tasks. Notably, each problem is annotated with the specific knowledge point it tests and a detailed explanation of its answer, ensuring transparency and a deeper view into a model's reasoning process. A further distinctive feature is the inclusion of problems that share an identical context but pose different questions, which challenges models to build genuine semantic understanding rather than rely on memorization or pattern recognition.
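This summary does not spell out the released data format, but a minimal sketch of how SceMQA-style items, including the shared-context design, might be represented is shown below; all field names, file paths, and example values here are illustrative assumptions rather than the official schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SceMQAItem:
    """One SceMQA-style problem; field names are illustrative, not the official schema."""
    subject: str                 # "Mathematics", "Physics", "Chemistry", or "Biology"
    question_type: str           # "multiple_choice" or "free_response"
    context_id: str              # shared by problems that reuse the same context/figure
    question: str                # question text, possibly referring to the image
    image_path: Optional[str]    # accompanying diagram, chart, or table
    choices: List[str] = field(default_factory=list)  # empty for free-response items
    answer: str = ""             # gold answer (choice label or final value)
    knowledge_point: str = ""    # the specific concept the problem tests
    explanation: str = ""        # detailed solution provided with the benchmark

# Two items that share one context but ask different questions, mirroring
# the benchmark's identical-context / varied-question design.
shared = dict(subject="Physics", context_id="ctx-042", image_path="ctx-042.png")
item_a = SceMQAItem(question_type="multiple_choice",
                    question="What is the net force on the block?",
                    choices=["2 N", "4 N", "6 N", "8 N"],
                    answer="B",
                    knowledge_point="Newton's second law",
                    explanation="Resolve forces along the incline ...",
                    **shared)
item_b = SceMQAItem(question_type="free_response",
                    question="Compute the block's acceleration.",
                    answer="2 m/s^2",
                    knowledge_point="Newton's second law",
                    explanation="a = F_net / m ...",
                    **shared)
```

Grouping items by a shared context identifier is one simple way to support the paired-question evaluation described above.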

Review of State-of-the-Art MLLM Performance on SceMQA

The performance of both open-source and closed-source MLLMs on SceMQA illustrates the current state of, and the challenges facing, AI in scientific problem solving. Several noteworthy conclusions can be drawn from the experiments:

  • The best-performing models only achieved 50% to 60% accuracy, underscoring the pressing need for further advancements in MLLM capabilities.
  • Closed-source models, while outperforming their open-source counterparts, still fall well short of human-level accuracy and understanding.
  • A closer look at subject-specific performance revealed that models generally fared better in Chemistry and Biology than in Mathematics and Physics, suggesting a particular challenge in domains that demand precise computation or intricate reasoning over scientific images and diagrams (a toy per-subject scoring sketch follows this list).
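The paper's exact evaluation pipeline is not reproduced in this post; as a rough illustration only, per-subject accuracy on the multiple-choice portion could be tallied as in the sketch below, which assumes exact-match scoring on choice labels and hypothetical record keys.

```python
from collections import defaultdict

def per_subject_accuracy(records):
    """records: iterable of dicts with hypothetical keys
    'subject', 'predicted', and 'answer' (multiple-choice labels)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["subject"]] += 1
        # Exact match on the choice label; free-response grading would need
        # answer normalization or human/LLM judging instead.
        if r["predicted"].strip().upper() == r["answer"].strip().upper():
            correct[r["subject"]] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy predictions (not real benchmark results):
records = [
    {"subject": "Chemistry",   "predicted": "C", "answer": "C"},
    {"subject": "Mathematics", "predicted": "A", "answer": "D"},
    {"subject": "Mathematics", "predicted": "B", "answer": "B"},
]
print(per_subject_accuracy(records))   # {'Chemistry': 1.0, 'Mathematics': 0.5}
```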

Insights into MLLM Limitations

An in-depth error analysis of state-of-the-art MLLMs such as GPT-4V sheds light on prevalent failure modes and offers directions for future research (a toy tally over these error categories follows the list):

  • Reasoning errors were common, pointing towards a gap in models' ability to construct and follow complex logical reasoning chains accurately.
  • Models exhibited image perception errors, particularly in accurately interpreting diagrams or tables, suggesting an area for improvement in visual processing capabilities.
  • A significant portion of errors was attributed to a lack of domain-specific knowledge, indicating the necessity for enriching training materials with diverse and comprehensive educational content.
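As a hedged illustration of how such an analysis can be summarized, the snippet below simply counts annotated failure cases per category; the labels and numbers are made up and are not the paper's reported statistics.

```python
from collections import Counter

# Hypothetical annotated failure cases; labels and counts are illustrative
# only and do not reflect the paper's actual error statistics.
failure_annotations = [
    "reasoning_error", "image_perception_error", "knowledge_gap",
    "reasoning_error", "reasoning_error", "image_perception_error",
]

counts = Counter(failure_annotations)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category:>24}: {n}  ({n / total:.0%})")
```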

Concluding Thoughts

SceMQA emerges as a pivotal benchmark in defining the next steps for research in MLLMs, particularly for applications in education and scientific research. By highlighting the existing limitations of AI models through a meticulously designed set of multimodal questions, SceMQA not only benchmarks current technologies but also outlines a roadmap for future advancements. The endeavor towards developing MLLMs capable of approaching or surpassing human-level performance in scientific domains is ongoing. SceMQA represents a step forward in this journey, providing valuable insights and challenging the AI community to push the boundaries of what these models can achieve.

As we move forward, the expansion of benchmarks like SceMQA, coupled with advancements in model capabilities, promises to revolutionize the application of AI in scientific comprehension and educational assistance, ultimately contributing to accelerating scientific discoveries and enhancing learning experiences.
