SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark (2402.05138v1)

Published 6 Feb 2024 in cs.AI and cs.CL

Abstract: The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level. It addresses a critical educational phase often overlooked in existing benchmarks, spanning high school to pre-college levels. SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology. It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities. Additionally, our benchmark provides specific knowledge points for each problem and detailed explanations for each answer. SceMQA also uniquely presents problems with identical contexts but varied questions to facilitate a more thorough and accurate assessment of reasoning capabilities. In the experiment, we evaluate both open-source and closed-source state-of-the-art Multimodal LLMs (MLLMs) across various experimental settings. The results show that further research and development are needed in developing more capable MLLMs, as highlighted by only 50% to 60% accuracy achieved by the strongest models. Our benchmark and analysis will be available at https://scemqa.github.io/

Summary

  • The paper introduces SceMQA, a comprehensive benchmark challenging AI with college entrance level science questions using both multiple-choice and free-response formats.
  • It evaluates state-of-the-art multimodal models, revealing performance gaps, particularly in mathematics and physics compared to chemistry and biology.
  • Error analysis uncovers reasoning flaws, image interpretation issues, and domain-specific knowledge limitations, guiding future improvements in AI research.

Introduction to SceMQA

The development of benchmarks in the field of AI, particularly for Multimodal LLMs (MLLMs), is an essential frontier for evaluating and pushing the boundaries of current technological capabilities. This blog post explores a novel benchmark, Scientific College Entrance Level Multimodal Question Answering (SceMQA), designed to assess AI models' aptitude at answering scientific questions across the core subjects of Mathematics, Physics, Chemistry, and Biology at the critical educational phase spanning high school to college entrance.

Dissecting SceMQA

SceMQA stands out for its comprehensive approach to assessing AI models. The benchmark combines multiple-choice and free-response formats, which together probe models across a spectrum of computational and reasoning tasks. Notably, each problem is annotated with the specific knowledge point it tests and a detailed explanation of its answer, ensuring transparency and a deeper view into a model's reasoning process. A further distinctive feature is the inclusion of problems that share an identical context but pose different questions, which challenges models to build genuine semantic understanding rather than rely on memorization or pattern recognition.
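This summary does not spell out the released data format, but a minimal sketch of how SceMQA-style items, including the shared-context design, might be represented is shown below; all field names, file paths, and example values here are illustrative assumptions rather than the official schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SceMQAItem:
    """One SceMQA-style problem; field names are illustrative, not the official schema."""
    subject: str                 # "Mathematics", "Physics", "Chemistry", or "Biology"
    question_type: str           # "multiple_choice" or "free_response"
    context_id: str              # shared by problems that reuse the same context/figure
    question: str                # question text, possibly referring to the image
    image_path: Optional[str]    # accompanying diagram, chart, or table
    choices: List[str] = field(default_factory=list)  # empty for free-response items
    answer: str = ""             # gold answer (choice label or final value)
    knowledge_point: str = ""    # the specific concept the problem tests
    explanation: str = ""        # detailed solution provided with the benchmark

# Two items that share one context but ask different questions, mirroring
# the benchmark's identical-context / varied-question design.
shared = dict(subject="Physics", context_id="ctx-042", image_path="ctx-042.png")
item_a = SceMQAItem(question_type="multiple_choice",
                    question="What is the net force on the block?",
                    choices=["2 N", "4 N", "6 N", "8 N"],
                    answer="B",
                    knowledge_point="Newton's second law",
                    explanation="Resolve forces along the incline ...",
                    **shared)
item_b = SceMQAItem(question_type="free_response",
                    question="Compute the block's acceleration.",
                    answer="2 m/s^2",
                    knowledge_point="Newton's second law",
                    explanation="a = F_net / m ...",
                    **shared)
```

Grouping items by a shared context identifier is one simple way to support the paired-question evaluation described above.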

Review of State-of-the-Art MLLM Performance on SceMQA

The performance of both open-source and closed-source MLLMs on SceMQA illustrates the current state of, and the challenges facing, AI in scientific problem solving. Several noteworthy conclusions can be drawn from the experiments:

  • The best-performing models only achieved 50% to 60% accuracy, underscoring the pressing need for further advancements in MLLM capabilities.
  • Closed-source models, while outperforming their open-source counterparts, still fall well short of human-level accuracy and understanding.
  • A closer look at subject-specific performance revealed that models generally fared better in Chemistry and Biology than in Mathematics and Physics, suggesting a particular challenge in domains that demand precise computation or intricate reasoning over scientific images and diagrams (a toy per-subject scoring sketch follows this list).
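The paper's exact evaluation pipeline is not reproduced in this post; as a rough illustration only, per-subject accuracy on the multiple-choice portion could be tallied as in the sketch below, which assumes exact-match scoring on choice labels and hypothetical record keys.

```python
from collections import defaultdict

def per_subject_accuracy(records):
    """records: iterable of dicts with hypothetical keys
    'subject', 'predicted', and 'answer' (multiple-choice labels)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["subject"]] += 1
        # Exact match on the choice label; free-response grading would need
        # answer normalization or human/LLM judging instead.
        if r["predicted"].strip().upper() == r["answer"].strip().upper():
            correct[r["subject"]] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy predictions (not real benchmark results):
records = [
    {"subject": "Chemistry",   "predicted": "C", "answer": "C"},
    {"subject": "Mathematics", "predicted": "A", "answer": "D"},
    {"subject": "Mathematics", "predicted": "B", "answer": "B"},
]
print(per_subject_accuracy(records))   # {'Chemistry': 1.0, 'Mathematics': 0.5}
```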

Insights into MLLM Limitations

An in-depth error analysis of state-of-the-art MLLMs such as GPT-4V sheds light on prevalent failure modes and offers directions for future research (a toy tally over these error categories follows the list):

  • Reasoning errors were common, pointing towards a gap in models' ability to construct and follow complex logical reasoning chains accurately.
  • Models exhibited image perception errors, particularly in accurately interpreting diagrams or tables, suggesting an area for improvement in visual processing capabilities.
  • A significant portion of errors was attributed to a lack of domain-specific knowledge, indicating the necessity for enriching training materials with diverse and comprehensive educational content.
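As a hedged illustration of how such an analysis can be summarized, the snippet below simply counts annotated failure cases per category; the labels and numbers are made up and are not the paper's reported statistics.

```python
from collections import Counter

# Hypothetical annotated failure cases; labels and counts are illustrative
# only and do not reflect the paper's actual error statistics.
failure_annotations = [
    "reasoning_error", "image_perception_error", "knowledge_gap",
    "reasoning_error", "reasoning_error", "image_perception_error",
]

counts = Counter(failure_annotations)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category:>24}: {n}  ({n / total:.0%})")
```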

Concluding Thoughts

SceMQA emerges as a pivotal benchmark in defining the next steps for research in MLLMs, particularly for applications in education and scientific research. By highlighting the existing limitations of AI models through a meticulously designed set of multimodal questions, SceMQA not only benchmarks current technologies but also outlines a roadmap for future advancements. The endeavor towards developing MLLMs capable of approaching or surpassing human-level performance in scientific domains is ongoing. SceMQA represents a step forward in this journey, providing valuable insights and challenging the AI community to push the boundaries of what these models can achieve.

As we move forward, the expansion of benchmarks like SceMQA, coupled with advancements in model capabilities, promises to revolutionize the application of AI in scientific comprehension and educational assistance, ultimately contributing to accelerating scientific discoveries and enhancing learning experiences.
