SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark (2402.05138v1)
Abstract: The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level. It addresses a critical educational phase, spanning high school to pre-college, that existing benchmarks often overlook. SceMQA covers the core science subjects of Mathematics, Physics, Chemistry, and Biology, and features a blend of multiple-choice and free-response formats to ensure a comprehensive evaluation of AI models' abilities. Additionally, the benchmark provides the specific knowledge point tested by each problem and a detailed explanation for each answer. SceMQA also uniquely presents problems with identical contexts but varied questions, enabling a more thorough and accurate assessment of reasoning capabilities. In the experiments, we evaluate both open-source and closed-source state-of-the-art Multimodal LLMs (MLLMs) across various experimental settings. The results show that more capable MLLMs still need to be developed, as the strongest models achieve only 50% to 60% accuracy. Our benchmark and analysis will be available at https://scemqa.github.io/
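To make the benchmark's structure concrete, the sketch below shows one way a SceMQA-style record (subject, question type, shared context, knowledge point, answer, explanation) and a simple multiple-choice accuracy computation could be represented. All field and function names here are illustrative assumptions, not the paper's actual release format or evaluation code.

```python
# Hypothetical sketch of a SceMQA-style item and a minimal accuracy metric.
# Field names, the dataclass layout, and the scoring rule are assumptions
# made for illustration; they are not taken from the paper's released code.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SceMQAItem:
    subject: str               # e.g. "Mathematics", "Physics", "Chemistry", "Biology"
    question_type: str         # "multiple_choice" or "free_response"
    context: str               # shared problem context (may be reused across questions)
    question: str              # the specific question asked about that context
    image_path: Optional[str]  # path to the associated figure, if any
    knowledge_point: str       # the concept the problem is labeled as testing
    answer: str                # gold answer (an option letter for multiple choice)
    explanation: str           # detailed solution provided with the benchmark


def multiple_choice_accuracy(items: List[SceMQAItem], predictions: List[str]) -> float:
    """Fraction of multiple-choice items whose predicted option letter matches the gold answer."""
    pairs = [(item, pred) for item, pred in zip(items, predictions)
             if item.question_type == "multiple_choice"]
    if not pairs:
        return 0.0
    correct = sum(item.answer.strip().upper() == pred.strip().upper()
                  for item, pred in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    demo = [
        SceMQAItem(
            subject="Physics",
            question_type="multiple_choice",
            context="A block slides down a frictionless incline shown in the figure.",
            question="What is the block's acceleration along the incline?",
            image_path="figures/incline.png",
            knowledge_point="Newton's second law on an incline",
            answer="B",
            explanation="Resolving gravity along the incline gives a = g sin(theta) ...",
        ),
    ]
    print(multiple_choice_accuracy(demo, ["B"]))  # -> 1.0
```

Free-response items would need a different scoring rule (e.g. numeric or symbolic answer matching), which is why the sketch restricts itself to the multiple-choice portion.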