- The paper introduces MMSciBench, a benchmark evaluating LLMs and LVLMs on Chinese multimodal math and physics tasks using both multiple-choice and open-ended formats.
- It employs a hierarchical taxonomy with human-annotated difficulty levels to rigorously assess performance in text-only versus text-image modalities.
- Experiments show state-of-the-art models achieving up to 63.77% accuracy, highlighting significant room for improvement in visual and complex scientific reasoning.
MMSciBench: Benchmarking LLMs on Chinese Multimodal Scientific Problems
Introduction
"MMSciBench: Benchmarking LLMs on Chinese Multimodal Scientific Problems" introduces the MMSciBench benchmark to assess LLMs' (LLMs) and vision-LLMs' (LVLMs) capabilities in mathematical and physical reasoning within multimodal settings. This text-image benchmark is intended to fill gaps in current scientific evaluation practices, offering difficulty levels and a structured taxonomy of problems.
MMSciBench Overview
MMSciBench provides a comprehensive evaluation framework combining multiple-choice questions (MCQs) and open-ended question-answer (Q&A) pairs for math and physics. Items come in text-only and text-image formats, allowing direct comparison of models on unimodal versus multimodal tasks.
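To make the format concrete, here is a rough sketch of how a single benchmark item could be represented in code. The field names, label sets, and the example entry are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class SciBenchItem:
    question: str                                   # Chinese problem statement
    subject: Literal["math", "physics"]
    question_type: Literal["mcq", "open_ended"]     # multiple-choice vs. free-form Q&A
    modality: Literal["text_only", "text_image"]    # unimodal vs. multimodal
    choices: Optional[list[str]] = None             # populated only for MCQs
    answer: str = ""                                # gold answer (option letter or worked solution)
    image_path: Optional[str] = None                # present only for text-image items


# A hypothetical text-only multiple-choice physics entry:
example = SciBenchItem(
    question="一个物体沿直线做匀加速运动……下列说法正确的是？",
    subject="physics",
    question_type="mcq",
    modality="text_only",
    choices=["A. ...", "B. ...", "C. ...", "D. ..."],
    answer="B",
)
```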
Figure 1: An overview of MMSciBench, showing the question distribution, dataset features, and the evaluation framework.
Beyond an extensive data-collection process intended to ensure quality and rigor, the benchmark is organized by a hierarchical taxonomy that categorizes scientific concepts into Domain, Module, and Chapter levels. Human-annotated difficulty levels accompany the problems, and the dataset's multimodal design is meant to enable a rigorous evaluation of models' scientific reasoning.
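A minimal sketch of how the taxonomy and difficulty annotations might be modelled follows. The three level names come from the paper, but the example labels and the three-way difficulty scale are assumptions.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class TaxonomyPath:
    domain: str     # e.g. "Mechanics"
    module: str     # e.g. "Kinematics"
    chapter: str    # e.g. "Uniformly accelerated motion"


@dataclass
class AnnotatedItem:
    item_id: str
    taxonomy: TaxonomyPath
    difficulty: Literal["easy", "medium", "hard"]   # assumed label set


def group_by_domain(items: list[AnnotatedItem]) -> dict[str, list[AnnotatedItem]]:
    """Group items by the top taxonomy level for per-domain reporting."""
    groups: dict[str, list[AnnotatedItem]] = {}
    for item in items:
        groups.setdefault(item.taxonomy.domain, []).append(item)
    return groups
```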
Figure 2: The distribution of data in MMSciBench according to the first-level key knowledge points for each subject.
Evaluation and Analysis
The benchmark was tested on state-of-the-art LVLMs, including Gemini 1.5 Pro 002, Qwen2-VL-72B-Instruct, and Claude 3.5 Sonnet, alongside two math-specialized LLMs. Through these experiments, MMSciBench highlights key challenges facing current models: degraded performance on open-ended tasks, limited visual-textual integration, and difficulty with complex reasoning.
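The sketch below outlines a simplified evaluation loop in this spirit, reusing the hypothetical `SciBenchItem` structure from above. `query_model` is a placeholder for each model's actual inference call, and the scoring rules shown are assumptions rather than the paper's exact protocol.

```python
from typing import Optional


def query_model(model_name: str, question: str, image_path: Optional[str] = None) -> str:
    """Placeholder for a model-specific LLM/LVLM inference call (API or local)."""
    raise NotImplementedError


def evaluate(model_name: str, items: list[SciBenchItem]) -> float:
    """Return overall accuracy of `model_name` on the given items."""
    correct = 0
    for item in items:
        prediction = query_model(model_name, item.question, item.image_path)
        if item.question_type == "mcq":
            # MCQs can be scored by matching the predicted option letter.
            correct += prediction.strip().upper().startswith(item.answer.strip().upper())
        else:
            # Open-ended answers generally need a more lenient judge (see the GPT-4o
            # note below); exact string match is only a crude stand-in here.
            correct += prediction.strip() == item.answer.strip()
    return correct / max(len(items), 1)
```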
Figure 3: Accuracies of models across different key knowledge points.
Even the best-performing model reached only 63.77% accuracy and struggled noticeably on multimodal tasks. Accuracy dropped further on open-ended problem-solving and visual reasoning, underscoring how hard current models find it to integrate visual context and sustain accuracy on more complex problems.
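Breakdowns of this kind are straightforward to compute once per-item correctness is recorded. The short sketch below shows one way to aggregate accuracy by modality and question type, again using the hypothetical fields introduced earlier.

```python
from collections import defaultdict


def accuracy_breakdown(results: list[tuple[SciBenchItem, bool]]) -> dict[str, float]:
    """Aggregate per-item correctness into accuracy per modality and question type."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for item, is_correct in results:
        buckets[f"modality={item.modality}"].append(is_correct)
        buckets[f"type={item.question_type}"].append(is_correct)
    return {key: sum(values) / len(values) for key, values in buckets.items()}
```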
Implementation Considerations
MMSciBench is released with an open-source codebase, and the datasets are accessible via Hugging Face, facilitating reproducibility and further exploration by the scientific community. The use of GPT-4o as an automated evaluator exemplifies how existing resources can be integrated to ensure consistent assessment. However, some models fail to adhere to the specified output formats, which suggests the need for more robust prompting strategies to streamline assessment.
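As an illustration, a GPT-4o-as-judge call for open-ended answers might look like the sketch below. The judging prompt and the '1'/'0' output convention are assumptions, not the paper's actual evaluation prompt.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_open_ended(question: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4o whether a free-form prediction matches the reference answer."""
    prompt = (
        "You are grading a science answer. Reply with exactly '1' if the prediction "
        "matches the reference answer in substance, and '0' otherwise.\n\n"
        f"Question: {question}\nReference answer: {reference}\nPrediction: {prediction}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().startswith("1")
```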
Conclusion
MMSciBench sets a rigorous standard for evaluating LLM and LVLM performance in multimodal scientific reasoning. The finding that current models leave significant room for improvement, especially in visual reasoning, establishes MMSciBench as a critical tool for both AI research and the development of models capable of sophisticated scientific understanding. It thus serves as both a challenge and a roadmap for refining AI's capabilities in practical and theoretical scientific applications.