MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding (2407.04903v2)

Published 6 Jul 2024 in cs.CL, cs.AI, and cs.CV

Abstract: The rapid development of Multimodal LLMs (MLLMs) is making AI-driven scientific assistants increasingly feasible, with interpreting scientific figures being a crucial task. However, existing datasets and benchmarks focus mainly on basic charts and limited science subjects, lacking comprehensive evaluations. To address this, we curated a multimodal, multidisciplinary dataset from peer-reviewed, open-access Nature Communications articles, spanning 72 scientific disciplines. This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations (e.g., western blots), which often require graduate-level, discipline-specific expertise to interpret. We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models across varied settings. The results highlight the high difficulty of these tasks and the significant performance gap among models. While many open-source models performed at chance level on the multiple-choice task, some matched the performance of proprietary models. However, the gap was more pronounced in the captioning task. Our dataset also provides a valuable resource for training. Fine-tuning the Qwen2-VL-2B model with our task-specific multimodal training data improved its multiple-choice accuracy to a level comparable to GPT-4o, though captioning remains challenging. Continuous pre-training of MLLMs using our interleaved article and figure data enhanced their material generation capabilities, demonstrating potential for integrating scientific knowledge. The dataset and benchmarks will be released to support further research.

Essay on "MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension"

This paper presents MMSci, a novel dataset meticulously curated to facilitate the evaluation and enhancement of Large Multimodal Models (LMMs) in comprehending advanced, multimodal scientific literature. This dataset encompasses peer-reviewed articles and figures from 72 distinct scientific disciplines, making it both diverse and robust for rigorous assessments of LMM capabilities.

The motivation for MMSci stems from the rapid advancements in LLMs and LMMs, which, while successful at elementary to undergraduate-level tasks, often falter when tasked with understanding PhD-level scientific content. MMSci addresses this gap by providing not only a challenging evaluation benchmark but also substantial training resources to enhance model performance.

Dataset and Benchmark Construction

The MMSci dataset was gathered from high-quality, open-access articles published in the journal Nature Communications, ensuring authenticity and scholarly reliability. The dataset spans five major subject categories, covering fields such as materials science, ecology, and molecular biology. The collected data, which includes titles, abstracts, full article texts, figures, and captions, was processed with careful regular-expression matching to accurately segment sub-figures and their corresponding sub-captions from complex multi-panel figures.
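The paper's exact parsing rules are not reproduced here, but a minimal sketch of regular-expression-based sub-caption segmentation, assuming panel labels of the form "a, …", "b, …" (an illustrative convention, not necessarily the authors' actual pattern), could look like this:

```python
import re

# Hypothetical multi-panel caption; label styles ("a,", "(a)", "a ") vary across
# Nature Communications articles, so a real pipeline needs several patterns.
caption = (
    "Fig. 2 Device characterization. a, Schematic of the heterostructure. "
    "b, Optical micrograph of the device. c, Raman spectra at 300 K."
)

# Split at whitespace that follows sentence-ending punctuation and precedes
# a single lowercase panel label followed by a comma.
parts = re.split(r"(?<=[.?!])\s+(?=[a-z],\s)", caption)

sub_captions = {}
for part in parts:
    match = re.match(r"^([a-z]),\s*(.+)$", part)
    if match:  # skip the figure-level preamble, keep labeled panels
        sub_captions[match.group(1)] = match.group(2)

print(sub_captions)
# {'a': 'Schematic of the heterostructure.',
#  'b': 'Optical micrograph of the device.',
#  'c': 'Raman spectra at 300 K.'}
```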

In addition to the dataset, a comprehensive benchmark was constructed to evaluate LMMs rigorously. This benchmark comprises two primary tasks: Scientific Figure Captioning and Visual Question Answering (VQA), each with multiple settings to test various aspects of model comprehension:

  • Ungrounded, Abstract-grounded, and Full-content-grounded figure captioning: Models generate captions with varying degrees of contextual information.
  • Multiple-choice VQA settings: Models select the correct caption or sub-caption for a given figure, testing their understanding of both figures and context (a prompt-construction sketch follows this list).
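The exact prompt templates are not given here; the following is a minimal sketch, assuming a generic chat-style MLLM interface, of how an abstract-grounded captioning prompt and a multiple-choice caption-matching question might be constructed (the wording, option labels, and helper names are assumptions for illustration):

```python
import random

def build_captioning_prompt(abstract: str | None = None) -> str:
    """Abstract-grounded captioning prompt; pass None for the ungrounded setting."""
    context = f"Paper abstract:\n{abstract}\n\n" if abstract else ""
    return (
        f"{context}Write a caption for the attached scientific figure, "
        "describing each panel and the quantities shown."
    )

def build_multiple_choice_prompt(candidate_captions: list[str]) -> tuple[str, list[str]]:
    """Caption-matching question: pick the caption that describes the attached figure."""
    options = list(candidate_captions)
    random.shuffle(options)  # shuffle to avoid position bias
    lettered = [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    prompt = (
        "Which of the following captions correctly describes the attached figure?\n"
        + "\n".join(lettered)
        + "\nAnswer with the letter only."
    )
    return prompt, options
```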

Evaluation Results

The evaluation of prevalent open-source and proprietary LMMs reveals significant insights:

  • Scientific Figure Captioning: GPT-4o, when given the full article context, achieved the best METEOR and ROUGE scores, highlighting the necessity of comprehensive context for accurate figure interpretation. Open-source models such as LLaVA-Next performed markedly worse, underscoring the challenges inherent in this task (see the scoring sketch after this list).
  • VQA Performance: Significant disparities were evident between models, with proprietary models (e.g., GPT-4V, GPT-4o) outperforming open-source counterparts, particularly when employing Chain-of-Thought (CoT) reasoning, which enhanced model accuracy by a substantial margin.
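For reference, the snippet below shows how such results are typically scored: METEOR and ROUGE-L for generated captions (via nltk and the rouge-score package) and plain accuracy for the multiple-choice task. It is a generic sketch, not the authors' evaluation code.

```python
# pip install nltk rouge-score
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet synonyms

def caption_scores(reference: str, prediction: str) -> dict:
    """METEOR and ROUGE-L F1 for one generated figure caption."""
    meteor = meteor_score([reference.split()], prediction.split())
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    return {"meteor": meteor, "rougeL": rouge_l}

def multiple_choice_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of correct options; chance level is 1 / number_of_options."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

print(caption_scores("Raman spectra of the device at 300 K.",
                     "Raman spectrum measured at room temperature."))
print(multiple_choice_accuracy(["A", "C", "B", "B"], ["A", "C", "D", "B"]))  # 0.75
```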

Training Resources and Enhancements

To address identified deficiencies, the authors explored the MMSci dataset as a training resource:

  • Visual Instruction-Following Data: Single- and multi-turn interactions that discuss figure content, reflecting real-world conversations about scientific figures (a sketch of the record layout follows this list).
  • Interleaved Text and Image Data for Pre-training: Articles and figures are interleaved to create a cohesive training corpus. Fine-tuning models on these data (e.g., a 7B LLaVA model) yielded performance comparable to proprietary models such as GPT-4V.
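The released data formats are not spelled out here; the sketch below shows, under the assumption of a LLaVA-style conversation schema, what one instruction-following record and one interleaved article–figure record might look like (field names and file paths are illustrative, not the actual MMSci schema):

```python
# Hypothetical record layouts; the actual MMSci schema may differ.
instruction_record = {
    "image": "figures/ncomms_12345_fig2.png",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat does panel b of this figure show?"},
        {"from": "gpt", "value": "Panel b shows an optical micrograph of the fabricated device."},
        {"from": "human", "value": "Why is the Raman spectrum in panel c included?"},
        {"from": "gpt", "value": "It confirms the layer identity of the flake at room temperature."},
    ],
}

# Interleaved pre-training sample: article text with figure placeholders kept in
# reading order, so the model sees each figure in its original textual context.
interleaved_record = {
    "text": "...growth conditions are summarized in <image>, as discussed below...",
    "images": ["figures/ncomms_12345_fig1.png"],
}
```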

Case Study on Material Generation

A highlight of the paper is the case study demonstrating the efficacy of continuous pre-training on MMSci. With this approach, the LLaMA2-7B model showed improved stability and validity when generating novel crystal structures, an essential task in materials science. This signifies the benefit of scientifically enriched training data, infusing the model with domain-specific knowledge that enhances its generative capabilities.
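As context for what "validity" typically means in this setting, the sketch below uses pymatgen to check whether a model-generated CIF string parses into a sensible periodic structure; this is a common proxy check and an assumption about the evaluation style, not the paper's actual pipeline (stability assessment additionally requires energy calculations, omitted here).

```python
# pip install pymatgen
from pymatgen.core import Structure

def is_parsable_structure(cif_text: str) -> bool:
    """Validity proxy: the generated CIF parses into a non-empty periodic cell."""
    try:
        structure = Structure.from_str(cif_text, fmt="cif")
    except Exception:
        return False  # malformed CIF text counts as invalid
    return len(structure) > 0 and structure.lattice.volume > 0
```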

Implications and Future Directions

The implications of this research are manifold. Practically, MMSci enables the development of more capable and reliable AI assistants for scientific research, potentially automating parts of the research process such as literature review and data analysis. Theoretically, it provides insights into the integration of multimodal data within AI systems, furthering our understanding of how these systems can interpret and generate scientific content.

Future research directions could involve expanding the dataset to include more diverse forms of scientific content, such as supplementary materials and experimental datasets, or refining the evaluation metrics to capture nuanced aspects of model performance. The development of methods to seamlessly integrate multimodal pre-training with downstream task fine-tuning will also be pivotal.

Conclusion

MMSci stands as a significant contribution to the field of scientific AI, providing both a rigorous evaluation benchmark and valuable training resources. It bridges the gap in current model evaluations by focusing on PhD-level content and offers a path towards enhancing LMM capabilities in comprehending complex scientific literature. This work underscores the necessity of context-rich, diverse datasets in developing advanced AI solutions for academic and scientific endeavors.

Authors (14)
  1. Zekun Li (73 papers)
  2. Xianjun Yang (37 papers)
  3. Kyuri Choi (2 papers)
  4. Wanrong Zhu (30 papers)
  5. Ryan Hsieh (2 papers)
  6. HyeonJung Kim (3 papers)
  7. Jin Hyuk Lim (2 papers)
  8. Sungyoung Ji (2 papers)
  9. Byungju Lee (14 papers)
  10. Xifeng Yan (52 papers)
  11. Linda Ruth Petzold (5 papers)
  12. Stephen D. Wilson (166 papers)
  13. Woosang Lim (6 papers)
  14. William Yang Wang (254 papers)
Citations (4)