MatQnA: Multi-Modal Benchmark Dataset
- MatQnA is the first multi-modal benchmark dataset designed to assess LLMs’ capability in interpreting experimental materials data from text and image inputs.
- It compiles nearly 5,000 QA pairs derived from more than 400 peer-reviewed articles, supporting rigorous evaluation of models on materials characterization tasks.
- The dataset facilitates applications like property prediction and AI-driven materials discovery by offering a standardized resource for model validation.
MatQnA is the first multi-modal benchmark dataset designed to systematically evaluate the capabilities of LLMs in materials characterization and analysis. It provides a standardized resource for testing whether state-of-the-art multi-modal models can interpret experimental materials data and extract domain-specific knowledge by integrating both text and image modalities. Derived from over 400 peer-reviewed journal articles and expert case studies, MatQnA enables rigorous assessments of AI systems in supporting materials research workflows, from property prediction to materials discovery. The dataset is publicly accessible at https://huggingface.co/datasets/richardhzgg/matQnA (Weng et al., 14 Sep 2025).
1. Design Objectives and Scope
MatQnA targets comprehensive validation of LLMs in the specialized domain of materials characterization. The dataset was constructed to fill a critical gap in AI benchmarking, focusing on the deeper scientific reasoning associated with experimental data interpretation. Its primary aim is to evaluate model performance in realistic materials scenarios that require understanding technical concepts and integrating image and text information.
Curated from a diverse corpus—including materials science literature and expert analyses—the dataset covers a wide spectrum of characterization techniques. The questions are designed around experimental figures, spectral patterns, microscopy images, and domain-specific data tables, reflecting the complexity encountered in scientific practice.
2. Covered Characterization Techniques
MatQnA focuses on ten major methods central to materials science, each of which presents distinct multi-modal challenges:
| Technique | Key Analytical Focus | Modality |
|---|---|---|
| XPS | Chemical state, elemental composition, peak assignment | Image, Text |
| XRD | Crystal structure, phase identification, crystallite size | Image, Text |
| SEM | Surface morphology, defects | Image |
| TEM | Internal lattice, microstructure | Image |
| AFM | 3D topography, roughness | Image |
| DSC | Thermal transitions, enthalpy | Chart |
| TGA | Decomposition, thermal stability | Chart |
| FTIR | Chemical bonds, vibrational modes | Spectrum |
| Raman | Molecular vibration, phase composition | Spectrum |
| XAFS | Local atomic environment, oxidation states | Spectrum |
Quantitative analysis tasks commonly appear, such as use of the Scherrer equation for XRD crystallite-size estimation, D = Kλ / (β cos θ), where D is the crystallite size, K a dimensionless shape factor (commonly ≈0.9), λ the X-ray wavelength, β the peak width (FWHM, in radians), and θ the Bragg angle.
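As a worked illustration of this formula, the short sketch below applies the Scherrer equation to hypothetical XRD peak parameters; the numerical values are illustrative and not drawn from the dataset.

```python
import math

def scherrer_crystallite_size(wavelength_nm: float, fwhm_deg: float,
                              two_theta_deg: float, k: float = 0.9) -> float:
    """Estimate crystallite size D = K*lambda / (beta * cos(theta)).

    wavelength_nm: X-ray wavelength (0.15406 nm for Cu K-alpha)
    fwhm_deg:      peak width (FWHM) in degrees 2-theta
    two_theta_deg: peak position in degrees 2-theta
    k:             dimensionless shape factor (commonly ~0.9)
    """
    beta = math.radians(fwhm_deg)            # FWHM converted to radians
    theta = math.radians(two_theta_deg / 2)  # Bragg angle is half of 2-theta
    return k * wavelength_nm / (beta * math.cos(theta))

# Hypothetical peak: Cu K-alpha radiation, 2-theta = 38.2 deg, FWHM = 0.4 deg
print(f"D = {scherrer_crystallite_size(0.15406, 0.4, 38.2):.1f} nm")  # ~21 nm
```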
Each technique’s section contains domain-relevant figures (e.g., XPS binding energy plots, SEM micrographs) paired with structured questions.
3. Dataset Construction Methodology
The MatQnA dataset was assembled through a hybrid methodology combining automated LLM-based question generation and expert human validation:
- Source Extraction: Raw data are primarily PDFs from journal articles and domain case reports, preprocessed using PDF Craft to isolate relevant text, images, and figure captions.
- Automated QA Generation: Structured prompt templates and OpenAI's GPT-4.1 API are used to draft multi-format (objective and open-ended) questions. Automatic coreference handling and context enforcement ensure clarity, especially for image-based queries.
- Human Validation: Domain experts review, filter, and correct the generated QA pairs for terminological precision and logical relevance. Regex-based methods help enforce answer self-containment.
- Dataset Structure: Data are organized by characterization technique, resulting in ~5,000 QA pairs (2,749 subjective, 2,219 objective) stored in Parquet format. Each entry is explicitly linked to its associated technique.
This interplay between automated and manual processes suggests robust quality control and contextual fidelity, both critical for high-stakes scientific evaluation; a sketch of the automated steps follows.
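A minimal sketch of the generation and self-containment checks, assuming the OpenAI Python SDK; the prompt wording, output handling, and regex pattern are illustrative assumptions rather than the authors' exact pipeline.

```python
import re
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are a materials characterization expert. From the following figure "
    "caption and surrounding text, draft one multiple-choice question with "
    "four options (A-D) and state the correct answer.\n\n"
    "Technique: {technique}\nContext: {context}"
)

def draft_mcq(technique: str, context: str) -> str:
    """Draft one objective QA item from extracted article context."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(technique=technique,
                                              context=context),
        }],
    )
    return response.choices[0].message.content

def is_self_contained(question: str) -> bool:
    """Regex filter rejecting questions with unresolved external references."""
    return not re.search(r"\b(the paper|aforementioned|as shown above)\b",
                         question, flags=re.IGNORECASE)
```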
4. Question Formats and Evaluation Procedures
MatQnA’s QA pairs are divided into two main categories:
- Multiple-Choice Questions (MCQs): Objective, closed-form items designed for unambiguous grading. MCQs focus on factual recognition, calculation, or discrete judgment from presented experimental data.
- Subjective Questions: Open-ended prompts requiring detailed explanation, justification, or synthesis. Subjective evaluation emphasizes models’ ability to express scientific reasoning and communicate technical concepts.
Both formats are used to diagnose model competence in image interpretation, quantitative analysis, and domain-specific nomenclature.
Scoring protocols for objective questions are standardized; subjective items rely on expert rubric review. Preliminary results on objective tasks reveal nearly 90% accuracy for top models such as GPT-4.1 (89.8%) and Claude Sonnet 4, with technique-specific performance ranging from 83.9% (AFM) to 95%+ (FTIR, Raman).
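A minimal sketch of automated grading for the objective items; the record fields (`answer`, `prediction`) and the letter-extraction rule are assumptions for illustration, not the benchmark's published protocol.

```python
import re

def extract_choice(text: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a model response."""
    match = re.search(r"\b([A-D])\b", text.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(records: list[dict]) -> float:
    """Fraction of records whose predicted letter matches the gold answer."""
    correct = sum(
        extract_choice(r["prediction"]) == r["answer"].strip().upper()
        for r in records
    )
    return correct / len(records)

# Illustrative usage with hypothetical records:
records = [
    {"answer": "B", "prediction": "The correct option is B."},
    {"answer": "D", "prediction": "A"},
]
print(f"accuracy = {mcq_accuracy(records):.2f}")  # -> 0.50
```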
5. Model Performance and Analytical Insights
State-of-the-art multi-modal LLMs—including GPT-4.1, Claude, Gemini 2.5, and Doubao Vision Pro 32K—demonstrate strong proficiency in materials data interpretation:
- Nearly 90% accuracy in MCQ-based evaluation across all techniques.
- The highest scores are in spectroscopic characterization (FTIR, Raman).
- Lower performance in techniques requiring spatial reasoning or 3D topology analysis (e.g., AFM).
- Heatmap analyses across 31 subcategories confirm systematic strengths and weaknesses.
This suggests that while current models are highly adept at standard data interpretation, certain modalities, especially those demanding spatial reasoning, still require further algorithmic innovation. A sketch of the kind of subcategory heatmap used in this analysis follows.
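A minimal plotting sketch of a model-by-subcategory accuracy heatmap, assuming matplotlib; the model names appear in the paper, but the subcategory labels and accuracy values below are hypothetical placeholders, not reported results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical accuracy matrix: rows = models, columns = subcategories.
models = ["GPT-4.1", "Claude Sonnet 4", "Gemini 2.5"]
subcats = ["XPS: peaks", "XRD: phase", "AFM: roughness", "FTIR: bonds"]
acc = np.array([[0.91, 0.93, 0.84, 0.96],
                [0.90, 0.92, 0.83, 0.95],
                [0.88, 0.90, 0.81, 0.94]])

fig, ax = plt.subplots(figsize=(6, 2.5))
im = ax.imshow(acc, cmap="viridis", vmin=0.7, vmax=1.0)
ax.set_xticks(range(len(subcats)), labels=subcats, rotation=30, ha="right")
ax.set_yticks(range(len(models)), labels=models)
fig.colorbar(im, ax=ax, label="accuracy")
fig.tight_layout()
plt.show()
```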
6. Scientific and Applied Impact
MatQnA provides a resource for diverse applications:
- Benchmarking and Model Selection: Rigorous, standardized foundation for evaluating LLMs in materials science.
- Workflow Integration: Enables AI-assisted materials discovery, property prediction, and experimental support.
- Domain-Specific Model Development: Facilitates targeted fine-tuning and robust analysis of multi-modal AI systems.
- Interdisciplinary Expansion: Demonstrates feasibility of extending LLM-based evaluation frameworks to other specialized fields.
A plausible implication is acceleration of AI deployment in laboratory environments, supporting both routine interpretation and advanced discovery tasks.
7. Access and Ongoing Development
MatQnA is freely available to the research community through the Hugging Face repository: https://huggingface.co/datasets/richardhzgg/matQnA. Researchers are encouraged to utilize, evaluate, and iteratively improve the dataset. The presence of robust validation and comprehensive coverage positions MatQnA as a reference standard for future work in multi-modal AI benchmarking within scientific domains.
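A minimal loading sketch using the Hugging Face `datasets` library; the repository path comes from the paper, while the split name and any column names are assumptions to verify against the actual schema.

```python
from datasets import load_dataset  # pip install datasets

# Load the Parquet-backed benchmark directly from the Hugging Face Hub.
# The "train" split name is an assumption; inspect the repo for actual splits.
ds = load_dataset("richardhzgg/matQnA", split="train")

print(len(ds))          # total number of QA pairs
print(ds.column_names)  # inspect the real schema before relying on fields
print(ds[0])            # first QA record
```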
Conclusion
MatQnA establishes the first systematic, multi-modal benchmark in materials characterization for LLMs, combining automated QA generation, expert validation, and comprehensive scientific coverage. It enables precise evaluation of AI capabilities in interpreting experimental data and supports broader efforts in AI-driven scientific research (Weng et al., 14 Sep 2025).