MatQnA: Multi-Modal Benchmark Dataset
- MatQnA is the first multi-modal benchmark dataset designed to assess LLMs’ capability in interpreting experimental materials data from text and image inputs.
- It compiles nearly 5,000 QA pairs derived from more than 400 peer-reviewed articles, supporting rigorous evaluation of models on materials characterization tasks.
- The dataset facilitates applications like property prediction and AI-driven materials discovery by offering a standardized resource for model validation.
MatQnA is the first multi-modal benchmark dataset designed to systematically evaluate the capabilities of LLMs in materials characterization and analysis. It provides a standardized resource for testing whether state-of-the-art multi-modal models can interpret experimental materials data and extract domain-specific knowledge by integrating both text and image modalities. Derived from over 400 peer-reviewed journal articles and expert case studies, MatQnA enables rigorous assessments of AI systems in supporting materials research workflows, from property prediction to materials discovery. The dataset is publicly accessible at https://huggingface.co/datasets/richardhzgg/matQnA (Weng et al., 14 Sep 2025).
1. Design Objectives and Scope
MatQnA targets comprehensive validation of LLMs in the specialized domain of materials characterization. The dataset was constructed to fill a critical gap in AI benchmarking, focusing on the deeper scientific reasoning associated with experimental data interpretation. Its primary aim is to evaluate model performance in realistic materials scenarios that require understanding technical concepts and integrating image and text information.
Curated from a diverse corpus—including materials science literature and expert analyses—the dataset covers a wide spectrum of characterization techniques. The questions are designed around experimental figures, spectral patterns, microscopy images, and domain-specific data tables, reflecting the complexity encountered in scientific practice.
2. Covered Characterization Techniques
MatQnA focuses on ten major methods central to materials science, each of which presents distinct multi-modal challenges:
| Technique | Key Analytical Focus | Modality |
|---|---|---|
| XPS | Chemical state, elemental composition, peak assignment | Image, Text |
| XRD | Crystal structure, phase identification, crystallite size | Image, Text |
| SEM | Surface morphology, defects | Image |
| TEM | Internal lattice, microstructure | Image |
| AFM | 3D topography, roughness | Image |
| DSC | Thermal transitions, enthalpy | Chart |
| TGA | Decomposition, thermal stability | Chart |
| FTIR | Chemical bonds, vibrational modes | Spectrum |
| Raman | Molecular vibration, phase composition | Spectrum |
| XAFS | Local atomic environment, oxidation states | Spectrum |
Quantitative analysis tasks commonly appear, such as use of the Scherrer equation for XRD crystallite-size estimation, D = Kλ / (β cos θ), where D is the crystallite size, K a dimensionless shape factor (commonly ≈0.9), λ the X-ray wavelength, β the peak width (FWHM, in radians), and θ the Bragg angle.
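As a worked illustration of this formula, the short sketch below applies the Scherrer equation to hypothetical XRD peak parameters; the numerical values are illustrative and not drawn from the dataset.

```python
import math

def scherrer_crystallite_size(wavelength_nm: float, fwhm_deg: float,
                              two_theta_deg: float, k: float = 0.9) -> float:
    """Estimate crystallite size D = K*lambda / (beta * cos(theta)).

    wavelength_nm: X-ray wavelength (0.15406 nm for Cu K-alpha)
    fwhm_deg:      peak width (FWHM) in degrees 2-theta
    two_theta_deg: peak position in degrees 2-theta
    k:             dimensionless shape factor (commonly ~0.9)
    """
    beta = math.radians(fwhm_deg)            # FWHM converted to radians
    theta = math.radians(two_theta_deg / 2)  # Bragg angle is half of 2-theta
    return k * wavelength_nm / (beta * math.cos(theta))

# Hypothetical peak: Cu K-alpha radiation, 2-theta = 38.2 deg, FWHM = 0.4 deg
print(f"D = {scherrer_crystallite_size(0.15406, 0.4, 38.2):.1f} nm")  # ~21 nm
```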
Each technique’s section contains domain-relevant figures (e.g., XPS binding energy plots, SEM micrographs) paired with structured questions.
3. Dataset Construction Methodology
The MatQnA dataset was assembled through a hybrid methodology combining automated LLM-based question generation and expert human validation:
- Source Extraction: Raw data are primarily PDFs from journal articles and domain case reports, preprocessed using PDF Craft to isolate relevant text, images, and figure captions.
- Automated QA Generation: Structured prompt templates and OpenAI's GPT-4.1 API are used to draft multi-format (objective and open-ended) questions. Automatic coreference handling and context enforcement ensure clarity, especially for image-based queries.
- Human Validation: Domain experts review, filter, and correct the generated QA pairs for terminological precision and logical relevance. Regex-based methods help enforce answer self-containment.
- Dataset Structure: Data are organized by characterization technique, resulting in ~5,000 QA pairs (2,749 subjective, 2,219 objective) stored in Parquet format. Each entry is explicitly linked to its associated technique.
This interplay between automated and manual processes suggests robust quality control and contextual fidelity, both critical for high-stakes scientific evaluation; a sketch of the automated steps follows.
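A minimal sketch of the generation and self-containment checks, assuming the OpenAI Python SDK; the prompt wording, output handling, and regex pattern are illustrative assumptions rather than the authors' exact pipeline.

```python
import re
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are a materials characterization expert. From the following figure "
    "caption and surrounding text, draft one multiple-choice question with "
    "four options (A-D) and state the correct answer.\n\n"
    "Technique: {technique}\nContext: {context}"
)

def draft_mcq(technique: str, context: str) -> str:
    """Draft one objective QA item from extracted article context."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(technique=technique,
                                              context=context),
        }],
    )
    return response.choices[0].message.content

def is_self_contained(question: str) -> bool:
    """Regex filter rejecting questions with unresolved external references."""
    return not re.search(r"\b(the paper|aforementioned|as shown above)\b",
                         question, flags=re.IGNORECASE)
```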
4. Question Formats and Evaluation Procedures
MatQnA’s QA pairs are divided into two main categories:
- Multiple-Choice Questions (MCQs): Objective, closed-form items designed for unambiguous grading. MCQs focus on factual recognition, calculation, or discrete judgment from presented experimental data.
- Subjective Questions: Open-ended prompts requiring detailed explanation, justification, or synthesis. Subjective evaluation emphasizes models’ ability to express scientific reasoning and communicate technical concepts.
Both formats are used to diagnose model competence in image interpretation, quantitative analysis, and domain-specific nomenclature.
Scoring protocols for objective questions are standardized; subjective items rely on expert rubric review. Preliminary results on objective tasks reveal nearly 90% accuracy for top models such as GPT-4.1 (89.8%) and Claude Sonnet 4, with technique-specific performance ranging from 83.9% (AFM) to 95%+ (FTIR, Raman).
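A minimal sketch of automated grading for the objective items; the record fields (`answer`, `prediction`) and the letter-extraction rule are assumptions for illustration, not the benchmark's published protocol.

```python
import re

def extract_choice(text: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a model response."""
    match = re.search(r"\b([A-D])\b", text.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(records: list[dict]) -> float:
    """Fraction of records whose predicted letter matches the gold answer."""
    correct = sum(
        extract_choice(r["prediction"]) == r["answer"].strip().upper()
        for r in records
    )
    return correct / len(records)

# Illustrative usage with hypothetical records:
records = [
    {"answer": "B", "prediction": "The correct option is B."},
    {"answer": "D", "prediction": "A"},
]
print(f"accuracy = {mcq_accuracy(records):.2f}")  # -> 0.50
```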
5. Model Performance and Analytical Insights
State-of-the-art multi-modal LLMs—including GPT-4.1, Claude, Gemini 2.5, and Doubao Vision Pro 32K—demonstrate strong proficiency in materials data interpretation:
- Nearly 90% accuracy in MCQ-based evaluation across all techniques.
- The highest scores are in spectroscopic characterization (FTIR, Raman).
- Lower performance in techniques requiring spatial reasoning or 3D topology analysis (e.g., AFM).
- Heatmap analyses across 31 subcategories confirm systematic strengths and weaknesses.
This suggests that while current models are highly adept at standard data interpretation, certain modalities, especially those demanding spatial reasoning, still require further algorithmic innovation. A sketch of the kind of subcategory heatmap used in this analysis follows.
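A minimal plotting sketch of a model-by-subcategory accuracy heatmap, assuming matplotlib; the model names appear in the paper, but the subcategory labels and accuracy values below are hypothetical placeholders, not reported results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical accuracy matrix: rows = models, columns = subcategories.
models = ["GPT-4.1", "Claude Sonnet 4", "Gemini 2.5"]
subcats = ["XPS: peaks", "XRD: phase", "AFM: roughness", "FTIR: bonds"]
acc = np.array([[0.91, 0.93, 0.84, 0.96],
                [0.90, 0.92, 0.83, 0.95],
                [0.88, 0.90, 0.81, 0.94]])

fig, ax = plt.subplots(figsize=(6, 2.5))
im = ax.imshow(acc, cmap="viridis", vmin=0.7, vmax=1.0)
ax.set_xticks(range(len(subcats)), labels=subcats, rotation=30, ha="right")
ax.set_yticks(range(len(models)), labels=models)
fig.colorbar(im, ax=ax, label="accuracy")
fig.tight_layout()
plt.show()
```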
6. Scientific and Applied Impact
MatQnA provides a resource for diverse applications:
- Benchmarking and Model Selection: Rigorous, standardized foundation for evaluating LLMs in materials science.
- Workflow Integration: Enables AI-assisted materials discovery, property prediction, and experimental support.
- Domain-Specific Model Development: Facilitates targeted fine-tuning and robust analysis of multi-modal AI systems.
- Interdisciplinary Expansion: Demonstrates feasibility of extending LLM-based evaluation frameworks to other specialized fields.
A plausible implication is acceleration of AI deployment in laboratory environments, supporting both routine interpretation and advanced discovery tasks.
7. Access and Ongoing Development
MatQnA is freely available to the research community through the Hugging Face repository: https://huggingface.co/datasets/richardhzgg/matQnA. Researchers are encouraged to utilize, evaluate, and iteratively improve the dataset. The presence of robust validation and comprehensive coverage positions MatQnA as a reference standard for future work in multi-modal AI benchmarking within scientific domains.
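A minimal loading sketch using the Hugging Face `datasets` library; the repository path comes from the paper, while the split name and any column names are assumptions to verify against the actual schema.

```python
from datasets import load_dataset  # pip install datasets

# Load the Parquet-backed benchmark directly from the Hugging Face Hub.
# The "train" split name is an assumption; inspect the repo for actual splits.
ds = load_dataset("richardhzgg/matQnA", split="train")

print(len(ds))          # total number of QA pairs
print(ds.column_names)  # inspect the real schema before relying on fields
print(ds[0])            # first QA record
```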
Conclusion
MatQnA establishes the first systematic, multi-modal benchmark in materials characterization for LLMs, combining automated QA generation, expert validation, and comprehensive scientific coverage. It enables precise evaluation of AI capabilities in interpreting experimental data and supports broader efforts in AI-driven scientific research (Weng et al., 14 Sep 2025).