A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools (2506.20743v1)

Published 25 Jun 2025 in cs.LG and cs.CE

Abstract: Foundation models (FMs) are catalyzing a transformative shift in materials science (MatSci) by enabling scalable, general-purpose, and multimodal AI systems for scientific discovery. Unlike traditional machine learning models, which are typically narrow in scope and require task-specific engineering, FMs offer cross-domain generalization and exhibit emergent capabilities. Their versatility is especially well-suited to materials science, where research challenges span diverse data types and scales. This survey provides a comprehensive overview of foundation models, agentic systems, datasets, and computational tools supporting this growing field. We introduce a task-driven taxonomy encompassing six broad application areas: data extraction, interpretation and Q&A; atomistic simulation; property prediction; materials structure, design and discovery; process planning, discovery, and optimization; and multiscale modeling. We discuss recent advances in both unimodal and multimodal FMs, as well as emerging LLM agents. Furthermore, we review standardized datasets, open-source tools, and autonomous experimental platforms that collectively fuel the development and integration of FMs into research workflows. We assess the early successes of foundation models and identify persistent limitations, including challenges in generalizability, interpretability, data imbalance, safety concerns, and limited multimodal fusion. Finally, we articulate future research directions centered on scalable pretraining, continual learning, data governance, and trustworthiness.

Summary

The paper demonstrates how foundation models provide a versatile framework for simulating materials properties and designing novel materials.
It details the integration of LLM agents to automate experimental workflows and optimize process planning in materials research.
The study identifies key dataset limitations and computational challenges, urging collaborative efforts to advance AI-driven materials science.

Introduction to AI in Materials Science

The advent of AI has catalyzed a transformative shift in materials science (MatSci), particularly through foundation models (FMs). These models offer scalable, general-purpose, and multimodal AI systems that go beyond traditional task-specific machine learning approaches. Their versatility suits the diverse research challenges of materials science, which span many data types and scales. The survey summarized here gives a broad overview of foundation models, LLM agents, datasets, and computational tools in this evolving domain.

Figure 1: Overview of our survey of AI for materials science (AI4MS), highlighting common tasks, categories of foundation models, datasets, tools and infrastructures, as well as key discussions on early successes, current limitations, challenges, and future directions.

Foundation Models: Revolutionizing Materials Science

Types and Applications of Foundation Models

Foundation Models (FMs) are large-scale, pretrained models that generalize across diverse downstream tasks, facilitating cross-domain applications with minimal fine-tuning. Key application areas within materials science include:

Data Extraction, Interpretation, and Q&A: FMs streamline the extraction of structured data from scientific literature, enabling knowledge graph construction from unstructured sources such as research papers and patents.
Atomistic Simulation: FMs trained on extensive datasets serve as universal simulators, offering near-DFT accuracy in predicting energies and forces across a variety of chemical systems.
Property Prediction: These models predict electronic, mechanical, thermal, optical, and chemical properties based on structural data, extending their capabilities across conventional boundaries within materials domains.
Materials Structure, Design, and Discovery: FMs empower generative design by learning inverse relationships between structures and properties, optimizing material design processes for specific attributes or objectives.
Process Planning and Optimization: Innovations such as autonomous laboratories illustrate how FMs are used to automate and optimize experimental procedures, guiding synthesis and other laboratory operations under real-world constraints.
Multiscale Modeling: Beyond atomic interactions, FMs hold potential for modeling behavior across scales, uniting atomistic insights with macroscopic performance metrics.
Figure 2: An illustrative example of the interplay of foundation models for materials science with data types and modalities.
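
To make the atomistic simulation item above more concrete, the following minimal sketch shows the energy-and-force interface that an interatomic potential exposes. It is an illustration only: the Lennard-Jones pair potential is a classical textbook model used here as a stand-in, not a foundation model and not a method taken from the survey. The function names (lj_energy, lj_force, numerical_force) and the reduced units (epsilon = sigma = 1) are assumptions made for this example.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy E(r) = 4*eps*[(sigma/r)^12 - (sigma/r)^6].

    Toy stand-in for a learned potential; reduced units are assumed.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Analytic radial force F(r) = -dE/dr for the potential above."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def numerical_force(energy_fn, r, h=1e-6):
    """Central finite-difference estimate of -dE/dr."""
    return -(energy_fn(r + h) - energy_fn(r - h)) / (2.0 * h)


# Separation at which the pair energy is minimal (for sigma = 1).
r_min = 2.0 ** (1.0 / 6.0)

if __name__ == "__main__":
    print("E(r_min) =", lj_energy(r_min))
    print("F(r_min) =", lj_force(r_min))
    print("analytic vs numerical force at r = 1.2:",
          lj_force(1.2), numerical_force(lj_energy, 1.2))
```

At r_min the energy equals -epsilon and the force vanishes. The finite-difference comparison checks that the force is the negative derivative of the energy, the same consistency one would want from any model that predicts both quantities.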

Challenges and Limitations

While the successes of FMs in materials science are numerous, several issues persist, limiting their wider adoption and effectiveness:

Generalization and Interactivity: Adapting FMs to novel chemical spaces remains a challenge, as does integrating them with experimental outputs, particularly when physical laws and synthesis constraints must be respected.
Data Limitations: Limited availability and diversity of training data are major obstacles; current datasets are largely biased toward stable inorganic phases and underrepresent polymers and biomaterials.
Computational Resources: The demand for high computational power and infrastructure for training large-scale models poses accessibility issues, often limiting engagement to well-funded research institutions.
Figure 3: Development of AI in materials science over time: foundation models, datasets, and tools and infrastructure.

Emerging Role of LLM-Based Agents

LLM-based agents represent the next step in integrating AI into materials science. These agents use the reasoning capabilities of LLMs to automate discovery and experimental workflows. Prominent examples include HoneyComb, LLMatDesign, and MatAgent, all of which demonstrate advances in autonomous materials discovery and synthesis planning. Nonetheless, these systems face challenges akin to those of FM applications, compounded by additional concerns about biosafety, experimental validity, and the integration of human oversight.
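
As a rough illustration of the workflow described above, the sketch below shows a generic propose, evaluate, and approve loop with an explicit human-oversight gate. It does not reproduce how HoneyComb, LLMatDesign, or MatAgent work. Every name in it (Candidate, run_agent_loop, the stub functions) is hypothetical, the formula labels are placeholders rather than real materials, and the scores are made-up values that exist only so the example runs without an LLM or a property model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    formula: str            # placeholder label, not a real material
    predicted_score: float  # higher is better; arbitrary units


def run_agent_loop(
    propose: Callable[[List[Candidate]], Candidate],
    evaluate: Callable[[str], float],
    approve: Callable[[Candidate], bool],
    max_steps: int = 5,
) -> Optional[Candidate]:
    """Return the best-scoring candidate that passed the approval gate."""
    history: List[Candidate] = []
    best: Optional[Candidate] = None
    for _ in range(max_steps):
        proposal = propose(history)  # stands in for an LLM call
        proposal.predicted_score = evaluate(proposal.formula)  # stands in for a property model
        history.append(proposal)
        if not approve(proposal):    # human-oversight gate: vetoed proposals are skipped
            continue
        if best is None or proposal.predicted_score > best.predicted_score:
            best = proposal
    return best


# Hypothetical stubs with made-up values (assumptions for illustration only).
_POOL = ["A2B", "AB", "AB3", "A3B"]
_SCORES = {"A2B": 0.2, "AB": 0.9, "AB3": 0.5, "A3B": 0.7}


def stub_propose(history: List[Candidate]) -> Candidate:
    """Cycle through the placeholder pool, one proposal per step."""
    return Candidate(_POOL[len(history) % len(_POOL)], 0.0)


def stub_evaluate(formula: str) -> float:
    """Look up the made-up score for a placeholder formula."""
    return _SCORES[formula]


def stub_approve(candidate: Candidate) -> bool:
    """Pretend a human reviewer vetoes the candidate labeled 'AB'."""
    return candidate.formula != "AB"


best = run_agent_loop(stub_propose, stub_evaluate, stub_approve, max_steps=4)
```

With these stubs, the highest-scoring proposal ("AB") is vetoed by the approval gate, so the loop returns "A3B" instead. Keeping the gate as a separate callable is one simple way to make the oversight step explicit and replaceable.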

Datasets and Tools for Foundation Models

The development of foundation models relies on expansive, high-quality datasets that span a variety of material types as well as computational and experimental modalities. Datasets such as the Materials Project, Open Catalyst 2020, and QM9 offer rich repositories of atomic structures and properties. Complementing these datasets are tools like Pymatgen and Open MatSci ML Toolkit, which support data processing and model development. However, improving access to comprehensive and diverse datasets remains critical for advancing the scalability and practicality of AI-driven solutions in materials science.
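
The kind of data processing these tools handle can be sketched with a small, dependency-free example: turning a chemical formula into element amounts and atomic fractions, a common first step before featurizing a material for a model. This is not Pymatgen's implementation (Pymatgen provides a much more complete Composition class for the same purpose). The function names parse_formula and atomic_fractions are assumptions, and the parser deliberately supports only flat formulas without parentheses.

```python
import re
from collections import Counter
from typing import Dict

# One element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula: str) -> Dict[str, float]:
    """Parse a flat formula such as 'Fe2O3' into element amounts.

    Parentheses, charges, and hydrates are not supported in this sketch.
    """
    amounts: Counter = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:  # unparsed characters between tokens
            raise ValueError(f"cannot parse formula: {formula!r}")
        element, count = match.groups()
        amounts[element] += float(count) if count else 1.0
        pos = match.end()
    if pos != len(formula) or not amounts:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return dict(amounts)


def atomic_fractions(formula: str) -> Dict[str, float]:
    """Normalize element amounts so that they sum to 1."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    return {element: n / total for element, n in amounts.items()}


if __name__ == "__main__":
    print(parse_formula("Fe2O3"))
    print(atomic_fractions("Fe2O3"))
```

Rejecting formulas the parser does not understand, rather than silently skipping characters, keeps malformed entries from slipping into a dataset unnoticed.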

Conclusion

The survey underscores the transformative potential of foundation models and LLM agents in redefining materials science research. While substantial progress has been made, fully integrating AI into materials science requires addressing computational, data, and methodological limitations. Collaborative efforts to democratize access to data and computational resources, alongside refinements to model architectures and training paradigms, will be pivotal in advancing this interdisciplinary frontier. These efforts should ultimately foster more robust, versatile, and widely applicable AI systems that markedly improve materials design and discovery.