Materials Informatics Across the Length Scales

Published 20 Apr 2026 in cond-mat.mtrl-sci | (2604.18086v1)

Abstract: Materials informatics is increasingly used to support modelling, analysis and design across the length scales of materials science, from atomistic simulations to microstructural characterisation and continuum descriptions. Despite rapid progress, the reliability and transferability of these approaches vary strongly with scale. Here we survey data-driven methods at the nanoscale, mesoscale, and micro-to-continuum levels, highlighting established capabilities as well as unresolved challenges. Machine-learning interatomic potentials, mesoscale surrogate and operator-learning models, and learning-based analysis of experimental microstructures are discussed, with emphasis on data quality, uncertainty, interpretability, and cross-scale consistency. We further examine the role of data standards, ontologies, and emerging tools, such as autonomous laboratories, where they directly affect multiscale workflows. This perspective clarifies what can be considered reliable today and identifies key obstacles to the broader integration of materials informatics across scales.

Authors (11)

Summary

The paper reports that ML interatomic potentials achieve 3–5 orders of magnitude acceleration over DFT for supported nanoparticle systems while maintaining near first-principles accuracy.
It details surrogate modeling and deep learning architectures that accurately simulate mesoscale and micro-to-continuum phenomena.
The study emphasizes the need for interoperable ontologies and LLM-enabled agents to unify scale-specific data in materials science.

Materials Informatics Across the Length Scales: An Expert Synthesis

Introduction

The manuscript "Materials Informatics Across the Length Scales" (2604.18086) provides a systematic analysis of the application of ML and informatics methodologies throughout the spatial hierarchy of materials science, critically examining the nanoscale, mesoscale, and micro-to-continuum regimes. The authors dissect advances in data-driven modeling, highlight persistent obstacles in transferability and interpretability, and evaluate the emerging role of ontological frameworks and LLMs in bridging scale-specific communities. The review is anchored by detailed exemplars, rigorous cross-domain analysis, and forward-looking recommendations.

Nanoscale Informatics: Data-Driven Atomistic Modelling

At the nanoscale, the authors focus on ML interatomic potentials (MLIPs) as the main route for carrying quantum-mechanical accuracy into large, dynamic atomic systems. They emphasize symmetry-equivariant and high-dimensional neural network potentials that achieve near first-principles fidelity over long molecular dynamics trajectories, and they organize MLIPs into a four-generation taxonomy that distinguishes models by locality, treatment of electrostatics, and nonlocal charge transfer.
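To make the taxonomy concrete, the energy expressions below follow the way the second to fourth generations are commonly written in the MLIP literature. The summary itself gives no formulas, so the symbols are assumptions: $\mathbf{G}_i$ is a descriptor of the environment of atom $i$ within a cutoff radius, $q_i$ is an atomic charge, and $E_{\mathrm{elec}}$ is a long-range electrostatic energy. First-generation models, which apply only to low-dimensional systems, are omitted.

```latex
\begin{align}
  E^{(2\mathrm{G})} &= \sum_i E_i(\mathbf{G}_i)
    && \text{local atomic energies only} \\
  E^{(3\mathrm{G})} &= \sum_i E_i(\mathbf{G}_i)
    + E_{\mathrm{elec}}\bigl(\{q_i(\mathbf{G}_i)\}\bigr)
    && \text{charges depend on the local environment} \\
  E^{(4\mathrm{G})} &= \sum_i E_i(\mathbf{G}_i, q_i)
    + E_{\mathrm{elec}}\bigl(\{q_i\}\bigr)
    && \text{charges from a global charge equilibration}
\end{align}
```

In this reading, only the fourth generation lets a change far from atom $i$ alter $q_i$, which is what "nonlocal charge transfer" refers to.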

The integration of deep learning with high-resolution experimental modalities is also addressed. Automated deep-learning pipelines for HAADF-STEM imaging replace subjective, manual interpretation with robust, high-throughput localization and classification of atomic columns.

Figure 1: Deep-learning workflow for localization and classification of atomic columns in HAADF-STEM images, enabling quantitative analysis under challenging experimental conditions.

MLIPs are further scrutinized through their application to complex surface reconstructions (e.g., Si(111)-7×7) and size-dependent stability of nanoparticles, where traditional empirical potentials fail to recover energetics or capture nontrivial reconstructions. By contrast, state-of-the-art SOAP-GAP models enable correct energetic ordering and reproduce subtle electronic effects.

Figure 2: MLIP-based predictions for silicon surface reconstructions and Au nanoparticle melting illustrate the failure modes of classical force fields and the superior accuracy of MLIP approaches.

The need for explicit nonlocal electrostatics is assessed through a side-by-side comparison of short-range and electrostatics-augmented GAP models for ionic materials, using Li-ion migration in disordered solid-state electrolytes as the test case. The results indicate that purely local models suffice in bulk systems, whereas explicit treatment of long-range interactions is required in interfacial and defect-rich environments.
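A toy calculation can illustrate why a strictly local model is blind to long-range electrostatics. The sketch below is not the GAP comparison from the review: it simply sums pairwise Coulomb energies for point charges in a finite cluster, once with a hard cutoff (standing in for a short-range model) and once without. The function name, the cutoff value, and the two-charge example are assumptions chosen for illustration.

```python
import numpy as np

# e^2 / (4 * pi * eps0) expressed in eV * Angstrom
COULOMB_EV_ANGSTROM = 14.399645


def pair_coulomb_energy(positions, charges, cutoff=None):
    """Sum pairwise Coulomb energies (eV) for point charges in a finite cluster.

    positions: (N, 3) coordinates in Angstrom; charges: (N,) in units of e.
    If cutoff is given, pairs farther apart than cutoff are skipped, which
    mimics what a purely short-range model can see.
    """
    positions = np.asarray(positions, dtype=float)
    charges = np.asarray(charges, dtype=float)
    energy = 0.0
    for i in range(len(charges)):
        for j in range(i + 1, len(charges)):
            r = np.linalg.norm(positions[i] - positions[j])
            if cutoff is not None and r > cutoff:
                continue  # interaction is invisible to the local model
            energy += COULOMB_EV_ANGSTROM * charges[i] * charges[j] / r
    return energy


# Two opposite unit charges 10 Angstrom apart, with a hypothetical 6 Angstrom cutoff
positions = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
charges = [1.0, -1.0]
truncated = pair_coulomb_energy(positions, charges, cutoff=6.0)
full = pair_coulomb_energy(positions, charges)
print(truncated, full)
```

The truncated sum is exactly zero while the full sum is about -1.44 eV. Periodic systems would need an Ewald-type summation rather than this direct sum, which is outside the scope of the sketch.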

Figure 3: ML potentials with explicit nonlocal electrostatics capture phenomena inaccessible to purely local models, such as charge redistribution in polar slabs and defect-mediated ion transport.

Highlighted findings: MLIPs deliver 3–5 orders of magnitude acceleration over DFT for supported nanoparticle systems while retaining near first-principles accuracy. At the mesoscale, advanced surrogates show below 10% macro-average relative error in long-horizon rollout predictions for realistic phenomena.
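The two headline metrics can be computed as sketched below. The summary does not define "macro-average relative error", so this assumes one plausible reading: a relative L2 error per test case, averaged with equal weight across cases. The function names and the sample numbers are illustrative, not values from the paper.

```python
import numpy as np


def relative_l2_error(pred, ref):
    """Relative L2 error ||pred - ref|| / ||ref|| for one test case."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.linalg.norm(pred - ref) / np.linalg.norm(ref))


def macro_average_relative_error(preds, refs):
    """Unweighted mean of per-case relative errors (assumed definition)."""
    errors = [relative_l2_error(p, r) for p, r in zip(preds, refs)]
    return float(np.mean(errors))


def speedup_orders_of_magnitude(t_reference, t_model):
    """log10 of the wall-time ratio between a reference method and a model."""
    return float(np.log10(t_reference / t_model))


# Illustrative inputs only
refs = [np.array([3.0, 4.0]), np.array([0.0, 10.0])]
preds = [np.array([3.0, 4.5]), np.array([0.0, 9.0])]
print(macro_average_relative_error(preds, refs))
print(speedup_orders_of_magnitude(1000.0, 1.0))
```

Because each case contributes equally, a macro average does not let large-magnitude fields dominate the score, which is one reason it is sometimes preferred for heterogeneous test sets.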

Mesoscale and Micro-to-Continuum Learning

The mesoscale regime is addressed through surrogate modeling, operator learning, and representation compression techniques that enable tractable, data-efficient simulations of complex microstructure evolution. These techniques are critical for real-world phenomena such as spinodal decomposition, domain switching in ferroics, and metamaterial optimization.

The authors showcase surrogate architectures capable of highly accurate long-horizon rollouts for complex field evolutions under time-dependent boundary conditions, substantiating these claims with direct comparisons to full-field phase-field simulations.
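A long-horizon rollout means the surrogate is applied autoregressively: each predicted field becomes the input for the next step, with the boundary condition supplied at every step. The sketch below shows only that loop. The step function is a placeholder (an explicit diffusion update on a 1D field), not a trained network, a phase-field model, or the architecture used in the review, and all names are hypothetical.

```python
import numpy as np


def toy_surrogate_step(field, boundary_value):
    """Placeholder for a trained surrogate: one explicit diffusion update.

    The interior is updated from the previous field; both end points are
    then overwritten with the boundary value prescribed for this step.
    """
    new = field.copy()
    new[1:-1] = field[1:-1] + 0.25 * (field[2:] - 2.0 * field[1:-1] + field[:-2])
    new[0] = boundary_value
    new[-1] = boundary_value
    return new


def rollout(step_fn, initial_field, boundary_schedule):
    """Autoregressive rollout: feed each prediction back in as the next input.

    boundary_schedule holds one boundary value per time step, so the
    boundary condition can vary in time. Returns an array of shape
    (len(boundary_schedule) + 1, n_points) that includes the initial field.
    """
    field = np.asarray(initial_field, dtype=float)
    trajectory = [field]
    for boundary_value in boundary_schedule:
        field = step_fn(field, boundary_value)
        trajectory.append(field)
    return np.stack(trajectory)


trajectory = rollout(toy_surrogate_step, np.zeros(5), [1.0, 1.0, 0.0])
print(trajectory.shape)
```

Since errors made at one step are carried into every later step, rollout accuracy over many steps is a stricter test than single-step accuracy, which is why comparison against full-field reference simulations matters.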

Figure 4: Surrogate models roll out mesoscale ferroelectric switching dynamics with high fidelity and computational efficiency, essential for inverse design and optimization workflows.

Key challenges are identified, including persistent data scarcity for high-dimensional experimental microstructures, domain shift between experimental and computational datasets, and systematic difficulties in uncertainty quantification and extrapolation.

At the micro-to-continuum scale, advances in deep CNN- and GNN-based frameworks for quantitative microstructure segmentation, classification, and property mapping are outlined. These frameworks routinely outperform classical approaches, support objective property prediction, and facilitate transfer learning, and they show how the choice of data modality and representation shapes predictions of hierarchical materials response.

Ontologies and Standards: Semantic Integration Across Scales

A central theme is the requirement for semantic and ontological integration to enable robust interoperability across the disparate data representations and conceptualizations endemic to different length scales. The review exhaustively discusses how scale-specific communities have independently developed incompatible notions of materials entities and properties (e.g., “molecule” in physics vs. chemistry).

Figure 5: Distinct entities—electrons/atoms (nanoscale), beads (mesoscale), and continuum volumes (microscale)—anchor scale-specific models.

Figure 6: Illustrative conflict between typical chemistry and physics definitions of the molecule.

Figure 7: The EMMO ontology reconciles disparate definitions by mapping them into a coherent multiperspective framework.

The Elementary Multidisciplinary Material Ontology (EMMO) is cited as the canonical approach to formalized, logic-grounded ontologies, providing rigorous mechanisms for semantic reconciliation, property alignment, and scalable knowledge exchange. EMMO’s mereocausality foundations and dual backbone/perspective branches are detailed, highlighting how these support interoperability from quantum through continuum levels.
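The core idea of semantic reconciliation can be sketched without any ontology tooling: a label is resolved together with the community that uses it, so the same word can point to different concepts. The snippet below is a deliberately minimal illustration. It does not use EMMO classes or IRIs, and the `ex:` identifiers, the `Term` type, and the function names are all made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Term:
    """A label as used by one scale-specific community."""
    community: str
    label: str


# Hypothetical concept identifiers; these are NOT real EMMO IRIs.
CONCEPT_OF: Dict[Term, str] = {
    Term("chemistry", "molecule"): "ex:MoleculeChemistrySense",
    Term("physics", "molecule"): "ex:MoleculePhysicsSense",
}


def resolve(term: Term) -> Optional[str]:
    """Return the shared concept identifier for a community term, if mapped."""
    return CONCEPT_OF.get(term)


def same_concept(a: Term, b: Term) -> bool:
    """Two terms match only if both resolve to the same shared concept."""
    concept_a, concept_b = resolve(a), resolve(b)
    return concept_a is not None and concept_a == concept_b


chem = Term("chemistry", "molecule")
phys = Term("physics", "molecule")
print(same_concept(chem, phys))
```

A real ontology adds formal logic, class hierarchies, and relations on top of such a lookup; the sketch only shows why matching on labels alone is unsafe when merging data across communities.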

Figure 8: EMMO's architecture combines causality/mereology with scientific and application-oriented perspectives.

LLMs: Foundation Models and Autonomous Agents

The review identifies the emergence of LLMs as a transformative development for materials informatics, noting their flexibility in domain adaptation, information extraction, generative tasks, and workflow automation. Foundation models such as CrystaLLM, trained on tokenized crystallographic data, are shown to provide robust autoregressive generation of physically valid structures.

Figure 9: CrystaLLM's autoregressive modeling process for crystallographic data, supporting generative and fine-tuned applications.

LLM agent systems that integrate domain-specific toolsets (e.g., ChemCrow) are described as enabling adaptive, sequential reasoning over heterogeneous databases, simulation interfaces, and experimental planning. The review highlights how LLMs coordinate with legacy software to autonomously execute complex, multi-stage discovery pipelines.
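The stepwise tool invocation described here reduces to a loop: a planner picks the next tool and argument from the goal and the history so far, the tool runs, and its result is appended to the history. The sketch below shows that control flow only. It is not the ChemCrow API; the planner is a fixed script standing in for an LLM, and the tools, their names, and their return strings are hypothetical stubs.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]       # (tool name, argument)
Step = Tuple[str, str]         # (tool name, observation)


def lookup_property(query: str) -> str:
    """Hypothetical tool standing in for a materials database query."""
    return f"property record for {query}"


def run_simulation(query: str) -> str:
    """Hypothetical tool standing in for a call to legacy simulation software."""
    return f"simulation result for {query}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_property": lookup_property,
    "run_simulation": run_simulation,
}


def scripted_planner(goal: str, history: List[Step]) -> Optional[Action]:
    """Stand-in for the LLM: returns the next action, or None when finished."""
    plan = [("lookup_property", goal), ("run_simulation", goal)]
    return plan[len(history)] if len(history) < len(plan) else None


def run_agent(goal, planner, tools, max_steps: int = 5) -> List[Step]:
    """Call tools one step at a time until the planner stops or max_steps is hit."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = planner(goal, history)
        if action is None:
            break
        name, argument = action
        if name not in tools:
            history.append((name, "error: unknown tool"))
            continue
        history.append((name, tools[name](argument)))
    return history


history = run_agent("LiCoO2", scripted_planner, TOOLS)
print(history)
```

The `max_steps` bound and the unknown-tool branch are design choices for the sketch: an agent driven by a real model can request tools that do not exist or fail to terminate, so the loop should not trust the planner unconditionally.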

Figure 10: Schematic of an LLM agent orchestrating chemical synthesis planning, using stepwise autonomous tool invocation and reasoning.

Critical evaluation practices and the need for stringent benchmarking are underscored, particularly in light of the risk of "phantom progress" from non-representative metrics and poor construct validity.

Implications and Outlook

This work underscores that, while scale-specific ML and informatics methods yield powerful in-domain results, the major obstacle to holistic materials discovery remains robust scale bridging. Semantic standardization (e.g., via EMMO), open FAIR data infrastructures, and the seamless integration of foundation models and agent-based LLMs are identified as enabling pillars for future unification.

Practical implications include dramatic acceleration of hypothesis generation, synthesis planning, and microstructural analysis, while both theoretical and methodological advances are expected from further convergence of physics-informed ML, uncertainty quantification, and ontological reasoning.

The authors suggest that progressing toward a unified, data-centric materials ecosystem will require continued cross-scale community collaboration, development of flexible and extensible multi-scale AI pipelines, and shared infrastructure incorporating both data and semantic standards.

Conclusion

The review presents an authoritative, analytic synthesis of materials informatics progress and obstacles across the length scales. It argues that future innovation in materials science will hinge not only on more accurate or efficient scale-specific models, but primarily on the establishment of interoperable data protocols, logic-based ontological frameworks, and intelligent AI agents capable of orchestrating end-to-end, cross-scale materials design. The integration of LLMs and semantic standards promises to dissolve traditional compartmentalization, enabling a new paradigm of unified, collaborative, and hypothesis-driven materials discovery.
