NEPMaker: Active learning of neuroevolution machine learning potential for large cells

Published 15 Apr 2026 in physics.comp-ph | (2604.13848v1)

Abstract: Machine learning potentials (MLPs) achieve near first-principles accuracy but often fail for atomic environments outside the training distribution. Active learning can mitigate this limitation; however, its application to large-scale simulations is hindered by the prohibitive cost of labeling entire configurations. Here, we develop a D-optimality-driven active learning framework for the neuroevolution potential (NEP) implemented within the GPUMD package, named NEPMaker. Extrapolative atomic environments are identified on-the-fly and embedded into locally periodic structures, where boundary atoms are optimized to remain close to the training distribution. This strategy enables large-scale simulations to directly contribute to dataset construction, significantly reducing extrapolation errors while improving model robustness and transferability. The proposed framework provides a scalable route for constructing reliable machine learning potentials in complex materials systems, including those involving defects, interfaces, and phase transitions.


Authors (7)

Summary

The paper introduces NEPMaker, which applies D-optimality active learning to construct robust NEP potentials for large-scale molecular dynamics simulations.
It integrates descriptor-based NEP models with the MaxVol algorithm and uncertainty quantification to efficiently extract high-value atomic environments.
Results for sodium, CsPbI$_3$, and GaN reproduce the expected phase transitions and show significant force RMSE reductions, demonstrating practical scalability.

NEPMaker: Active Learning of Neuroevolution Machine Learning Potentials for Large Cells

Overview and Motivation

The paper presents NEPMaker, a D-optimality-driven active learning framework tailored to the neuroevolution potential (NEP) and integrated into the GPUMD package for efficient molecular dynamics (MD) simulations of large systems. The motivation is the need for reliable uncertainty quantification (UQ) and extrapolation detection in machine learning potentials (MLPs), especially when scaling to large, complex materials systems such as those containing defects or interfaces or undergoing phase transitions. Conventional MLPs reach near-DFT accuracy within the training distribution but can incur significant errors outside it, which makes extrapolation control critical for robust MD.

Methodology

NEP Model and Descriptor Construction

The NEP model parameterizes local atomic environments via invariant descriptor vectors (radial and angular), which are processed by a neural network with a single hidden layer to predict per-atom energies. The descriptors depend on the chemical species and relative positions of neighboring atoms and are invariant under translation, rotation, and permutation of like atoms.
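As a rough illustration of this architecture, the sketch below evaluates a per-atom energy from a descriptor vector using one hidden tanh layer. The function name `site_energy`, the weights, and the descriptor values are invented for illustration, the bias sign convention is an assumption, and the construction of the radial and angular descriptors themselves is not shown.

```python
import math

def site_energy(q, w0, b0, w1, b1):
    """Per-atom energy from descriptor q via one hidden tanh layer (illustrative only)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, q)) - b)
        for row, b in zip(w0, b0)
    ]
    return sum(w * h for w, h in zip(w1, hidden)) - b1

# Hypothetical 3-component descriptor and 2 hidden neurons.
q = [0.2, -0.1, 0.5]
w0 = [[0.3, 0.1, -0.2], [-0.4, 0.2, 0.1]]  # input-to-hidden weights
b0 = [0.0, 0.1]                            # hidden-layer biases
w1 = [1.5, -0.7]                           # hidden-to-output weights
b1 = 0.05                                  # output bias

e_atom = site_energy(q, w0, b0, w1, b1)
# The energy of a structure is the sum of such per-atom terms.
e_total = sum(site_energy(qi, w0, b0, w1, b1) for qi in [q, q])
```

Because the total energy is a sum over atoms, each atom's descriptor can be judged for extrapolation on its own, which is what makes per-environment selection possible.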

D-Optimality-Based Active Learning

D-optimality is employed for UQ: it quantifies how well the training set spans descriptor space by selecting the environments that maximize the determinant of the design matrix. For linear potentials, the extrapolation grade $y$ directly indicates whether an environment is interpolative or extrapolative. NEPMaker generalizes this criterion to NEP's nonlinear architecture and uses the MaxVol algorithm, implemented efficiently on GPUs, to select the most informative active set of descriptors.
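To make the linear case concrete, here is a minimal NumPy sketch of a MaxVol-style selection and the resulting extrapolation grade. It is not NEPMaker's GPU implementation: the random rows stand in for whatever per-environment feature vectors the method uses, the function names and tolerance are hypothetical, and the starting guess assumes the first m rows form a non-singular matrix.

```python
import numpy as np

def maxvol(B, tol=1e-2, max_swaps=1000):
    """Pick m rows of the N x m matrix B whose square submatrix has locally maximal |det|."""
    n_rows, m = B.shape
    idx = np.arange(m)  # assumption: the first m rows are non-singular
    for _ in range(max_swaps):
        C = B @ np.linalg.inv(B[idx])  # every row expressed in the active basis
        i, j = np.unravel_index(np.argmax(np.abs(C)), C.shape)
        if abs(C[i, j]) <= 1.0 + tol:
            break                      # no single swap can grow |det| noticeably
        idx[j] = i                     # this swap multiplies |det| by |C[i, j]|
    return idx

def extrapolation_grade(g, A_inv):
    """Largest absolute coefficient of g when expanded in the active-set rows."""
    return float(np.max(np.abs(g @ A_inv)))

rng = np.random.default_rng(0)
B = rng.normal(size=(200, 4))          # toy stand-in for training features
active = maxvol(B)
A_inv = np.linalg.inv(B[active])

grades = np.array([extrapolation_grade(g, A_inv) for g in B])
far = extrapolation_grade(10.0 * B[active[0]], A_inv)  # a deliberately distant row
```

In this linear setting every training row ends up with a grade of at most about 1, while a row scaled well beyond the active set (`far`) receives a grade well above 1 and would be flagged as extrapolative.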

For multi-element systems, atomic environments are partitioned by species, mitigating scale imbalances in the selection process and ensuring representative sampling across chemical diversity.
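The effect of partitioning by species can be seen in a small toy example. The sketch assumes one active set per species and uses invented two-component descriptors whose magnitudes differ between Ga and N; the helper names are hypothetical and the example keeps every row in its active set rather than running a selection.

```python
import numpy as np

def split_by_species(species, descriptors):
    """Group descriptor rows by chemical species."""
    groups = {}
    for s, q in zip(species, descriptors):
        groups.setdefault(s, []).append(q)
    return {s: np.array(rows) for s, rows in groups.items()}

def grade(g, A):
    """Extrapolation grade of row g relative to the square active-set matrix A."""
    return float(np.max(np.abs(g @ np.linalg.inv(A))))

species = ["Ga", "N", "Ga", "N"]
descriptors = [[1.0, 0.0], [0.1, 0.0], [0.0, 1.0], [0.0, 0.1]]
active = split_by_species(species, descriptors)  # toy case: every row is kept

probe = np.array([0.5, 0.5])
g_ga = grade(probe, active["Ga"])  # inside the span of the Ga set
g_n = grade(probe, active["N"])    # far outside the span of the N set
```

The same descriptor values are interpolative relative to the Ga set but strongly extrapolative relative to the N set, which a single pooled active set dominated by the larger Ga values would not reveal.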

Local Fragment Extraction in Large-Scale Simulations

The framework circumvents the prohibitive cost of labeling entire supercells by extracting high-uncertainty local fragments. These fragments are embedded in periodic cells, and the boundary atoms are optimized with an uncertainty-minimizing objective that keeps the surrounding environments close to the training distribution. This strategy improves on cutout (vacuum) and random lattice embedding approaches, yielding physically meaningful atomic configurations and reducing ambiguity in energy assignment. The boundary optimization also prevents boundary atoms from contaminating the training set with out-of-distribution environments.
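The extraction step alone can be sketched as follows, assuming an orthorhombic periodic box and a spherical cutoff around the flagged atom. The function name, the cutoff value, and the toy lattice are hypothetical, and the embedding into a new periodic cell and the boundary optimization are not reproduced here.

```python
import numpy as np

def extract_fragment(positions, box, center, radius):
    """Return indices and displacements of atoms within `radius` of atom `center`,
    using the minimum-image convention in an orthorhombic periodic box."""
    delta = positions - positions[center]
    delta -= box * np.round(delta / box)  # map displacements to the nearest image
    dist = np.linalg.norm(delta, axis=1)
    inside = np.where(dist <= radius)[0]
    return inside, delta[inside]

# Toy 4 x 4 x 4 simple-cubic lattice with spacing 1.0 (box length 4.0).
grid = np.arange(4.0)
positions = np.array([[x, y, z] for x in grid for y in grid for z in grid])
box = np.array([4.0, 4.0, 4.0])

# Atom 0 stands in for an environment flagged as extrapolative.
idx, rel = extract_fragment(positions, box, center=0, radius=1.01)
```

Here the fragment holds the central atom and its six nearest neighbors, three of which are reached across the periodic boundary. The returned displacements are relative to the central atom, so they can be placed directly into a smaller cell.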

Active Learning Workflow

The NEPMaker workflow proceeds as follows:

1. Initialization of the NEP model using a small, manually curated or previously available dataset.
2. Selection of an active set via MaxVol from the current training data.
3. MD simulation under target conditions, with online extrapolation detection through the D-optimality extrapolation grade.
4. Merging, refinement, and boundary optimization of extrapolative local environments extracted from large supercells prior to labeling with DFT.
5. Iterative execution of steps 1–4 until no further extrapolative environments are encountered.
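The control flow of these steps can be sketched as a plain Python loop. Every callable here is a hypothetical placeholder rather than a NEPMaker or GPUMD function, and retraining on the enlarged dataset at the start of each pass is an assumption about how the iteration is carried out. The toy stand-ins treat an "environment" as a single number and flag anything outside the range seen in training.

```python
def active_learning_loop(train_set, fit, select_active_set, run_md,
                         build_fragments, label, max_iter=50):
    """Repeat fit -> select -> MD -> label until MD flags nothing (illustrative)."""
    for iteration in range(1, max_iter + 1):
        model = fit(train_set)                            # step 1 (retrain each pass)
        active_set = select_active_set(model, train_set)  # step 2
        flagged = run_md(model, active_set)               # step 3
        if not flagged:
            return model, train_set, iteration            # stopping criterion
        fragments = build_fragments(flagged, active_set)  # step 4: merge and refine
        train_set = train_set + [label(f) for f in fragments]
    raise RuntimeError("no convergence within max_iter iterations")

# Toy stand-ins for the real components.
visited = [0.1, 0.4, 0.9, 1.7, 2.5]   # environments the "MD" run encounters

def fit(data):
    return (min(data), max(data))     # "model" = range covered by the data

def select_active_set(model, data):
    return model

def run_md(model, active_set):
    lo, hi = active_set
    return [x for x in visited if not lo <= x <= hi][:1]  # stop at first flag

def build_fragments(flagged, active_set):
    return flagged

def label(fragment):
    return fragment                   # stands in for a DFT calculation

model, data, n_iter = active_learning_loop(
    [0.0, 1.0], fit, select_active_set, run_md, build_fragments, label)
```

In the toy run, each pass adds the first out-of-range value to the dataset, and the loop stops on the pass in which the simulation flags nothing.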

Numerical Results

Sodium Melting

The active learning loop converges after seven iterations, resulting in 130 training structures and reducing the force RMSE to 10.38 meV/Å. The melting point obtained (350 K) is 20 K below the experimental value (370 K), demonstrating the framework's efficacy for simple metals.

CsPbI$_3$ Phase Transitions

For CsPbI$_3$, a highly anharmonic perovskite with multiple solid-solid transitions, NEPMaker achieves a converged potential after 23 iterations and 401 structures. Large-cell MD (23,040 atoms) captures both orthorhombic-to-tetragonal (280 K) and tetragonal-to-cubic (400 K) transitions, reproducing known transition sequences and lattice parameter evolutions. The force RMSE converges to 43.38 meV/Å.

GaN B4–B1 Phase Transition

The B4–B1 reconstructive phase transition in GaN, characterized by pronounced size effects and multiple nucleation pathways, was simulated using metadynamics on supercells of up to 27,648 atoms. NEPMaker enables active learning directly within these extended simulations, extracting and labeling extrapolative environments on-the-fly. The resulting NEP models capture both interface propagation and metastable five-fold-coordinated intermediate phases, consistent with previous studies but now accessible at significantly larger scales. The final force error for GaN was 274.44 meV/Å.

Implications and Future Directions

The NEPMaker framework establishes a scalable, physically consistent route for uncertainty-controlled construction of MLPs in large-scale, complex materials systems. Its integration of D-optimality, fragment extraction, and boundary optimization allows on-the-fly potential refinement within ongoing MD simulations, providing robust transferability across diverse phenomena (melting, phase transitions, defect evolution).

Practically, NEPMaker enables efficient sampling and labeling of relevant atomic environments without requiring infeasible DFT calculations on full supercells, resulting in accurate, stable potentials for systems previously considered intractable. Theoretical implications include improved generalization in ML-driven atomistic modeling and systematic reduction of extrapolation-induced errors.

Future developments may focus on further automating fragment extraction, incorporating advanced uncertainty measures, and benchmarking NEPMaker across broader classes of materials and more exotic structural transformations. Advances in descriptor diversity and neural architectures, combined with NEPMaker's workflow, may propel scalable, reliable MD simulations into regimes of chemical complexity and scale previously unreachable.

Conclusion

NEPMaker introduces a principled, D-optimality-driven active learning workflow for NEP-based MLP construction in large-scale materials modeling. By integrating uncertainty quantification, fragment extraction with boundary optimization, and iterative dataset refinement, NEPMaker delivers robust, transferable potentials validated across metallic, perovskite, and semiconductor systems. The method offers a generalizable and efficient solution for on-the-fly, automated construction of MLPs with first-principles fidelity and uncertainty control, advancing the state of computational materials science for dynamic, large-scale systems.

For further details, source code can be accessed at the NEPMaker GitLab repository.

