
Efficiency of training sets for neural network interatomic potentials

Determine whether training sets for neural network interatomic potentials can be made more efficient while achieving results similar to, or better than, those obtained by training on large amounts of data.


Background

In atomistic machine learning, generating high-quality training data is computationally expensive, and larger datasets further increase training costs. While bigger datasets can improve the generalization of neural network interatomic potentials (NNIPs), the paper questions whether equivalent or superior performance can be achieved with smaller, more efficient training sets.

This uncertainty motivates the development of principled methods to quantify dataset information content and redundancy, with the goal of minimizing dataset size without sacrificing coverage of the relevant configuration space.
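To make the notion of "information content" concrete, the sketch below shows one common, model-free way to estimate it: a Kozachenko-Leonenko k-nearest-neighbour entropy estimate over atomic-environment descriptor vectors. This is an illustrative assumption, not the paper's specific estimator; the descriptor dimensionality, the choice of k, and the helper name knn_entropy are all hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(descriptors, k=3):
    """Kozachenko-Leonenko estimate of differential entropy (in nats)
    for a set of descriptor vectors, one row per atomic environment.

    Heuristically, a candidate training subset whose entropy is close to
    that of the full dataset covers a similar region of configuration
    space, suggesting the discarded data were largely redundant.
    """
    X = np.asarray(descriptors, dtype=float)
    n, d = X.shape
    # Distance from each point to its k-th nearest neighbour (skip self).
    eps = cKDTree(X).query(X, k=k + 1)[0][:, -1]
    log_unit_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(eps + 1e-12))

# Toy usage: random vectors stand in for SOAP/ACE-style descriptors.
rng = np.random.default_rng(0)
full = rng.normal(size=(2000, 8))
subset = full[rng.choice(2000, size=500, replace=False)]
print(knn_entropy(full), knn_entropy(subset))
```

In this toy example, a near-identical entropy estimate for the subset and the full set would indicate that the extra configurations add little new information, which is the kind of redundancy diagnostic the open question calls for.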

References

Furthermore, while training models on large amounts of data can enhance the generalization power of NNIPs, it is still unclear whether training sets can be made more efficient while achieving similar or better results.

Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory (arXiv:2404.12367, Schwalbe-Koda et al., 18 Apr 2024), Results, subsection "Information-theoretical dataset analysis for machine learning potentials".