MD-HIT: Machine learning for materials property prediction with dataset redundancy control (2307.04351v1)
Abstract: Materials datasets usually contain many redundant (highly similar) materials, a result of the tinkering style of material design practiced over the history of materials research. For example, the Materials Project database has many cubic perovskite materials similar to SrTiO$_3$. This sample redundancy causes random-split evaluation of machine learning models to fail: the ML models tend to achieve overestimated predictive performance, which is misleading for the materials science community. This issue is well known in bioinformatics for protein function prediction, where a redundancy reduction procedure (CD-HIT) is always applied to ensure that no pair of samples has a sequence similarity greater than a given threshold. This paper surveys overestimated ML performance in the literature for both composition-based and structure-based material property prediction. We then propose a material dataset redundancy reduction algorithm called MD-HIT and evaluate it with several composition- and structure-based distance thresholds for reducing dataset sample redundancy. We show that with this control, the reported performance of the models tends to better reflect their true prediction capability. Our MD-HIT code can be freely accessed at https://github.com/usccolumbia/MD-HIT
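The threshold-based redundancy control described in the abstract can be illustrated with a minimal greedy sketch. This is not the MD-HIT implementation: the function names, the Euclidean distance on element-fraction vectors, the toy compositions, and the threshold value are all assumptions chosen for illustration. The idea is the same as in CD-HIT-style filtering: a sample is kept only if it is at least a threshold distance away from every sample already kept.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_redundancy_filter(vectors, threshold):
    """Return indices of a subset in which every pair of kept
    samples is at least `threshold` apart (hypothetical helper)."""
    kept = []
    for i, v in enumerate(vectors):
        # Keep the sample only if it is not too close to any kept sample.
        if all(euclidean(v, vectors[j]) >= threshold for j in kept):
            kept.append(i)
    return kept


# Toy atomic fractions over (Sr, Ba, Ti, O); values are illustrative only.
samples = [
    (0.20, 0.00, 0.20, 0.60),      # SrTiO3
    (0.18, 0.02, 0.20, 0.60),      # Sr0.9Ba0.1TiO3, a near-duplicate of SrTiO3
    (0.00, 0.20, 0.20, 0.60),      # BaTiO3
    (0.00, 0.00, 0.3333, 0.6667),  # TiO2
]

kept = greedy_redundancy_filter(samples, threshold=0.1)
print(kept)  # [0, 2, 3]
```

In this toy run the doped variant is dropped because it lies within the threshold of SrTiO3. The result of a greedy pass depends on the order of the samples, so a real pipeline would need to fix that order deliberately. Train and test splits would then be drawn from the filtered subset.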