Representation of compounds for machine-learning prediction of physical properties (1611.08645v2)

Published 26 Nov 2016 in cond-mat.mtrl-sci

Abstract: The representations of a compound, called "descriptors" or "features", play an essential role in constructing a machine-learning model of its physical properties. In this study, we adopt a procedure for generating a systematic set of descriptors from simple elemental and structural representations. First it is applied to a large dataset composed of the cohesive energy for about 18000 compounds computed by density functional theory (DFT) calculation. As a result, we obtain a kernel ridge prediction model with a prediction error of 0.041 eV/atom, which is close to the "chemical accuracy" of 1 kcal/mol (0.043 eV/atom). The procedure is also applied to two smaller datasets, i.e., a dataset of the lattice thermal conductivity (LTC) for 110 compounds computed by DFT calculation and a dataset of the experimental melting temperature for 248 compounds. We examine the performance of the descriptor sets on the efficiency of Bayesian optimization in addition to the accuracy of the kernel ridge regression models. They exhibit good predictive performances.

Authors: Atsuto Seko, Hiroyuki Hayashi, Keita Nakayama, Akira Takahashi, Isao Tanaka

Summary

Analysis of Compound Representations for Machine Learning in Materials Science

The paper "Representation of compounds for machine-learning prediction of physical properties" by Atsuto Seko et al. examines how machine-learning (ML) models can predict the physical properties of compounds. It focuses on selecting and refining descriptors, the quantitative representations of compounds on which the accuracy of such prediction models depends.

The core of the work is the construction and validation of systematic descriptor sets, derived from elemental and structural data, for ML prediction of cohesive energy, lattice thermal conductivity, and melting temperature. The approach is applied to datasets of very different sizes: a large dataset of cohesive energies computed by density functional theory (DFT) for approximately 18,000 compounds, and two smaller datasets covering DFT-computed lattice thermal conductivity (110 compounds) and experimental melting temperature (248 compounds).

Key Numerical Results

The headline result is a kernel ridge regression model that predicts cohesive energy with an error of 0.041 eV/atom, close to the "chemical accuracy" of 1 kcal/mol (0.043 eV/atom). The summary attributes this accuracy to the systematically generated descriptors. Once trained on the DFT dataset, such a model can estimate the cohesive energy of a compound without running a new DFT calculation for it.
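As a rough illustration of the model class named above, the following is a minimal sketch of kernel ridge regression written with NumPy only. The Gaussian kernel, the values of `gamma` and `lam`, the helper names, and the synthetic data are all assumptions made for the example; they are not the paper's settings or its dataset.

```python
import numpy as np

def gaussian_kernel(a, b, gamma):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def krr_fit(x_train, y_train, gamma=1.0, lam=1e-3):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    k = gaussian_kernel(x_train, x_train, gamma)
    return np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)

def krr_predict(x_train, alpha, x_new, gamma=1.0):
    """Predict as a kernel-weighted sum over the training points."""
    return gaussian_kernel(x_new, x_train, gamma) @ alpha

# Synthetic stand-in for (descriptor vector, property) pairs; not real data.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(x[:, 0]) + x[:, 1] * x[:, 2]

alpha = krr_fit(x[:150], y[:150])
pred = krr_predict(x[:150], alpha, x[150:])
rmse = float(np.sqrt(np.mean((pred - y[150:]) ** 2)))
print(f"held-out RMSE: {rmse:.4f}")
```

In practice, `gamma` and `lam` would be chosen by cross-validation, and the held-out error would be reported in the units of the target property (eV/atom for cohesive energy).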

Insights on Descriptor Generation

The central methodological idea is to treat a compound's representation matrix as a data distribution in an n-dimensional space. That distribution is then characterized by statistical quantities such as the mean, standard deviation, and covariance, and these quantities form the descriptors. Because the same statistics are computed for every compound, diverse compounds are represented consistently regardless of their compositional or structural differences, which addresses a significant difficulty in applying ML to heterogeneous materials datasets.
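The idea can be sketched in a few lines of NumPy. The sketch assumes that each row of the representation matrix corresponds to one atom and each column to one elemental quantity; the function name `matrix_descriptors`, the choice of two columns (atomic number and Pauling electronegativity), and the two example compounds are illustrative assumptions, not the paper's actual feature set.

```python
import numpy as np

def matrix_descriptors(rep):
    """Summarize an (n_atoms, n_features) matrix as a fixed-length vector.

    The vector holds the column means, the column standard deviations, and
    the covariances between distinct columns (upper triangle only).
    """
    rep = np.asarray(rep, dtype=float)
    n_cols = rep.shape[1]
    mean = rep.mean(axis=0)
    std = rep.std(axis=0)  # population standard deviation (ddof=0)
    cov = np.atleast_2d(np.cov(rep, rowvar=False, bias=True))
    upper = np.triu_indices(n_cols, k=1)
    return np.concatenate([mean, std, cov[upper]])

# Illustrative rows: [atomic number, Pauling electronegativity] per atom.
nacl = [[11, 0.93], [17, 3.16]]             # Na, Cl
mgf2 = [[12, 1.31], [9, 3.98], [9, 3.98]]   # Mg, F, F

d_nacl = matrix_descriptors(nacl)
d_mgf2 = matrix_descriptors(mgf2)
print(d_nacl.shape, d_mgf2.shape)  # (5,) (5,)
```

Both compounds yield a vector of length 5 even though they contain different numbers of atoms, so they can be fed to the same regression model.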

The proposed method captures a compound's intrinsic attributes and also accounts for interactions between elemental and structural properties. This is reflected in the lower prediction error obtained when structural descriptors derived from prototype structures are included.

Implications and Speculation on Future Directions

The work has clear implications for computational materials science. It suggests that candidate compounds can be screened efficiently without relying entirely on high-cost simulations such as DFT calculations, which could speed up the exploration and discovery of new materials.

Looking ahead, combining these descriptor generation techniques with large-scale ML approaches could improve predictive capability further. Future research might refine the models with techniques such as deep learning, which may uncover complex correlations in high-dimensional datasets that traditional ML and kernel methods do not capture.

Extending the methodology to non-crystalline and molecular compounds is another promising avenue. As computational resources and ML frameworks evolve, this framework could lay the groundwork for more accurate predictive tools across a range of materials technologies.

In summary, the paper addresses the critical problem of descriptor selection in ML models and demonstrates the approach with quantitative results on three datasets. The adaptability and effectiveness of the method point to a useful role in data-driven materials exploration and design.