Chemical Space-Informed Machine Learning Models for Rapid Predictions of X-ray Photoelectron Spectra of Organic Molecules (2405.20033v2)

Published 30 May 2024 in physics.chem-ph

Abstract: We present machine learning models based on kernel-ridge regression for predicting X-ray photoelectron spectra of organic molecules originating from the $K$-shell ionization energies of carbon (C), nitrogen (N), oxygen (O), and fluorine (F) atoms. We constructed the training dataset through high-throughput calculations of $K$-shell core-electron binding energies (CEBEs) for 12,880 small organic molecules in the bigQM7$\omega$ dataset, employing the $\Delta$-SCF formalism coupled with meta-GGA-DFT and a variationally converged basis set. The models are cost-effective, as they require the atomic coordinates of a molecule generated using universal force fields while estimating the target-level CEBEs corresponding to DFT-level equilibrium geometry. We explore transfer learning by utilizing the atomic environment feature vectors learned using a graph neural network framework in kernel-ridge regression. Additionally, we enhance accuracy within the $\Delta$-machine learning framework by leveraging inexpensive baseline spectra derived from Kohn--Sham eigenvalues. When applied to 208 combinatorially substituted uracil molecules larger than those in the training set, our analyses suggest that the models may not provide quantitatively accurate predictions of CEBEs but offer a strong linear correlation relevant for virtual high-throughput screening. We present the dataset and models as the Python module, ${\tt cebeconf}$, to facilitate further explorations.

Summary

The paper presents a machine learning (ML) framework that combines kernel-ridge regression, transfer learning, and Δ-learning to predict XPS core-electron binding energies at close to DFT-level accuracy.
It employs a robust dataset of 12,880 organic molecules and 85,837 atomic entries, achieving a mean absolute error of less than 0.1 eV in out-of-sample predictions.
The study offers a scalable, computationally efficient approach using universal force field geometries to bridge theoretical predictions with practical high-throughput screening.

Chemical Space-Informed Machine Learning Models for X-ray Photoelectron Spectra

This paper presents a machine learning (ML) framework for predicting the X-ray photoelectron spectra (XPS) of organic molecules, focusing on the $K$-shell core-electron binding energies (CEBEs) of carbon (C), nitrogen (N), oxygen (O), and fluorine (F) atoms. The authors train kernel-ridge regression (KRR) models on 12,880 small organic molecules from the bigQM7$\omega$ dataset, with CEBEs calculated via the $\Delta$-SCF method coupled with meta-generalized gradient approximation density functional theory (meta-GGA-DFT).

The primary dataset, known as -CEBECONF, comprises 85,837 entries for C, O, N, and F atoms and is used to train atom-specific ML models. The authors incorporate transfer learning and $\Delta$-learning to improve prediction accuracy. Transfer learning reuses atomic environment feature vectors learned by a graph neural network (GNN) as descriptors in KRR, while the $\Delta$-learning framework corrects inexpensive baseline values derived from Kohn–Sham eigenvalues.

A significant aspect of this work is its low computational cost: the models take molecular geometries generated with the universal force field (UFF) as input while predicting CEBEs that correspond to DFT-level equilibrium geometries. This makes predictions scalable to larger datasets without significantly compromising DFT-level accuracy. Tested against SCAN/Tight-Full level calculations, the ML models achieve strong linear correlations between predicted and calculated CEBEs, which is useful for virtual high-throughput screening even though the predictions are not always quantitatively accurate for individual molecules.
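The distinction between a strong linear correlation and quantitative accuracy can be illustrated with a small synthetic sketch. The numbers below are invented for illustration (they are not the paper's uracil results): the "predictions" carry a systematic scale and offset error, so the mean absolute error is large while the Pearson correlation stays close to 1, which is the situation in which a model can still rank candidates during screening.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption, not data from the paper):
# 208 "reference" CEBEs in eV, and predictions with a systematic
# scale/offset error plus a small amount of random noise.
reference = rng.uniform(285.0, 295.0, size=208)
predicted = 0.9 * reference + 30.0 + rng.normal(0.0, 0.05, size=208)

# Quantitative accuracy: mean absolute error in eV.
mae = np.mean(np.abs(predicted - reference))

# Linear correlation: Pearson coefficient between reference and prediction.
pearson_r = np.corrcoef(reference, predicted)[0, 1]

print(f"MAE = {mae:.2f} eV, Pearson r = {pearson_r:.4f}")
```

A high correlation with a sizeable MAE means the ordering of molecules is largely preserved even though individual values are shifted.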

Technically, the paper describes the optimization of the KRR model hyperparameters: the kernel width ($\sigma$) and the regularization parameter ($\lambda$) are tuned to improve prediction accuracy. The ML models use two main descriptors: the atomic Coulomb matrix (ACM) and atomic environment descriptors learned from GNN embeddings (AtmEnv). Notably, AtmEnv, trained using the SchNet framework, performs best, with a mean absolute error (MAE) of less than 0.1 eV for out-of-sample predictions, which reflects the potential of GNNs to capture detailed atomic environments.
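As a rough illustration of how $\sigma$ and $\lambda$ enter a KRR model, the following minimal sketch fits KRR with NumPy and picks the two hyperparameters by a grid search on a held-out split. It assumes a Gaussian kernel and random feature vectors as stand-ins for ACM or AtmEnv descriptors; the kernel choice, the grid values, and all function names are illustrative assumptions, not the `cebeconf` API or the authors' exact procedure.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    """Predict targets for X_new from the fitted coefficients."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def grid_search(X_tr, y_tr, X_val, y_val, sigmas, lams):
    """Return (validation MAE, sigma, lam) with the lowest validation MAE."""
    best = None
    for sigma in sigmas:
        for lam in lams:
            alpha = krr_fit(X_tr, y_tr, sigma, lam)
            pred = krr_predict(X_tr, alpha, X_val, sigma)
            mae = np.mean(np.abs(pred - y_val))
            if best is None or mae < best[0]:
                best = (mae, sigma, lam)
    return best

# Synthetic descriptors and targets (assumption: stand-ins for real features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

best_mae, best_sigma, best_lam = grid_search(
    X[:150], y[:150], X[150:], y[150:],
    sigmas=[0.5, 1.0, 2.0, 4.0], lams=[1e-6, 1e-4, 1e-2],
)
print(f"best sigma = {best_sigma}, best lambda = {best_lam}, "
      f"validation MAE = {best_mae:.3f}")
```

In an atom-specific setup such as the one described, one such model would be trained per element (C, N, O, F), each with its own tuned hyperparameters.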

The paper acknowledges the limitations of using Koopmans' approximation for baseline CEBEs, for which deviations of over 17 eV were recorded relative to $G_{\Delta H}W_0$ reference calculations. However, when the Koopmans-derived values serve only as the baseline in a $\Delta$-learning approach, the resulting ML models achieve satisfactory accuracy for practical applications.
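The idea behind $\Delta$-learning can be sketched in a few lines: instead of learning the target CEBE directly, the model learns the difference between the target and a cheap baseline, and the prediction is the baseline plus the learned correction. The example below is a hedged sketch on synthetic data. The roughly 17 eV baseline offset is chosen only to echo the magnitude quoted above, and the Gaussian kernel, the mean-centering of the correction, and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_delta_model(X, baseline, target, sigma=1.0, lam=1e-6):
    """Fit KRR to the correction (target - baseline), centred on its mean."""
    delta = target - baseline
    offset = delta.mean()
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), delta - offset)
    return alpha, offset

def predict_delta_model(X_train, alpha, offset, X_new, baseline_new, sigma=1.0):
    """Prediction = cheap baseline + learned correction."""
    correction = gaussian_kernel(X_new, X_train, sigma) @ alpha + offset
    return baseline_new + correction

# Synthetic data (assumption): "target" plays the role of the accurate CEBE,
# "baseline" the role of a cheap estimate, e.g. from Kohn-Sham eigenvalues,
# that is off by about 17 eV plus a smooth environment-dependent term.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(150, 3))
target = 290.0 + 2.0 * np.sin(X[:, 0]) + X[:, 1] * X[:, 2]
baseline = target - 17.0 - 0.5 * X[:, 0]

train, test = slice(0, 100), slice(100, 150)
alpha, offset = fit_delta_model(X[train], baseline[train], target[train])
pred = predict_delta_model(X[train], alpha, offset, X[test], baseline[test])

mae_baseline = np.mean(np.abs(baseline[test] - target[test]))
mae_delta = np.mean(np.abs(pred - target[test]))
print(f"baseline MAE = {mae_baseline:.2f} eV, "
      f"Delta-ML MAE = {mae_delta:.4f} eV")
```

The correction is typically a smoother function of the atomic environment than the target itself, which is why a baseline with a large but systematic error can still be useful.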

The implications of this research are substantial: it offers a route to efficient, high-throughput screening of materials by predicting XPS with minimal computational resources. In practice, the dataset and models are distributed as the Python module ${\tt cebeconf}$, a freely available tool that could support further work on XPS prediction.

In conclusion, this work exemplifies the fusion of chemistry, machine learning, and computational efficiency, aiming for rapid and accurate estimations of spectroscopic properties that could significantly aid experimental analysis and materials design. Future developments could include integrating more descriptors or extending this framework to larger and more complex molecular datasets, further bridging the gap between theoretical predictions and experimental spectroscopic analyses.
