- The paper presents an ML framework that combines kernel ridge regression, transfer learning, and Δ-learning to predict XPS core-electron binding energies at close to DFT-level accuracy.
- It uses a dataset of 12,880 organic molecules and 85,837 atomic entries, achieving a mean absolute error of less than 0.1 eV in out-of-sample predictions.
- The study offers a scalable, computationally efficient approach using universal force field geometries to bridge theoretical predictions with practical high-throughput screening.
Chemical Space-Informed Machine Learning Models for X-ray Photoelectron Spectra
This paper presents a machine learning (ML) framework for predicting the X-ray photoelectron spectra (XPS) of organic molecules, focusing on the core-electron binding energies (CEBEs) of carbon (C), nitrogen (N), oxygen (O), and fluorine (F). The authors use kernel ridge regression (KRR) models trained on a dataset of 12,880 small organic molecules, with CEBEs calculated via the Δ-SCF method coupled with meta-generalized gradient approximation density functional theory (meta-GGA-DFT).
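For orientation, the Δ-SCF method mentioned above estimates a binding energy as the difference between two self-consistent total energies. The notation below is a generic textbook form and is not taken from the paper:

```latex
E_b^{\Delta\text{-SCF}}
  = E_{\mathrm{tot}}^{N-1}(\text{core hole})
  - E_{\mathrm{tot}}^{N}(\text{ground state})
```

Here $E_{\mathrm{tot}}^{N}$ is the total energy of the neutral molecule, and $E_{\mathrm{tot}}^{N-1}$ is that of the ion with one electron removed from the core orbital of the atom of interest.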
The primary dataset, known as -CEBECONF, comprises 85,837 entries for C, O, N, and F (CONF) atoms and is used to train atom-specific ML models. The authors add transfer learning and Δ-learning to improve prediction accuracy. Transfer learning uses atomic environment feature vectors derived from graph neural networks (GNNs), while the Δ-learning framework refines predictions starting from baseline Kohn–Sham eigenvalues.
A significant aspect of this work is its low computational cost: molecular geometries are generated with the universal force field (UFF) rather than with more expensive methods. This makes predictions scalable to larger datasets with little loss of DFT-level accuracy. Tested against SCAN/Tight-Full level calculations, the ML models achieve strong linear correlations between predicted and calculated CEBEs. Such correlations are useful for virtual high-throughput screening even though the predictions are not always quantitatively accurate for individual molecules.
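The distinction between strong linear correlation and quantitative accuracy can be illustrated with a small sketch. The numbers below are invented for illustration only (they are not results from the paper), and the helper names are hypothetical:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_absolute_error(x, y):
    """Average absolute deviation between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Made-up "reference" binding energies in eV (illustrative only)
reference = [290.1, 291.4, 292.8, 294.0, 295.6]
# Made-up "predictions": an exact linear function of the reference values,
# so the ranking is preserved but individual values are off
predicted = [0.8 * e + 59.0 for e in reference]

r = pearson_r(reference, predicted)
err = mean_absolute_error(reference, predicted)
print(f"Pearson r = {r:.3f}, MAE = {err:.3f} eV")
```

In this toy case the correlation is perfect while the MAE is about 0.49 eV, which is why a well-correlated model can still rank candidates reliably in a screening workflow.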
The paper also describes the optimization of the KRR model hyperparameters: the kernel width (σ) and the regularization parameter (λ) are tuned to improve prediction accuracy. The ML models use two main descriptors: the atomic Coulomb matrix (ACM) and atomic environment descriptors learned from GNN embeddings (AtmEnv). Notably, AtmEnv, trained using the SchNet framework, performs better in predicting CEBEs, with a mean absolute error (MAE) of less than 0.1 eV for out-of-sample predictions, which reflects the potential of GNNs to capture detailed atomic environments.
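The role of σ and λ can be sketched as follows. This is a minimal example that assumes a Gaussian kernel and a simple hold-out validation split; the summary does not state the kernel form or the validation protocol used in the paper. The data are synthetic, and the function names are hypothetical rather than part of the cebeconf module:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)) for rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    """Predict as a kernel-weighted sum over the training points."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def tune(X_tr, y_tr, X_val, y_val, sigmas, lams):
    """Grid search: return (validation MAE, sigma, lam) with the lowest MAE."""
    best = None
    for s in sigmas:
        for l in lams:
            alpha = krr_fit(X_tr, y_tr, s, l)
            mae = np.abs(krr_predict(X_tr, alpha, X_val, s) - y_val).mean()
            if best is None or mae < best[0]:
                best = (mae, s, l)
    return best

# Synthetic stand-in for descriptor vectors and target values
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 3))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] * X[:, 2]
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

sigmas = [0.1, 0.3, 1.0, 3.0]   # candidate kernel widths
lams = [1e-6, 1e-3, 1e-1]       # candidate regularization strengths
best_mae, best_sigma, best_lam = tune(X_tr, y_tr, X_val, y_val, sigmas, lams)
print(f"best sigma={best_sigma}, lambda={best_lam}, validation MAE={best_mae:.4f}")
```

A small σ makes the model respond only to very close neighbors in descriptor space, while a large λ smooths the fit at the cost of training accuracy. The grid search simply picks the pair with the lowest hold-out error.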
The paper acknowledges the limitations of Koopmans' approximation as a source of baseline CEBEs: deviations of more than 17 eV from GΔHW0 reference calculations were recorded. However, when Koopmans-derived baselines are used within a Δ-learning approach, the resulting ML models achieve satisfactory accuracy for practical applications.
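The idea behind Δ-learning on a Koopmans-type baseline can be sketched as follows: the model learns only the difference between the reference value and the cheap estimate (the negative core-orbital eigenvalue), and the correction is added back at prediction time. Everything below is an assumption made for illustration: the data are synthetic, the Gaussian kernel and hyperparameter values are arbitrary choices, and the variable names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)) for rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))    # stand-in atomic descriptors

# Synthetic stand-ins in eV (not data from the paper):
eps_core = -(270.0 + 5.0 * X[:, 0])          # core-orbital eigenvalues
baseline = -eps_core                         # Koopmans-type estimate
reference = baseline + 17.0 + 0.5 * np.tanh(2.0 * X[:, 1])  # "true" CEBEs

train, test = slice(0, 150), slice(150, 200)

# Delta-learning target: what the baseline misses
delta = reference - baseline
delta_mean = delta[train].mean()             # center the target before fitting

sigma, lam = 0.5, 1e-6                       # arbitrary illustrative values
K = gaussian_kernel(X[train], X[train], sigma)
alpha = np.linalg.solve(K + lam * np.eye(150), delta[train] - delta_mean)

# Prediction = cheap baseline + learned correction
correction = delta_mean + gaussian_kernel(X[test], X[train], sigma) @ alpha
predicted = baseline[test] + correction

mae_baseline = np.abs(reference[test] - baseline[test]).mean()
mae_delta = np.abs(reference[test] - predicted).mean()
print(f"baseline MAE = {mae_baseline:.2f} eV, delta-learning MAE = {mae_delta:.4f} eV")
```

Because the large systematic offset is carried by the baseline and the mean correction, the regression only has to capture the smaller, smoother residual. That is the usual motivation for Δ-learning.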
The implications of this research are substantial: it offers a route to efficient, high-throughput screening of materials by predicting XPS with minimal computational resources. In practice, the models are distributed in the freely available Python module cebeconf, which could support further work on XPS prediction.
In conclusion, this work exemplifies the fusion of chemistry, machine learning, and computational efficiency, aiming for rapid and accurate estimations of spectroscopic properties that could significantly aid experimental analysis and materials design. Future developments could include integrating more descriptors or extending this framework to larger and more complex molecular datasets, further bridging the gap between theoretical predictions and experimental spectroscopic analyses.