$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials

Published 20 Jun 2024 in physics.chem-ph, cs.LG, and stat.ML | (2406.14347v2)

Abstract: Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on nablaDFT. It contains twice as many molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.

Summary

The paper presents an expanded quantum chemistry dataset of drug-like compounds that doubles the number of molecules and triples the number of conformations relative to nablaDFT.
It uses DFT calculations at the $\omega$B97X-D/def2-SVP level to generate QC properties for every conformation, including energies, forces, and Hamiltonian matrices, for benchmarking.
Key results show that PhiSNet leads in Hamiltonian prediction and that SchNet is competitive in energy prediction, supporting AI-driven drug discovery.

Overview of $\nabla^2$DFT: A Quantum Chemistry Dataset for Drug-Like Molecules

The paper entitled "$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials" introduces a significant resource for computational quantum chemistry. The dataset, $\nabla^2$DFT, expands upon the existing nablaDFT dataset to provide a comprehensive foundation for training neural network potentials (NNPs) that model drug-like compounds.

Dataset and Benchmark Features

This work substantially extends the nablaDFT dataset, doubling the number of molecules and tripling the number of conformations to 1,936,929 molecules and 12,676,264 conformations. Each conformation comes with calculated quantum chemistry (QC) properties, including energies, forces, Hamiltonian and overlap matrices, and wavefunction objects. All calculations are performed with Density Functional Theory (DFT) at the $\omega$B97X-D/def2-SVP level, which offers a balance of accuracy and computational efficiency.

$\nabla^2$DFT is distinguished by its relaxation trajectories for drug-like molecules: it provides geometry optimization trajectories for 60,226 conformations. This is a pivotal addition, since geometry optimization is resource-intensive and critical in molecular modeling tasks. The dataset not only provides QC data but also acts as a benchmark for testing NNPs on three primary tasks: Hamiltonian prediction, energy and force prediction, and conformational optimization. The benchmark is complemented by an extendable framework that includes adaptations of 10 state-of-the-art models, supporting further research and model evaluation.

Significant Findings and Methodologies

Numerical results from the paper compare several neural network architectures on Hamiltonian matrix and molecular property prediction. PhiSNet emerges as the strongest model for Hamiltonian prediction, achieving a lower mean absolute error (MAE) than the other models across the test splits, despite its computational demands. The results also show that while some transformer-based models underperformed, message-passing neural networks (MPNNs) were generally more accurate, especially as the amount of training data increased.
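To make the MAE metric concrete, the following minimal sketch computes an element-wise mean absolute error between a predicted and a reference Hamiltonian matrix. The function name `hamiltonian_mae` and the toy 2x2 matrices are hypothetical, and the averaging convention (a plain mean over all matrix elements of one molecule) is an assumption; the paper's exact normalization may differ.

```python
def hamiltonian_mae(predicted, reference):
    """Mean absolute error over all elements of two equally shaped matrices.

    Assumption: a plain element-wise average for a single molecule.
    """
    if len(predicted) != len(reference):
        raise ValueError("matrices must have the same number of rows")
    total = 0.0
    count = 0
    for row_p, row_r in zip(predicted, reference):
        if len(row_p) != len(row_r):
            raise ValueError("matrices must have the same number of columns")
        for p, r in zip(row_p, row_r):
            total += abs(p - r)
            count += 1
    return total / count


# Toy symmetric matrices (made-up values, not taken from the dataset).
reference = [[-1.0, 0.2], [0.2, -0.5]]
predicted = [[-0.9, 0.25], [0.25, -0.5]]

mae = hamiltonian_mae(predicted, reference)
print(f"Hamiltonian MAE: {mae:.4f}")
```

Because Hamiltonian matrices grow with the number of basis functions, a per-element average like this keeps errors comparable across molecules of different sizes, which is one plausible reason to report it per element.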

For energy and force prediction, the models improve markedly as the training set grows. SchNet performed competitively in energy prediction. A critical observation, however, is that models with dedicated force prediction outputs excel in force estimation yet lag in energy prediction. The conformational optimization experiments also show that neural networks, especially when trained on optimization trajectories, can compete with traditional approaches such as the MMFF force field in RDKit and the semi-empirical xTB method.
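The link between energy prediction, force prediction, and conformational optimization can be illustrated with a small sketch. It assumes forces are taken as the negative gradient of the energy and relaxes a geometry by steepest descent. Everything here is hypothetical: `toy_energy` is a one-dimensional harmonic bond standing in for a trained NNP, and steepest descent is chosen only for brevity, not because the paper uses it.

```python
def toy_energy(r, k=1.0, r0=1.0):
    """Harmonic bond energy; a stand-in for an NNP energy prediction."""
    return 0.5 * k * (r - r0) ** 2


def force_from_energy(energy_fn, r, h=1e-5):
    """Force as the negative energy gradient, via central differences."""
    return -(energy_fn(r + h) - energy_fn(r - h)) / (2.0 * h)


def relax(energy_fn, r_start, step=0.1, f_tol=1e-4, max_steps=1000):
    """Steepest-descent relaxation that records (coordinate, energy) pairs."""
    r = r_start
    trajectory = [(r, energy_fn(r))]
    for _ in range(max_steps):
        f = force_from_energy(energy_fn, r)
        if abs(f) < f_tol:  # converged: residual force is small enough
            break
        r += step * f  # move along the force, i.e. downhill in energy
        trajectory.append((r, energy_fn(r)))
    return trajectory


trajectory = relax(toy_energy, r_start=1.5)
final_r, final_e = trajectory[-1]
print(f"steps: {len(trajectory) - 1}, final r: {final_r:.4f}, final E: {final_e:.2e}")
```

The recorded `trajectory` plays the role of a relaxation trajectory: a sequence of intermediate geometries with their energies. A model with a separate force output would replace `force_from_energy` with a direct prediction, which is not guaranteed to be consistent with the predicted energy.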

Implications and Future Directions

The $\nabla^2$DFT dataset addresses the pressing need for large, diverse data to train and refine NNPs, particularly within the pharmaceutical sector, where the modeling of drug-like molecules is paramount. This dataset could catalyze advancements in AI-driven drug discovery by enabling the development of more accurate and efficient predictive models for molecular properties. The paper suggests that future work could explore scaling neural wavefunction models and make further use of optimization trajectory data to enhance model performance. Additionally, including non-drug-like molecules and accounting for polar solvation effects could broaden the dataset's applicability.

Conclusion

Overall, $\nabla^2$DFT stands as a pivotal resource in the pursuit of scalability and precision in molecular simulations. The dataset's rich composition enhances the ability to train robust NNPs and offers a comprehensive platform for evaluating model performance, paving the way for future innovations in neural network potentials and computational quantum chemistry.
