Molecule3D: A Benchmark for Predicting 3D Geometries from Molecular Graphs (2110.01717v1)

Published 30 Sep 2021 in cs.LG and cs.AI

Abstract: Graph neural networks are emerging as promising methods for modeling molecular graphs, in which nodes and edges correspond to atoms and chemical bonds, respectively. Recent studies show that when 3D molecular geometries, such as bond lengths and angles, are available, molecular property prediction tasks can be made more accurate. However, computing of 3D molecular geometries requires quantum calculations that are computationally prohibitive. For example, accurate calculation of 3D geometries of a small molecule requires hours of computing time using density functional theory (DFT). Here, we propose to predict the ground-state 3D geometries from molecular graphs using machine learning methods. To make this feasible, we develop a benchmark, known as Molecule3D, that includes a dataset with precise ground-state geometries of approximately 4 million molecules derived from DFT. We also provide a set of software tools for data processing, splitting, training, and evaluation, etc. Specifically, we propose to assess the error and validity of predicted geometries using four metrics. We implement two baseline methods that either predict the pairwise distance between atoms or atom coordinates in 3D space. Experimental results show that, compared with generating 3D geometries with RDKit, our method can achieve comparable prediction accuracy but with much smaller computational costs. Our Molecule3D is available as a module of the MoleculeX software library (https://github.com/divelab/MoleculeX).

Authors (10)

Zhao Xu
Youzhi Luo
Xuan Zhang
Xinyi Xu
Yaochen Xie
Meng Liu
Kaleb Dickerson
Cheng Deng
Maho Nakata
Shuiwang Ji

Summary

The paper introduces Molecule3D, a benchmark for predicting ground-state 3D molecular geometries from molecular graphs with machine learning, built on a dataset of approximately 4 million molecules with DFT-derived geometries.
It presents two baseline models built on DeeperGCN-DAGNN that achieve accuracy comparable to RDKit ETKDG at a much lower computational cost.
The study highlights practical implications for accelerating molecular simulations and advancing applications in drug discovery and materials science.

Insights on Molecule3D: A Benchmark for Predicting 3D Geometries from Molecular Graphs

The paper Molecule3D: A Benchmark for Predicting 3D Geometries from Molecular Graphs addresses molecular geometry prediction with graph neural networks (GNNs). By establishing a benchmark, Molecule3D, the authors aim to enable machine learning models to predict ground-state 3D geometries directly from molecular graphs, avoiding the prohibitive computational cost of quantum calculations such as density functional theory (DFT).

The Significance of Molecule3D

Molecule3D marks a shift toward using machine learning to predict 3D molecular structures. Its dataset comprises approximately 4 million molecules from PubChemQC with DFT-derived geometries, a scale that supports systematic evaluation and development of machine learning models for this task. The focus on ground-state geometries matters because they are the stable, energy-minimized conformations of molecules, which are central to applications such as molecular dynamics, biological activity prediction, and ligand design.

Methodology and Baseline Methods

Two baseline methods are proposed, both built on the DeeperGCN-DAGNN model but differing in their prediction target: one predicts pairwise distances between atoms, and the other predicts atom coordinates in 3D space directly. Four metrics (MAE, RMSE, and two validity scores) assess predicted geometries in terms of both error and geometric validity. Notably, the proposed methods achieve accuracy comparable to the traditional RDKit ETKDG algorithm at significantly lower computational cost.
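The two error metrics can be illustrated with a minimal sketch. It assumes that MAE and RMSE are computed over pairwise interatomic distances, which makes them insensitive to rotation and translation of the molecule; the function names and the toy coordinates are hypothetical and are not taken from the MoleculeX code.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Return the distances d_ij for all atom pairs i < j."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def distance_errors(pred_coords, ref_coords):
    """MAE and RMSE between predicted and reference pairwise distances."""
    pred = pairwise_distances(pred_coords)
    ref = pairwise_distances(ref_coords)
    diffs = [p - r for p, r in zip(pred, ref)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Toy three-atom example (made-up coordinates, for illustration only).
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.0, 0.0)]
mae, rmse = distance_errors(pred, ref)
```

A model that predicts coordinates can be scored with the same function, since its coordinates are first converted to distances; a model that predicts distances directly would skip the conversion step.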

Results and Discussion

The reported results show that the deep learning baselines rival and occasionally surpass the traditional method in prediction accuracy. Under random splits in particular, they yield smaller MAE and RMSE values than RDKit ETKDG. Scaffold splits, which separate training and test molecules by core structure, remain challenging, and further advances in model architecture are needed to generalize out of distribution.
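A scaffold split keeps all molecules that share a core structure in the same subset, which is why it is harder than a random split. The following is a minimal sketch of the idea, assuming scaffold keys have already been computed for each molecule (the strings used here are hypothetical placeholders) and assuming a 60/20/20 ratio; the benchmark's own splitting tools may work differently.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.6, frac_valid=0.2):
    """Split molecule indices so that no scaffold spans two subsets.

    `scaffolds` holds one scaffold key per molecule (hypothetical input;
    in practice these might be scaffold SMILES from a cheminformatics
    toolkit).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    # Largest scaffold groups first, so rare scaffolds land in valid/test.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Ten molecules over four made-up scaffold keys.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
```

Because whole groups are assigned at once, the test set contains only scaffolds the model never saw during training.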

An important insight is the trade-off between prediction error and geometric validity, measured in terms of Euclidean distance matrices (EDMs); future work might focus on integrated approaches that balance the two. The reduction in computational time is substantial: predicting geometries for the entire test set takes 25 to 45 minutes, which underscores the practical applicability of the proposed models for accelerating molecular simulations.
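To make the notion of EDM validity concrete, here is a minimal sketch of one standard test. It assumes the criterion is the classical one: a distance matrix is a valid EDM when its double-centered Gram matrix is positive semidefinite, and it can be realized in 3D when that Gram matrix also has rank at most 3. Whether the paper's two validity scores are defined exactly this way is an assumption, and all function names are hypothetical.

```python
import math

def gram_from_distances(dist):
    """Double-center the squared distances: G = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def psd_and_rank(gram, rel_tol=1e-8):
    """Return (is_psd, rank) via Cholesky-style elimination with pivoting."""
    n = len(gram)
    a = [row[:] for row in gram]
    tol = rel_tol * max(1.0, max(abs(a[i][i]) for i in range(n)))
    remaining = list(range(n))
    rank = 0
    while remaining:
        p = max(remaining, key=lambda i: a[i][i])
        if a[p][p] <= tol:
            # No positive pivot left: PSD only if the residual block is ~0.
            ok = all(abs(a[i][j]) <= tol for i in remaining for j in remaining)
            return ok, rank
        remaining.remove(p)
        for i in remaining:
            factor = a[i][p] / a[p][p]
            for j in remaining:
                a[i][j] -= factor * a[p][j]
        rank += 1
    return True, rank

def check_validity(dist):
    """Return (is a valid EDM, is a valid EDM realizable in 3D)."""
    is_psd, rank = psd_and_rank(gram_from_distances(dist))
    return is_psd, is_psd and rank <= 3

# Distances taken from actual 3D points pass both checks.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
is_edm, fits_3d = check_validity(dist)
```

A model that predicts distances independently for each atom pair has no guarantee of passing either check, whereas distances derived from predicted coordinates pass by construction, which is one way to read the trade-off described above.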

Implications and Future Directions

The implications of Molecule3D are both theoretical and practical. Theoretically, it challenges the conventional preference for physics-based computation in molecular geometry determination and promotes machine learning as a viable alternative. Practically, the efficiency gains could benefit fields that require rapid and accurate molecular simulations, including drug discovery, materials science, and quantum chemistry.

The authors propose several directions for future research, including novel models that predict molecular geometries with both high geometric validity and low prediction error. Expanding the dataset to cover a broader range of molecules and pre-training on datasets of geometries optimized with other methods (e.g., PM6) are also anticipated. Further innovation in metrics, such as incorporating bond angles and dihedral angles, could refine evaluation criteria and guide subsequent model iterations.

In conclusion, the development of Molecule3D represents a significant advancement in molecular simulations through machine learning, underscoring the evolving capabilities and efficiencies of predictive models in computational chemistry. As the research community confronts the challenges associated with this nascent approach, the prospects for broader applications and enhanced simulation methodologies appear promising.
