GEOM: Energy-annotated molecular conformations for property prediction and molecular generation (2006.05531v4)

Published 9 Jun 2020 in physics.comp-ph and cs.LG

Abstract: Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.

Authors (2)
  1. Simon Axelrod (12 papers)
  2. Rafael Gomez-Bombarelli (50 papers)
Citations (178)

Summary

  • The paper introduces GEOM, a comprehensive dataset of 37 million energy-annotated 3D conformers from over 450,000 molecules aimed at improving property prediction and molecular generation.
  • It employs advanced sampling with CREST and, for the subset of species with BACE-1 inhibition data, DFT-based refinement via CENSO to capture the ensemble behavior of molecules more accurately.
  • The dataset enables machine learning models to account for molecular flexibility, enhancing applications in drug design, material discovery, and theoretical chemistry.

An Essay on the GEOM Dataset: Enhancing Molecular Conformation Predictions in Computational Chemistry

The paper "GEOM: Energy-annotated molecular conformations for property prediction and molecular generation" introduces the Geometric Ensemble Of Molecules (GEOM), a dataset built to support computational approaches to molecular property prediction and molecular generation. It addresses a gap in existing molecular datasets by offering energy-annotated 3D conformer ensembles at large scale. The dataset is expected to advance machine learning in molecular design, particularly by bringing in the often-neglected conformational flexibility of molecules.

Background and Significance

Traditional molecular representations, such as 2D chemical graphs and single 3D structures, overlook the dynamic nature of molecules. At finite temperature, a molecule exists as an ensemble of conformers spread across its potential energy surface. Understanding and predicting molecular properties therefore requires the ensemble rather than a single static structure, because conformational flexibility affects molecular interactions and reactivity.

Machine learning is increasingly used to predict molecular properties and stands to benefit from datasets that account for this flexibility. GEOM fills this need by using advanced sampling and semi-empirical DFT to provide 37 million conformers for more than 450,000 molecules. It covers 133,000 species from QM9 and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. The ensembles of 1,511 species with BACE-1 inhibition data are additionally labeled with high-quality DFT free energies in an implicit water solvent, and 534 of those ensembles are further optimized with DFT.

Dataset and Methodology

GEOM is distinctive in its scale, the number of conformers per species, the quality of its conformational data, and its pairing of conformer ensembles with experimental data. The dataset allows benchmarking of machine learning models that take conformer ensembles as input to predict experimental properties. This contrasts with existing approaches, which typically rely on a 2D graph or a single 3D structure and therefore cannot represent the full set of conformers accessible to a molecule.
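One simple way a model could consume a conformer ensemble is to average a per-conformer feature vector using each conformer's statistical weight. The sketch below illustrates that idea only. It is not the paper's method, and the function name `ensemble_average` and the plain-list inputs are hypothetical and do not reflect GEOM's actual file format.

```python
def ensemble_average(conformer_features, weights):
    """Weighted average of per-conformer feature vectors.

    conformer_features: list of equal-length lists, one per conformer
                        (hypothetical descriptors, e.g. computed per geometry).
    weights:            one non-negative statistical weight per conformer.
    """
    if not conformer_features:
        raise ValueError("at least one conformer is required")
    if len(conformer_features) != len(weights):
        raise ValueError("need exactly one weight per conformer")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")

    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must not all be zero")

    n_features = len(conformer_features[0])
    if any(len(f) != n_features for f in conformer_features):
        raise ValueError("all feature vectors must have the same length")

    # Normalize so the weights sum to one, then accumulate the weighted sum.
    averaged = [0.0] * n_features
    for features, weight in zip(conformer_features, weights):
        p = weight / total
        for i, value in enumerate(features):
            averaged[i] += p * value
    return averaged


# Two conformers with two made-up descriptors each, equally weighted.
pooled = ensemble_average([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
print(pooled)
```

A learned model would typically replace the fixed average with a trainable pooling step, but the weighted mean is a reasonable baseline against which to compare single-conformer inputs.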

The dataset is generated with the CREST software, which samples conformational space efficiently using metadynamics simulations driven by semi-empirical tight-binding DFT calculations. CREST explores bond rotations and other low-energy rearrangements, producing a set of structures that approximates the ensemble of thermally accessible conformations.

For a subset of species, the CREST-generated conformers are further refined with the CENSO program, which uses more accurate levels of DFT for the energies and accounts for rotational, vibrational, and translational contributions to the free energy. Per the abstract, this refinement covers the 1,511 species with BACE-1 inhibition data, and 534 of those ensembles are also re-optimized with DFT. These refinements bring the statistical weights of the conformers closer to their true thermodynamic weights, providing a reliable benchmark for models based on conformational ensembles.
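The statistical weight of a conformer follows from its energy through the Boltzmann distribution, which is why more accurate energies give more accurate weights. The sketch below computes normalized Boltzmann weights from relative energies. It assumes energies in kcal/mol and a temperature of 298.15 K, and it ignores conformer degeneracy. The function name and example energies are illustrative and are not taken from the dataset.

```python
import math

# Molar gas constant in kcal/(mol*K), used as kT per mole of molecules.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal_per_mol, temperature_k=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies.

    Energies may be absolute or relative; only differences matter.
    """
    if not energies_kcal_per_mol:
        raise ValueError("at least one energy is required")
    if temperature_k <= 0:
        raise ValueError("temperature must be positive")

    kt = R_KCAL_PER_MOL_K * temperature_k
    # Shift by the minimum energy so the exponentials stay well-conditioned.
    e_min = min(energies_kcal_per_mol)
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal_per_mol]
    partition = sum(factors)
    return [f / partition for f in factors]


# Three made-up conformers at 0.0, 0.5, and 2.0 kcal/mol above the minimum.
weights = boltzmann_weights([0.0, 0.5, 2.0])
for energy, weight in zip([0.0, 0.5, 2.0], weights):
    print(f"{energy:4.1f} kcal/mol -> weight {weight:.3f}")
```

Because the weights depend exponentially on energy, an error of about 0.6 kcal/mol (roughly kT at room temperature) changes a conformer's relative population by a factor of about e, which illustrates why refining energies beyond the semi-empirical level matters.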

Implications and Future Directions

The GEOM dataset has both practical and theoretical implications for computational chemistry and machine learning. Practically, it provides a foundation for machine learning models that predict molecular properties from conformer ensembles, supporting tasks such as drug design and materials discovery. Theoretically, it highlights the importance of molecular flexibility and encourages the development of algorithms that incorporate conformational ensemble data.

Furthermore, GEOM facilitates the training of generative models that predict conformers from molecular graphs, an area of growing interest because exhaustive torsional sampling is computationally expensive. As such models improve, accurate conformer prediction from molecular graphs may become routine, enabling faster and more efficient molecular screening and optimization.

Future developments in AI could leverage datasets like GEOM to more precisely navigate chemical space, leading to significant breakthroughs in molecular design and discovery. With an emphasis on improving conformational ensemble predictions, GEOM serves as both a benchmark and a resource for continued innovation in computational approaches to molecular science.