GEOM-Drugs Benchmark

Updated 2 July 2025

GEOM-Drugs Benchmark is a chemically rigorous platform for evaluating 3D molecular structure generation and drug discovery methods.
It integrates over 37 million quantum chemistry-annotated 3D conformers with refined evaluation metrics for stability, geometry, and energy.
Its open-source protocols and strict data curation ensure reproducible, chemically accurate comparisons across generative modeling approaches.

The GEOM-Drugs Benchmark is a widely adopted, chemically and computationally rigorous platform for assessing machine learning and generative modeling methods in 3D molecular structure generation, property prediction, and drug discovery. It encompasses protocols, evaluation metrics, and datasets that collectively define current standards for benchmarking in this field.

1. Origins and Composition of GEOM-Drugs

The benchmark originated with the construction of the GEOM dataset, introduced in "Energy-annotated molecular conformations for property prediction and molecular generation" (2006.05531), which provides over 37 million quantum chemistry-annotated 3D conformers for more than 450,000 organic and drug-like molecules. Key features include:

Molecular Scope: Includes drug-like molecules drawn from QM9, MoleculeNet subsets, the AICures initiative, and a range of pharmaceutical-relevant datasets, with sizes ranging from small (≤9 heavy atoms, QM9) up to very large (up to 181 atoms, AICures drugs).
Conformer Ensembles: Each molecule is associated with an ensemble of 3D conformers generated with the CREST algorithm (metadynamics with GFN2-xTB optimization) and, for high-value subsets, further refined using DFT and the CENSO protocol.
Property Annotation: Datasets include energies, Boltzmann statistical weights for conformers, dipole moments, molecular orbital data, and extensive experimental property labels (e.g., BACE-1 inhibition, solubility, toxicity, CNS penetration).
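
As an illustration of the Boltzmann statistical weights mentioned above, the following minimal sketch converts relative conformer energies into normalized populations. It assumes energies in kcal/mol, a temperature of 298.15 K, and no conformer degeneracy factors; the function name and the example energies are hypothetical and are not taken from the GEOM data files.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies.

    Assumption of this sketch: every conformer has degeneracy 1.
    """
    kt = R_KCAL * temperature
    # Shift by the minimum energy so the exponentials stay well-conditioned.
    e_min = min(energies_kcal)
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


# Hypothetical ensemble: three conformers at 0.0, 0.5 and 2.0 kcal/mol.
weights = boltzmann_weights([0.0, 0.5, 2.0])
```

Because only energy differences enter the exponent, adding a constant offset to every energy leaves the weights unchanged, which is why relative energies are sufficient.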

This dataset and its downstream benchmarking tasks underpin many subsequent methodological advances and critical evaluations in the field of molecular generative modeling and drug design.

2. Technical Evolution and Evaluation Metrics

2.1. Original Evaluation Practices

Early benchmark protocols for generative 3D models on GEOM-Drugs measured several core metrics:

Molecular Stability: Based on atom valency; a molecule is deemed "stable" if all atom valencies conform to chemically accepted rules.
Geometric Validity: Assessed by the fraction of generated molecules whose 3D coordinates match plausible chemical geometries.
Novelty, Uniqueness, and Diversity: Computed using substructure (e.g., Tanimoto distance of fingerprints), scaffold, and geometric similarity indices.
Energy-based Relaxation: Comparison between generated and reference conformer energies (often using force fields such as MMFF94 or quantum methods).
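
The valency-based stability criterion in the list above can be sketched as follows. This is a simplified illustration, not the code of any published benchmark: the allowed-valency table is a deliberately small, hypothetical one for neutral atoms, and the molecule representation (element list plus `(i, j, order)` bond tuples) is an assumption of the sketch.

```python
from collections import defaultdict

# Hypothetical, deliberately small table of allowed valencies for neutral atoms.
ALLOWED_VALENCIES = {"H": {1}, "C": {4}, "N": {3}, "O": {2}, "F": {1}}


def atom_valencies(n_atoms, bonds):
    """Sum bond orders per atom. `bonds` is a list of (i, j, order) tuples."""
    valency = defaultdict(float)
    for i, j, order in bonds:
        valency[i] += order
        valency[j] += order
    return [valency[i] for i in range(n_atoms)]


def is_stable(elements, bonds, table=ALLOWED_VALENCIES):
    """A molecule counts as stable only if every atom has an allowed valency."""
    valencies = atom_valencies(len(elements), bonds)
    return all(v in table.get(el, set()) for el, v in zip(elements, valencies))


def stability_rate(molecules):
    """Fraction of (elements, bonds) pairs that pass the per-atom check."""
    return sum(is_stable(el, b) for el, b in molecules) / len(molecules)


# Methanol (CH3OH): atom 0 is C, atom 1 is O, atoms 2-5 are H.
methanol = (
    ["C", "O", "H", "H", "H", "H"],
    [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (1, 5, 1)],
)
# Same atoms with the C-O bond removed, leaving C and O under-coordinated.
broken = (methanol[0], methanol[1][1:])
```

In this sketch a single atom with a disallowed valency marks the whole molecule as unstable, so the result depends entirely on the contents of the lookup table and on how bond orders are counted.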

However, critical flaws in these original metrics and their implementation were later identified (see Sections 2.2 and 3).

2.2. Corrected Evaluation Protocols

A pivotal update was provided by "GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation" (2505.00169), which exposed bugs and inconsistencies in prevalent protocols:

Valency Calculations: Prior definitions counted aromatic bonds as single instead of 1.5; lookup tables inadvertently allowed nonphysical atom environments, inflating stability metrics.
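Bug Fixes: The empirical valency tables were reformulated so that aromatic systems are described by the tuple $(n_{\text{arom}}, v_{\text{other}})$, which distinguishes atom environments by their number of aromatic bonds and their remaining valency (see Figure 1 and SI Table 3 of that paper).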
3D Geometric Comparison: Replaced atom–atom distance lookups (which strongly misclassified GFN2-xTB geometries) with direct, bond-centric deviations between generated and reference structures:

$\Delta r_{ij} = | r_{ij}^{\text{init}} - r_{ij}^{\text{opt}} |$
$\Delta \theta_{ijk} = \min \left( |\theta_{ijk}^{\text{init}} - \theta_{ijk}^{\text{opt}}|, 180^\circ - |\theta_{ijk}^{\text{init}} - \theta_{ijk}^{\text{opt}}| \right)$
$\Delta \phi_{ijkl} = \min \left( |\phi_{ijkl}^{\text{init}} - \phi_{ijkl}^{\text{opt}}|, 360^\circ - |\phi_{ijkl}^{\text{init}} - \phi_{ijkl}^{\text{opt}}| \right)$

Here the superscript init marks the generated structure and opt the GFN2-xTB-optimized structure it is compared against.

Energy Metrics: Use GFN2-xTB for all energy relaxation benchmarking to maintain level-of-theory consistency:

$\Delta E_\text{relax} = E^{\text{relaxed}}_\text{GFN2-xTB} - E^{\text{generated}}_\text{GFN2-xTB}$

Strict Data Curation: Removing artifacts, such as molecules with broken rings or abnormal valencies, to avoid spurious results.

These measures yield chemically and physically meaningful metrics, enabling fair cross-model comparison and preventing the overestimation of model performance.
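
The bond, angle, and torsion deviations defined in this section can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions: coordinates are lists of `(x, y, z)` tuples, the two structures share atom ordering, and all function names (including the `torsion_probe` helper used to build test geometries) are hypothetical. The angle and torsion wrap-arounds follow the formulas as written above.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_length(xyz, i, j):
    return math.dist(xyz[i], xyz[j])


def bond_angle(xyz, i, j, k):
    """Angle i-j-k in degrees, in [0, 180]."""
    u, v = _sub(xyz[i], xyz[j]), _sub(xyz[k], xyz[j])
    cos = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def dihedral(xyz, i, j, k, l):
    """Signed torsion i-j-k-l in degrees, in [-180, 180]."""
    b1, b2, b3 = _sub(xyz[j], xyz[i]), _sub(xyz[k], xyz[j]), _sub(xyz[l], xyz[k])
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    norm_b2 = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(c / norm_b2 for c in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))


def delta_r(init, opt, i, j):
    return abs(bond_length(init, i, j) - bond_length(opt, i, j))


def delta_theta(init, opt, i, j, k):
    d = abs(bond_angle(init, i, j, k) - bond_angle(opt, i, j, k))
    return min(d, 180.0 - d)


def delta_phi(init, opt, i, j, k, l):
    # Torsions are periodic, so 170 and -170 degrees differ by 20, not 340.
    d = abs(dihedral(init, i, j, k, l) - dihedral(opt, i, j, k, l))
    return min(d, 360.0 - d)


def torsion_probe(phi_deg):
    """Four points whose torsion about the z axis has magnitude |phi_deg|."""
    phi = math.radians(phi_deg)
    return [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
            (math.cos(phi), math.sin(phi), 1.0)]
```

In practice the index tuples would be enumerated from the bond graph (bonded pairs, bonded triples, bonded quadruples), and the per-term deviations would then be aggregated over a set of generated molecules; that bookkeeping is omitted here.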

3. Impact and Findings from Recent Benchmarks

3.1. Model Performance under Corrected Evaluation

Retraining and retesting leading generative models (including EQGAT-Diff, JODO, Megalodon, and SemlaFlow) on the cleaned, kekulized GEOM-Drugs dataset with the updated metrics produced several notable results:

Molecular Stability Scores: Scores dropped by up to 50% once the valency bugs were corrected, but recovered to within 3% of the values obtained for reference molecules when the refined valency table was used.
3D Accuracy: Recent diffusion-based generative models (e.g., JODO) achieved lower geometric and energy deviations from GFN2-xTB geometries than the MMFF94 force field does, rendering MMFF94-based evaluation effectively obsolete for this benchmark.
Data Preprocessing: Training on kekulized molecules (to resolve aromatic ambiguities) further improved stability by ~5% for most models.
Model Ranking Consistency: The relative ordering of models persisted under the revised framework, while the absolute scores became chemically meaningful.

3.2. Empirical Recommendations

The benchmark protocols now recommend:

Using dynamically learned or empirical valency tables derived from high-quality chemical datasets.
Avoiding simple atom–atom distance tables for stability and 3D geometry assessment, in favor of bond- and angle-based metrics compared directly to GFN2-xTB reference structures.
Performing all relaxation and energy calculations with a single, consistent quantum chemical method (GFN2-xTB) rather than a force field, to prevent misalignment with the reference data.

Complete code and scripts for dataset processing and evaluation are openly accessible (see https://github.com/isayevlab/geom-drugs-3dgen-evaluation).
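
To make the first recommendation concrete, the sketch below derives an empirical valency table from a set of reference molecules, describing each atom by the tuple $(n_{\text{arom}}, v_{\text{other}})$. It is not the code from the repository above. Keying the table by element and formal charge, the `AROMATIC` bond marker, the `min_count` threshold, and the molecule representation are all assumptions made for illustration.

```python
from collections import Counter

AROMATIC = "ar"  # hypothetical marker for an aromatic bond in this sketch


def atom_environments(elements, charges, bonds):
    """Yield (element, charge, (n_arom, v_other)) for every atom.

    `bonds` holds (i, j, order) tuples with order 1, 2, 3 or AROMATIC.
    """
    n_arom = [0] * len(elements)
    v_other = [0] * len(elements)
    for i, j, order in bonds:
        for atom in (i, j):
            if order == AROMATIC:
                n_arom[atom] += 1
            else:
                v_other[atom] += order
    for el, q, na, vo in zip(elements, charges, n_arom, v_other):
        yield el, q, (na, vo)


def learn_valency_table(reference_molecules, min_count=1):
    """Collect every environment seen at least `min_count` times."""
    counts = Counter()
    for elements, charges, bonds in reference_molecules:
        counts.update(atom_environments(elements, charges, bonds))
    table = {}
    for (el, q, env), n in counts.items():
        if n >= min_count:
            table.setdefault((el, q), set()).add(env)
    return table


def is_stable(elements, charges, bonds, table):
    """Stable only if every atom environment appears in the learned table."""
    return all(env in table.get((el, q), set())
               for el, q, env in atom_environments(elements, charges, bonds))


def benzene():
    """Benzene with six aromatic ring bonds and six single C-H bonds."""
    elements = ["C"] * 6 + ["H"] * 6
    charges = [0] * 12
    bonds = [(i, (i + 1) % 6, AROMATIC) for i in range(6)]
    bonds += [(i, i + 6, 1) for i in range(6)]
    return elements, charges, bonds


table = learn_valency_table([benzene()])
```

Counting aromatic bonds separately, instead of folding them into a single summed valency, lets the table accept an aromatic carbon with environment `(2, 1)` while rejecting the same carbon carrying an extra substituent. A higher `min_count` would be one way to keep rare, possibly erroneous environments in the reference data out of the table.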

4. Community Practices and Reproducibility

The open-source release of all processing scripts, evaluation tools, kekulization utilities, valency calculators, and dataset splits has fostered a reproducible and standardized benchmarking environment. Researchers can now:

Apply chemically accurate stability and geometry assessment across generative models.
Consistently preprocess molecules to avoid ambiguous bond representations.
Transparently report and compare results, with access to full validation protocols.

This mitigates prior inconsistencies, supports fair competition, and accelerates method development.

5. Significance for Drug Discovery and Model Development

The GEOM-Drugs Benchmark, with its corrected evaluation protocol, serves as the de facto standard for:

Assessment of Generative Models: Especially in unconditional 3D molecule generation for drug-like compounds at realistic size scales (up to 181 atoms).
Development of Chemically-Aware Models: Encourages integrating chemical knowledge and quantum reference data into training and evaluation.
Progress Tracking: Provides reliable, chemically meaningful progress measurements, necessary for industrial and academic adoption.

Table: Key Metric Definitions in the Updated GEOM-Drugs Benchmark

Metric	Corrected formulation (per 2505.00169)	Comment
Molecular stability	Empirical, chemically validated valency tables	Includes aromatic bond environments
3D geometry	$\Delta r_{ij}$, $\Delta \theta_{ijk}$, $\Delta \phi_{ijkl}$ against GFN2-xTB	Direct bond/angle/torsion comparison
Energy	$\Delta E_\text{relax}$ (GFN2-xTB)	Avoids force field comparison
Data cleaning	Filter out molecules with broken rings or abnormal valencies	Keeps the benchmark data chemically plausible

6. Ongoing Challenges and Future Directions

While the GEOM-Drugs Benchmark now provides a robust and chemically consistent platform, several open challenges persist:

Extending valency and stability rules to exotic or transition-metal-containing compounds remains non-trivial.
Efficient, high-throughput evaluation for even larger molecules or combinatorial libraries demands further algorithmic innovation.
The benchmark currently focuses on drug-like organic molecules; adaptations may be needed for broader chemical spaces.

A plausible implication is that future benchmarks may incorporate dynamic learning of chemical rules from ever-larger and more diverse datasets, further integrate quantum mechanical validation, and extend into multi-property or multi-objective settings as model complexity and scope grow.

The GEOM-Drugs Benchmark, through continuous correction, open-source evaluation code, and rigorous chemical validation, now provides a robust infrastructure for the quantitative evaluation and development of deep generative and property prediction models for molecular drug design. It is maintained as an evolving community resource to ensure ongoing methodological progress is grounded in chemically sound assessment.