OGBG-MolHIV: Benchmark for HIV Drug Discovery
- OGBG-MolHIV is a standardized molecular graph dataset with 41,127 molecules used for predicting HIV-1 inhibition.
- It employs a Bemis–Murcko scaffold split to rigorously test out-of-distribution generalization and ensure reproducibility.
- The benchmark has driven advances in GNN architectures and ensemble methods, with reported test ROC-AUC rising from about 77% for the best OGB baseline to about 85% for a blended RF–GNN ensemble.
The OGB molecular dataset ogbg-molhiv is a standardized benchmark for evaluating graph machine learning methods on molecular property prediction, specifically HIV-1 inhibition. As one of the flagship datasets of the Open Graph Benchmark (OGB), it has driven advances in both graph neural network (GNN) architectures and classical machine learning approaches, and it supports out-of-distribution generalization analysis grounded in drug discovery. It is widely used as a testbed for algorithmic innovation, robust evaluation methodology, and reproducibility in molecular ML systems (Hu et al., 2020).
1. Dataset Structure and Chemical Features
The ogbg-molhiv dataset consists of 41,127 small-molecule graphs derived from the MoleculeNet HIV dataset, with approximately 3.5–7% labeled as “active” inhibitors of HIV replication (Bugaud, 21 Mar 2026, Hu et al., 2020). Each molecule is represented as a graph, where:
- Nodes: Atoms, each with a 9-dimensional feature vector including atomic number, chirality tag, degree, formal charge, hybridization, aromaticity, and ring membership.
- Edges: Covalent bonds, with 3-dimensional bond features: bond type (single, double, triple, aromatic), stereo configuration (E/Z, cis/trans), and conjugation status.
Graph statistics indicate a mean of 25.5 atoms and 27.5 bonds per molecule, with a distribution spanning from ~10 to ~50 atoms and a long tail up to ~100, reflecting the structural diversity of drug-like compounds.
2. Split Strategy and Evaluation Protocol
OGB enforces a Bemis–Murcko scaffold split, partitioning molecules by their core chemical scaffolds to assess model generalization on novel chemical structures (Hu et al., 2020). The canonical split proportions are:
- Train: 32,901 molecules (~80%)
- Validation: 4,113 molecules (10%)
- Test: 4,113 molecules (10%)
This split ensures that the validation and test sets contain scaffolds unseen during training, so a model cannot score well simply by memorizing training scaffolds, and the benchmark emphasizes out-of-distribution robustness.
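The grouping logic behind a scaffold split can be sketched in a few lines. The version below is a minimal illustration, not OGB's exact procedure: it assumes each molecule already has a scaffold key (for example a Murcko scaffold SMILES computed with RDKit), fills the training set with the largest scaffold groups first, and uses hypothetical toy labels in place of real scaffolds. The function name `scaffold_split` and its default fractions are choices made for this example.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test (illustrative sketch)."""
    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index so the result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy input: hypothetical scaffold labels, not real molecules.
toy_scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
```

Because groups are assigned as a whole, no scaffold can appear in more than one split, which is the property the benchmark relies on.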
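The primary evaluation metric is the area under the receiver operating characteristic curve (ROC-AUC), which can be computed in a pairwise manner:

$$\mathrm{AUC} = \frac{1}{|\mathcal{P}|\,|\mathcal{N}|} \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}} \mathbb{1}\left[ s(x_i) > s(x_j) \right],$$

where $\mathcal{P}$, $\mathcal{N}$ denote positive and negative classes, respectively, and $s(\cdot)$ provides the predicted score for each example.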
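The pairwise view of ROC-AUC translates directly into code: count the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below is a plain-Python illustration with made-up labels and scores; counting ties as half a win is a common convention assumed here, and the quadratic loop is meant for clarity rather than speed.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")

    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # assumed tie convention: half credit
    return wins / (len(pos) * len(neg))


# Made-up example: 2 positives, 3 negatives, so 6 pairs in total.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
auc = pairwise_roc_auc(labels, scores)
```

In this example five of the six pairs are ordered correctly; the only miss is the positive scored 0.4 against the negative scored 0.8.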
3. Baseline and Advanced Modeling Methodologies
OGB Baselines
Four core GNN architectures were benchmarked as baselines (Hu et al., 2020):
- GCN
- GCN with a virtual node (GCN+VN)
- GIN
- GIN with a virtual node (GIN+VN)
All use 5 message-passing layers, hidden dimension 300, mean-pooling, and tuned dropout (0.0 or 0.5). Performance under scaffold split:
- Best OGB Baseline (GIN+VN+features): Test ROC-AUC = 77.07 ± 1.49%
- Richer atom/bond features and virtual-node augmentation provide 1–2% and ~1% lifts, respectively.
- Scaffold splitting reduces test AUC by ~5 points versus random splitting, underscoring the difficulty of generalizing beyond the training scaffolds.
Multi-RF Fusion with Multi-GNN Blending
The current SOTA ensembles 12 Random Forest (RF) classifiers trained on concatenated molecular fingerprints and blends them with deep-ensembled GNN predictions (Bugaud, 21 Mar 2026):
- Fingerprints: “fcfp_ext” vector, 4,263-dimensional, is the concatenation of FCFP (radius 2, 3), ECFP (radius 2), MACCS keys, and hashed atom pairs.
- RF Configurations: 12 models, sweeping max_features across {0.18, 0.20, 0.22}; the best value is 0.20, yielding an AUC gain of +0.008 over the default setting.
- GNN Ensemble: Two architectures (GIN-VN 5-layer/300-d, GIN-VN-deep 8-layer/256-d), each trained on 10 seeds. Averaging predictions across seeds within each architecture mitigates seed variance.
- Blending: Final prediction is a weighted rank average, $\hat{y} = \sum_{m} w_m \, \mathrm{rank}(p_m)$, over the RF ensemble and the two GNN ensembles; the full blend assigns each GNN architecture a weight of 6%.
- Performance: Test ROC-AUC = 0.8476 ± 0.0002, surpassing previous methods; deep GNN ensembling reduces the standard deviation from 0.0008 to 0.0002.
| Configuration | ROC-AUC | Δ vs. full blend |
|---|---|---|
| RF + 2 GNNs @6% each (full blend) | 0.8476 ± 0.0002 | reference |
| Single GNN (GIN-VN @12%) | 0.8475 ± 0.0002 | -0.0001 |
| RF only (no GNN) | 0.8461 ± 0.0002 | -0.0015 |
| Per-seed GNN @7% (no deep ensemble) | 0.8467 ± 0.0011 | -0.0009 |
| Default max_features | 0.8396 ± 0.0005 | -0.0080 |
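A weighted rank average can be sketched as follows: each model's scores are converted to ranks scaled to [0, 1], and the ranks are combined with fixed weights. This is an illustrative sketch, not the cited pipeline. The model names, the toy scores, and the RF weight of 0.88 (taken here as the remainder after 6% per GNN) are assumptions made for the example.

```python
def average_ranks(scores):
    """Ranks scaled to [0, 1]; tied scores share their mean rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    denom = max(n - 1, 1)
    return [r / denom for r in ranks]


def rank_blend(predictions, weights):
    """Weighted average of per-model ranks; weights are normalized to sum to 1."""
    total = sum(weights[name] for name in predictions)
    n = len(next(iter(predictions.values())))
    blended = [0.0] * n
    for name, scores in predictions.items():
        w = weights[name] / total
        for i, r in enumerate(average_ranks(scores)):
            blended[i] += w * r
    return blended


# Hypothetical scores for four molecules from three model ensembles.
preds = {
    "rf": [0.10, 0.80, 0.40, 0.60],
    "gin_vn": [0.20, 0.70, 0.90, 0.30],
    "gin_vn_deep": [0.05, 0.95, 0.50, 0.45],
}
blend_weights = {"rf": 0.88, "gin_vn": 0.06, "gin_vn_deep": 0.06}  # assumed split
blended = rank_blend(preds, blend_weights)
```

Working on ranks rather than raw scores makes the blend insensitive to each model's score calibration, which matters when mixing RF vote fractions with GNN outputs; ROC-AUC itself depends only on the ordering.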
4. Self-Supervised Pre-training and Motif-Driven Contrastive Learning
MICRO-Graph applies self-supervised contrastive learning using motifs (frequent subgraph patterns) as a basis for subgraph sampling and representation (Zhang et al., 2020):
- Motif Discovery: Subgraphs (motifs) are identified by differentiable EM clustering on node embeddings, with each motif represented by a learned embedding vector.
- Assignment Optimization: Balanced motif assignments are enforced with doubly-stochastic constraints, solved using Sinkhorn–Knopp; cross-entropy losses with spectral regularization ensure diversity.
- Contrastive Pre-training: Graph-to-subgraph InfoNCE loss contrasts graph-level with motif-induced subgraph-level representations.
- Architecture and training: DeeperGCN backbone, 5 layers, Adam optimizer, batch size 512; see Zhang et al. (2020) for the remaining hyperparameters, including the number of motifs.
MICRO-Graph demonstrates a +2.04-point average ROC-AUC gain on OGB molecule tasks (71.19 to 73.23), outperforming node-level and random-subgraph approaches in both transfer fine-tuning and feature extraction regimes.
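The balanced-assignment step can be illustrated with a generic Sinkhorn–Knopp iteration: starting from exponentiated similarity scores, columns and rows are rescaled in turn until every subgraph distributes a total weight of 1 across motifs and every motif receives an equal share. This is a minimal sketch under stated assumptions, not the MICRO-Graph implementation; the temperature, the iteration count, and the toy score matrix are all hypothetical.

```python
import math


def sinkhorn_knopp(scores, n_iters=50, temperature=0.5):
    """Balance a (subgraphs x motifs) score matrix by alternating normalization."""
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / temperature) for s in row] for row in scores]

    col_target = n / k  # each motif should receive an equal share of the rows
    for _ in range(n_iters):
        # Column step: rescale every column to sum to n / k.
        for j in range(k):
            col_sum = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] *= col_target / col_sum
        # Row step: rescale every row to sum to 1.
        for i in range(n):
            row_sum = sum(q[i])
            q[i] = [v / row_sum for v in q[i]]
    return q


# Toy similarities: three of four subgraphs initially prefer motif 0.
toy_scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.2, 0.8],
]
assignments = sinkhorn_knopp(toy_scores)
```

Without the column step, most subgraphs in the toy matrix would collapse onto motif 0; the balancing constraint is what keeps all motifs in use.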
| SSL Method | bace | bbbp | clintox | hiv | sider | tox21 | toxcast | Average |
|---|---|---|---|---|---|---|---|---|
| Non-Pretrain | 72.8 | 82.1 | 74.98 | 73.4 | 55.7 | 76.1 | 63.3 | 71.19 |
| MICRO-Graph | 77.2 | 84.4 | 77.02 | 75.1 | 56.7 | 77.0 | 65.2 | 73.23 |
5. Implementation, API Access, and Reproducibility
The dataset is natively accessible through the OGB API. In PyTorch Geometric (PyG), loading, splitting, and evaluation take only a few lines with the `ogb` package.
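A typical loading pattern with the `ogb` and `torch_geometric` packages is sketched below. The wrapper function `load_molhiv`, the `root` directory, and the batch size are choices made for this example; the imports sit inside the function so that defining it does not require the packages to be installed. The first call downloads the dataset.

```python
def load_molhiv(root="dataset", batch_size=32):
    """Load ogbg-molhiv with its official scaffold split (illustrative wrapper)."""
    # Local imports: ogb and torch_geometric are only needed when this is called.
    from ogb.graphproppred import Evaluator, PygGraphPropPredDataset
    from torch_geometric.loader import DataLoader

    dataset = PygGraphPropPredDataset(name="ogbg-molhiv", root=root)

    # Dictionary with "train", "valid", and "test" index tensors.
    split_idx = dataset.get_idx_split()
    loaders = {
        split: DataLoader(
            dataset[idx],
            batch_size=batch_size,
            shuffle=(split == "train"),  # shuffle only the training split
        )
        for split, idx in split_idx.items()
    }

    # The evaluator computes the official metric (ROC-AUC for this dataset).
    evaluator = Evaluator(name="ogbg-molhiv")
    return dataset, loaders, evaluator
```

After collecting labels and predicted scores over a split as arrays of shape `(num_graphs, 1)`, the metric is obtained with `evaluator.eval({"y_true": y_true, "y_pred": y_pred})`, which returns a dictionary containing the key `"rocauc"`.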
Standard GNN hyperparameters match those recommended in OGB: 5 message-passing layers, hidden dim 300, mean-pooling, Adam optimizer, dropout (0/0.5), and early stopping.
Multi-RF Fusion can be reproduced with open-source libraries: RDKit for fingerprint extraction, scikit-learn for RFs, and PyG/DGL for GNNs. The pipeline, including pseudocode, fingerprint vector construction, and reproducibility across seeds, is detailed in (Bugaud, 21 Mar 2026).
6. Scientific Impact and Research Directions
The chemically motivated split protocol and rich chemical features of ogbg-molhiv have established it as a standard benchmark for molecular GNN research. Scaffold splitting, as opposed to random splitting, tests a model's out-of-distribution (OOD) generalization capacity rather than its ability to interpolate among familiar scaffolds. The range of methods evaluated on it, from GIN+VN baselines to motif-driven contrastive pre-training and RF–GNN ensembles, highlights the importance of both expressive node/bond encoding and advanced pre-training for molecular ML.
Room for algorithmic improvement remains significant: the best OGB GNN baseline achieves ~77% test ROC-AUC, blended RF–GNN ensembles lift performance by over 7 points, and motif-driven contrastive pre-training improves over its non-pretrained counterpart on the hiv task (73.4 to 75.1). A plausible implication is that hybrid and motif-centered paradigms, potentially together with domain-knowledge integration, will remain an active focus in the pursuit of robust molecular property predictors (Bugaud, 21 Mar 2026, Zhang et al., 2020, Hu et al., 2020).