
HierDiff: Hierarchical Diffusion Model for 3D Molecules

Updated 8 January 2026
  • The paper introduces HierDiff, a model that leverages SE(3)-equivariant diffusion processes on coarse chemical fragments to generate accurate 3D molecular structures.
  • It employs a three-stage pipeline—coarse-grained diffusion, fine-grained EGNN decoding, and atomic-level assembly—to ensure global chemical validity and consistency.
  • Empirical results show superior molecule validity, lower RMSD, and improved drug-likeness on benchmarks like GEOMDRUG and QM9 compared to atom-level approaches.

HierDiff is a hierarchical, coarse-to-fine generative diffusion model for 3D molecular structure generation. It is designed to address the limitations of atom-level methods, especially on large and complex drug-like molecules. It uses SE(3)-equivariant neural networks and diffusion processes, treating intrinsic local chemical fragments (such as rings and functional groups) as coarse units, then decoding and assembling them into full atom-level conformations. The approach removes the need for autoregressive sampling or combinatorial rejection steps, providing scalable, high-quality, and valid 3D molecule generation (Qiang et al., 2023).

1. Hierarchical Model Architecture

HierDiff's generative process is organized into three principal stages:

  1. Coarse-grained diffusion: Molecules are decomposed into fragment graphs $G = (V, E)$, with each fragment $i$ represented by an SE(3)-invariant feature vector $h^\mathrm{f}_i$ (property or element histogram based) and an SE(3)-equivariant position vector $h^\mathrm{p}_i \in \mathbb{R}^3$ marking the fragment center. An equivariant Gaussian diffusion model is trained on joint representations $H = [H^\mathrm{f}, H^\mathrm{p}]$ to model $p_\theta(H)$, allowing the sampling of valid, globally consistent 3D arrangements of fragments.
  2. Fine-grained decoding and message passing: Starting from the set of coarse fragments (with no fine-grained types or edges assigned), the model incrementally grows a detailed fragment graph using a series of modified equivariant graph neural network (EGNN) layers. Each step involves choosing a focal node, predicting the attachment link to another coarse node, selecting the fragment type to install, and iteratively refining decoded substructure assignments to maintain chemical and structural consistency.
  3. Atomic-level assembly: Fully decoded fragment graphs with 3D fragment centers are "stitched" at the atomic level. Local attachment geometries are enumerated using RDKit's ETKDG, and the configuration that minimizes the root-mean-squared deviation (RMSD) of fragment centers is selected. Kabsch alignment then transfers the local geometries coherently into the global coordinate frame while maintaining long-range consistency.

This pipeline ensures preservation of local motif validity, avoids the pitfalls of pure atom-level sampling, and precludes combinatorial rejection sampling.

2. Equivariant Diffusion Process on Molecular Fragments

Let $M$ be the number of fragments in a molecule. Each sample is encoded as $H^\mathrm{f} \in \mathbb{R}^{M \times d_f}$ for invariant fragment features, and $H^\mathrm{p} \in \mathbb{R}^{M \times 3}$ for 3D positions. The forward noising process is:

$$q(H^t \mid H^{t-1}) = \mathcal{N}(H^t; \sqrt{1-\beta_t}\, H^{t-1}, \beta_t I)$$

and the reverse (denoising) process is:

$$p_\theta(H^{t-1} \mid H^t) = \mathcal{N}(H^{t-1}; \mu_\theta(H^t, t), \beta_t I)$$

where $\mu_\theta$ is an SE(3)-equivariant transformation implemented by a graph neural network. The model's objective is given by a variational bound:

$$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_T + \dots + \mathcal{L}_1$$

with a classification-style form for integer-valued features and a denoising score-matching form for continuous features. The setup guarantees invariance under global rotation and translation; the proof rests on the equivariant design of $p(H^T)$ and $\mu_\theta$.
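The forward process above can be sampled in closed form at any step. The following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard identity $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$ obtained by composing the per-step Gaussians, the linear schedule quoted later in this article, and (as an assumption borrowed from common equivariant diffusion practice) positions and position noise projected to zero center of mass. The function names are illustrative.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1..beta_T."""
    return np.linspace(beta_start, beta_end, T)

def center(pos):
    """Subtract the centroid so positions have zero center of mass."""
    return pos - pos.mean(axis=0, keepdims=True)

def q_sample(h_feat, h_pos, t, betas, rng):
    """Draw H^t ~ q(H^t | H^0) in closed form (t is 1-indexed).

    alpha_bar_t = prod_{s<=t} (1 - beta_s) follows from composing the
    per-step Gaussian kernels q(H^s | H^{s-1}).
    """
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]
    eps_f = rng.standard_normal(h_feat.shape)
    # Assumption: position noise is centered to keep translation invariance.
    eps_p = center(rng.standard_normal(h_pos.shape))
    feat_t = np.sqrt(alpha_bar) * h_feat + np.sqrt(1.0 - alpha_bar) * eps_f
    pos_t = np.sqrt(alpha_bar) * center(h_pos) + np.sqrt(1.0 - alpha_bar) * eps_p
    return feat_t, pos_t

rng = np.random.default_rng(0)
M, d_f = 5, 8                        # 5 fragments, 8-dim invariant features
H_f = rng.standard_normal((M, d_f))  # stand-in for H^f
H_p = rng.standard_normal((M, 3))    # stand-in for H^p
betas = linear_beta_schedule()
feat_t, pos_t = q_sample(H_f, H_p, t=500, betas=betas, rng=rng)
```

Sampling $H^t$ directly from $H^0$ avoids simulating all intermediate steps during training.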

3. Fine-Grained Decoding via Equivariant Message Passing

All fine decoding steps utilize a modified EGNN architecture. At each stage, node features $n_i$, coordinates $x_i$, and edge embeddings $e_{ij}$ are updated by:

  • $m_{ij} = \phi_e(e_{ij}, \|x_i - x_j\|^2)$
  • $n_i' = n_i + \sum_j \psi_n(n_i, n_j, m_{ij})$
  • $x_i' = x_i + \sum_j \kappa\,(x_i - x_j)\,\gamma(m_{ij})$
  • $e_{ij}' = e_{ij} + \psi_e(e_{ij}, n_i, n_j, \|x_i - x_j\|^2)$
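The four update rules above can be sketched as a single dense layer in NumPy. This is an illustrative sketch only: the MLPs are small random (untrained) stand-ins, the graph is assumed fully connected, self-pairs are excluded from the sums, and the helper names `mlp` and `egnn_layer` are not from the paper. The final lines rotate and translate the input to check numerically that node and edge features are invariant while coordinates transform equivariantly.

```python
import numpy as np

def mlp(rng, d_in, d_out, hidden=16):
    """Tiny random two-layer ReLU MLP, used as an untrained stand-in."""
    W1 = rng.standard_normal((d_in, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, d_out)) * 0.1
    return lambda z: np.maximum(z @ W1, 0.0) @ W2

def egnn_layer(n, x, e, phi_e, psi_n, gamma, psi_e, kappa=1.0):
    """One update of node features n (N,d), coords x (N,3), edges e (N,N,de)."""
    N, d = n.shape
    diff = x[:, None, :] - x[None, :, :]           # x_i - x_j, shape (N,N,3)
    d2 = (diff ** 2).sum(-1, keepdims=True)        # squared distances
    m = phi_e(np.concatenate([e, d2], -1))         # messages m_ij
    ni = np.broadcast_to(n[:, None, :], (N, N, d))
    nj = np.broadcast_to(n[None, :, :], (N, N, d))
    mask = 1.0 - np.eye(N)[:, :, None]             # assumption: skip j == i
    n_new = n + (mask * psi_n(np.concatenate([ni, nj, m], -1))).sum(1)
    x_new = x + (mask * kappa * diff * gamma(m)).sum(1)
    e_new = e + psi_e(np.concatenate([e, ni, nj, d2], -1))
    return n_new, x_new, e_new

rng = np.random.default_rng(0)
N, d, de, dm = 4, 4, 3, 5
n = rng.standard_normal((N, d))
x = rng.standard_normal((N, 3))
e = rng.standard_normal((N, N, de))
nets = dict(phi_e=mlp(rng, de + 1, dm), psi_n=mlp(rng, 2 * d + dm, d),
            gamma=mlp(rng, dm, 1), psi_e=mlp(rng, de + 2 * d + 1, de))
n1, x1, e1 = egnn_layer(n, x, e, **nets)

# Equivariance check: apply a random rotation Rot and translation t to the input.
Rot, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(Rot) < 0:
    Rot[:, 0] *= -1                                # make it a proper rotation
t = rng.standard_normal(3)
n2, x2, e2 = egnn_layer(n, x @ Rot.T + t, e, **nets)
```

Because the networks only see squared distances, the messages are invariant, and the coordinate update is a sum of difference vectors scaled by invariant scalars, which is what makes the layer equivariant.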

Four specialized EGNN-based modules underpin decoding:

  • $\varnothing_\mathrm{focal}$: Softmax over nodes to select the focal node.
  • $D_\mathrm{edge}$: Scores each coarse node as an attachment site.
  • $O_\mathrm{node}$: Predicts the fragment type for new fine node placements.
  • $\varnothing_\mathrm{refine}$: Masked-prediction EGNN that reassigns fragment types to resolve inconsistencies.

Different connectivity patterns and readout heads are employed for each step, enforcing both chemical valence rules and spatial coherence.
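As a small illustration of the focal-node step, the sketch below applies a softmax over per-node scores and picks a node. The scores would come from an EGNN readout head; here they are plain numbers. Restricting the softmax to nodes that are still undecoded is an assumption made for this example, and `select_focal` is a hypothetical helper name.

```python
import numpy as np

def select_focal(scores, undecoded, rng=None):
    """Masked softmax over node scores; returns (probabilities, chosen index).

    Assumes at least one entry of `undecoded` is True. With rng=None the
    most probable node is chosen; otherwise a node is sampled.
    """
    scores = np.asarray(scores, dtype=float)
    undecoded = np.asarray(undecoded, dtype=bool)
    logits = np.where(undecoded, scores, -np.inf)
    logits = logits - logits[undecoded].max()      # numerical stability
    p = np.exp(logits)
    p = p / p.sum()
    idx = int(np.argmax(p)) if rng is None else int(rng.choice(len(p), p=p))
    return p, idx

# Node 0 is already decoded, so it receives zero probability.
probs, focal = select_focal([2.0, 0.5, 1.0], [False, True, True])
```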

4. Iterative Refinement for Combinatorial Consistency

To resolve combinatorial inconsistencies (e.g., valence or attachment conflicts) after fragment insertions, HierDiff introduces an iterative refinement module. For the current fragment graph $T$ and one-hot fragment types $f(V_i)$, the refinement density is

$$P^*(f_1,\dots,f_K \mid H) = \prod_{i=1}^K P_{\theta, \text{ref}}(f(V_i) \mid T\setminus\{i\}, H)$$

Sampling is performed via short Gibbs-style Markov chain Monte Carlo, repeatedly masking a node, resampling its type, and accepting updates until convergence criteria are met (typically 1–3 passes). Training involves optimizing the masked likelihood:

$$\mathcal{L}_\text{refine} = -\mathbb{E}_\text{data} \log P_{\theta,\text{ref}}(f^*(V_i) \mid T\setminus\{i\}, H)$$

This module is critical for ensuring that globally sampled fragment assignments remain chemically valid without the need for expensive rejection steps.
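The mask-and-resample loop can be sketched as follows. This is a generic Gibbs-style sweep under stated assumptions, not the paper's code: `conditional(types, i)` is a hypothetical stand-in for $P_{\theta,\text{ref}}(f(V_i) \mid T\setminus\{i\}, H)$ that returns a probability vector over the vocabulary, and "convergence" is taken to mean that a full pass changes no node, which the source does not specify.

```python
import numpy as np

def gibbs_refine(types, conditional, n_passes=3, rng=None):
    """Resample each node's fragment type given all the other nodes.

    `conditional(types, i)` must return a probability vector over the
    fragment vocabulary for node i, treating node i as masked.
    """
    if rng is None:
        rng = np.random.default_rng()
    types = np.array(types, copy=True)
    for _ in range(n_passes):
        changed = False
        for i in rng.permutation(len(types)):
            probs = conditional(types, i)
            new_type = int(rng.choice(len(probs), p=probs))
            changed = changed or (new_type != types[i])
            types[i] = new_type
        if not changed:                  # assumed stopping rule
            break
    return types

def toy_conditional(types, i, vocab_size=3):
    """Toy model: favor the type most common among the other nodes."""
    counts = np.bincount(np.delete(types, i), minlength=vocab_size)
    logits = 2.0 * counts.astype(float)
    p = np.exp(logits - logits.max())
    return p / p.sum()

refined = gibbs_refine([0, 0, 2, 0, 1], toy_conditional,
                       rng=np.random.default_rng(0))
```

In a real system the conditional would be the trained masked-prediction network evaluated on the current fragment graph and coarse sample.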

5. Atomic-Level Conformation Assembly

After fine-level fragment decoding, HierDiff proceeds to determine atomic connectivity and spatial arrangement:

  • For each fragment-neighbor pair $(F_i, F_j)$:

    1. Enumerate chemically valid atom-atom attachments (guided by valence rules and RDKit's ETKDG conformer generator).
    2. Compute fragment-center RMSD between candidate local conformations and the sampled fragment centers.
    3. Select the attachment minimizing RMSD.
  • For each chosen configuration, the optimal rigid transform $(R, t)$ is computed with the Kabsch algorithm to register local geometries into the global frame, aligning all atoms of fragment $j$ accordingly and proceeding recursively through the molecular graph.

This approach maintains long-range consistency across the full molecular structure, with atomic-level assembly not requiring separate global optimization steps.
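The alignment and selection steps can be illustrated with a short NumPy sketch of the Kabsch algorithm and an RMSD-based choice among candidates. It is a simplified illustration: candidates are given directly as arrays of fragment-center coordinates (no RDKit calls), each candidate is rigidly aligned before its RMSD is measured (an assumption, since local conformers live in arbitrary frames), and `pick_attachment` is a hypothetical helper name.

```python
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) minimizing sum_k ||R p_k + t - q_k||^2."""
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p0).T @ (Q - q0)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q0 - R @ p0
    return R, t

def rmsd(A, B):
    """Root-mean-squared deviation between matched point sets."""
    return float(np.sqrt(((A - B) ** 2).sum(axis=-1).mean()))

def pick_attachment(candidates, target_centers):
    """Index and RMSD of the candidate that best matches the target centers."""
    scores = []
    for C in candidates:
        R, t = kabsch(C, target_centers)
        scores.append(rmsd(C @ R.T + t, target_centers))
    best = int(np.argmin(scores))
    return best, scores[best]

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3))          # sampled fragment centers
Rot, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(Rot) < 0:
    Rot[:, 0] *= -1                           # make it a proper rotation
exact = target @ Rot.T + np.array([1.0, 2.0, 3.0])   # rigidly moved copy
noisy = exact + 0.5 * rng.standard_normal(exact.shape)
best, best_rmsd = pick_attachment([noisy, exact], target)
```

Here the second candidate differs from the target only by a rigid motion, so it is selected with an RMSD that is zero up to floating-point error.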

6. Implementation Overview

Key implementation details include:

  • Fragment vocabulary: Constructed via a 3D-aware tree decomposition (as in Jin et al. 2018), with a vocabulary size of around 800 covering rings, acyclic bonds, and bridged rings, while keeping the resulting fragment graph free of cycles.
  • Coarse feature options: "Property" encoding (8 dimensions capturing chemical features); "element" encoding (3 dimensions capturing element/hydrogen/charge counts).
  • Diffusion parameters: 1000 steps, linear $\beta_t$ from $10^{-4}$ to 0.02, batch size 256, and a 6-layer EGNN for the drift network with 128 hidden units.
  • Fine decoder: EGNN stacked 4–6 layers, 128 hidden dimensions, 64 edge latent dimensions, using ReLU activations in all MLPs.
  • Training: Adam optimizer, learning rate $3 \times 10^{-4}$, 200 epochs.
  • Refinement: 3 MCMC steps per inference.
  • Assembly: RDKit conformer calls are $\mathcal{O}(1)$ per edge, and Kabsch alignment costs $\mathcal{O}(d^3)$ for $d$ shared atoms. Sampling time is $\sim$3 s per molecule on an NVIDIA 1080Ti.
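For reference, the hyperparameters listed above can be collected in one configuration object. The sketch below is only a convenient way to hold the quoted values; the class and field names are hypothetical, the decoder depth is set to the lower end of the quoted 4 to 6 range, and the vocabulary size is the approximate figure given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HierDiffConfig:
    """Values quoted in this article; field names are illustrative."""
    diffusion_steps: int = 1000
    beta_start: float = 1e-4
    beta_end: float = 0.02
    batch_size: int = 256
    drift_layers: int = 6        # EGNN layers in the drift network
    hidden_dim: int = 128
    decoder_layers: int = 4      # article quotes a range of 4 to 6
    edge_latent_dim: int = 64
    learning_rate: float = 3e-4
    epochs: int = 200
    refine_steps: int = 3        # MCMC steps per inference
    vocab_size: int = 800        # approximate

    def betas(self):
        """Linear noise schedule from beta_start to beta_end."""
        T = self.diffusion_steps
        step = (self.beta_end - self.beta_start) / (T - 1)
        return [self.beta_start + k * step for k in range(T)]

cfg = HierDiffConfig()
schedule = cfg.betas()
```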

7. Empirical Results and Evaluation

HierDiff has been evaluated across multiple benchmarks with the following principal outcomes:

  • GEOMDRUG (304k drug-like molecules):
    • Best QED (quantitative drug-likeness) of 0.639 (vs EDM 0.608).
    • Complete molecule validity of 90.4% (vs EDM 83.5%).
    • Reduced mean molecular weight gap, 13.3 amu (vs EDM 23.7 amu).
  • Conformational quality:
    • Fragment-RMSD 2.43 Å (vs EDM 3.23 Å).
    • Atom-RMSD coverage 0.546 (vs EDM 0.489).
  • QM9 small-molecule validity:
    • 100% connectivity (vs EDM 91.9%); uniqueness ≈98%.
  • Ablation studies:
    • Removing iterative refinement increases synthetic accessibility score (SAS, indicating worse drug-likeness), and decreases QED and MCF.
    • Reducing diffusion steps to as low as 250 retains ≥80% output quality.
    • Restricting the vocabulary to simple rings/bonds degrades all drug-likeness metrics.
  • Conditional generation:
    • Demonstrates lower conditional mean absolute error (MAE) than EDM on 3 out of 4 molecular property metrics (asphericity, QED, SAS, LogP).

Across the reported tests, HierDiff demonstrates higher validity and geometric quality than prior atom-level diffusion methods. Global consistency and chemical validity are supported by the combination of coarse-to-fine diffusion, message passing, iterative refinement, and deterministic atomic assembly. Removal of any single module leads to measurable drops in these metrics, underlining the necessity of the full hierarchical design (Qiang et al., 2023).
