ConGLUDe: Unified Geometric Drug Design

Updated 15 January 2026

The paper introduces a novel contrastive geometric learning approach that unifies structure-based and ligand-based drug design without relying on explicit pocket annotations.
It leverages a dual encoder strategy using a geometric protein encoder and a lightweight ligand encoder, integrated through multi-axis contrastive loss functions.
Empirical results show significant improvements in zero-shot virtual screening, target fishing, and pocket prediction, offering scalable and transferable drug discovery capabilities.

Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe) is a foundation-model paradigm that combines structure-based and ligand-based computational drug design within a single, contrastively trained geometric representation. By aligning representations of proteins, candidate binding pockets, and small molecules through multi-axis contrastive learning, ConGLUDe supports zero-shot virtual screening, ligand-conditioned pocket selection, and target fishing at scale—all without requiring explicit binding-site annotation as input. The method builds upon multi-view contrastive learning advances from molecular representation learning (notably UniCorn), extending their capabilities to the integration of protein–ligand structural complexes and large-scale bioactivity data, thereby enabling unified foundation modeling for end-to-end drug discovery (Feng et al., 2024, Schneckenreiter et al., 14 Jan 2026).

1. Foundational Principles and Motivation

Traditional computational drug discovery splits into structure-based (SBDD) and ligand-based (LBDD) approaches, which have historically relied on disjoint data sources: 3D co-crystal complexes and annotated protein pockets for SBDD, and bioactivity matrices for LBDD. This separation impedes scale and transferability. ConGLUDe unifies these paradigms by constructing a geometric protein encoder (based on a modified VN-EGNN architecture) and a fast ligand encoder, then employing contrastive objectives to align three views: the global protein, multiple candidate binding pockets, and small-molecule ligands. The approach removes the need for predefined pocket annotations and enables implicit pocket discovery that is ligand- and task-adaptive (Schneckenreiter et al., 14 Jan 2026).

The method extends the tri-view contrastive learning architecture exemplified by UniCorn, where molecular representations are unified across 2D graph, 2D masked fragment, and 3D conformer levels via joint contrastive and denoising objectives. These foundational ideas are transplanted into the drug-design context via explicit addition of a 4th “structure view” and by introducing protein–ligand complex-aware contrastive alignment (Feng et al., 2024).

2. Model Architecture: Protein and Ligand Encoders

ConGLUDe’s architecture is characterized by:

Geometric Protein Encoder: The protein structure is represented as a graph, where each node corresponds to a residue (with ESM-2 embeddings, dimension 1280); edges connect each residue to its 10 nearest neighbors within a 10 Å radius. Virtual nodes are introduced for pocket centers and a global protein context. Five heterogeneous message passing layers (VN-EGNN style) update residue, pocket, and global features. After clustering candidate pocket centers (DBSCAN), the model outputs:
- A global protein embedding ( $\mathbf{p}\in \mathbb{R}^D$ )
- $K$ pocket embeddings with 3D coordinates ( $\mathbf{b}_k\in\mathbb{R}^D,\ \hat{\mathbf{z}}_k\in\mathbb{R}^3$ )
Ligand Encoder: Each molecule is represented using a 2048-bit Morgan fingerprint concatenated with 210 RDKit descriptors (total 2258 dimensions); a 2-layer MLP generates a $2D$-dimensional vector, split evenly into a protein-matching subspace ( $\mathbf{m}_\mathrm{p}$ ), and a pocket-matching subspace ( $\mathbf{m}_\mathrm{b}$ ). This lightweight MLP supports rapid batch computation, facilitating scalability for large compound libraries (Schneckenreiter et al., 14 Jan 2026).
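The two input-side constructions described above can be illustrated with a minimal, standard-library sketch. It assumes one reading of the edge rule (up to k nearest residues, kept only if they also lie within the radius), and the function names `knn_radius_edges` and `ligand_encoder`, the ReLU activation, the bias-free layers, and the toy dimensions are all hypothetical choices for illustration rather than details taken from the paper.

```python
import math
import random

def knn_radius_edges(coords, k=10, radius=10.0):
    """Directed edges (i, j): j is one of the k nearest residues to i AND within `radius` Å.
    This combination of the two criteria is an assumed reading of the description."""
    edges = []
    for i, ci in enumerate(coords):
        dists = sorted((math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i)
        edges.extend((i, j) for d, j in dists[:k] if d <= radius)
    return edges

def ligand_encoder(features, w1, w2, D):
    """Toy 2-layer MLP (ReLU, no biases; both assumptions) producing a 2D-dim vector,
    split evenly into the protein-matching part m_p and the pocket-matching part m_b."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features))) for row in w1]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return out[:D], out[D:2 * D]

# Toy residues on a line, 3.8 Å apart (not real protein coordinates).
coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
edges = knn_radius_edges(coords)

# Toy sizes; the described input is 2258-dimensional (2048 fingerprint bits + 210 descriptors).
random.seed(0)
D, n_in, n_hidden = 3, 8, 4
w1 = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [[random.gauss(0, 1) for _ in range(n_hidden)] for _ in range(2 * D)]
features = [random.random() for _ in range(n_in)]
m_p, m_b = ligand_encoder(features, w1, w2, D)
```

In this toy chain, residue 0 is linked to residues 1 and 2 (3.8 Å and 7.6 Å away) but not to residue 3, which lies beyond the 10 Å cutoff.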

In the conceptual extension from UniCorn, a specialized 4th “pocket-docked” geometric view can also be encoded through equivariant GNNs, further biasing the learned molecular representations toward binding-competent geometries (Feng et al., 2024).

3. Contrastive Objectives and Loss Designs

Joint training is conducted over alternating batches of structure-based (SB) and ligand-based (LB) data, reflecting diverse input availability in pharmacological datasets. The total loss is:

$\mathcal{L} = \mathcal{L}_{\mathrm{SB}} + \mathcal{L}_{\mathrm{LB}}$

Structure-Based Loss ( $\mathcal{L}_{\mathrm{SB}}$ ):

Geometric loss ( $\mathcal{L}_{\mathrm{geo}}$ ) for pocket-center regression, residue segmentation, and confidence estimation, identical to VN-EGNN’s original targets.
Multi-axis contrastive InfoNCE terms:
1. Protein + Pocket → Molecule ( $\mathcal{L}_{\mathrm{p2m}}$ ): aligns the concatenated protein and pocket embedding with the active ligand embedding.
2. Molecule → Protein ( $\mathcal{L}_{\mathrm{m2p}}$ ): aligns the ligand’s protein-matching subspace with the global protein embedding.
3. Molecule → Pocket ( $\mathcal{L}_{\mathrm{m2b}}$ ): aligns the ligand’s pocket-matching subspace with each pocket embedding; the pocket whose predicted center is closest to the ligand’s true binding site is prioritized.

These InfoNCE terms employ cosine similarity with dimensionally scaled softmax temperatures:

$\tau_{\mathrm{p2m}} = \frac{1}{\sqrt{2D}},\quad \tau_{\mathrm{m2p}} = \tau_{\mathrm{m2b}} = \frac{1}{\sqrt{D}}$
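A minimal sketch of one such InfoNCE term, using cosine similarity and the temperatures given above, might look as follows. It assumes in-batch negatives and a single direction (query to key); whether the paper uses exactly this batching is not stated here, and the helper names `cosine` and `info_nce` as well as the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(queries, keys, tau):
    """Mean InfoNCE loss: keys[i] is the positive for queries[i];
    every other key in the batch serves as a negative (assumed setup)."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / tau for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

D = 4
tau_m2p = 1.0 / math.sqrt(D)      # also used for molecule -> pocket
tau_p2m = 1.0 / math.sqrt(2 * D)  # for the 2D-dimensional concatenated protein+pocket vector

# Toy embeddings: molecule i is constructed to lie close to protein i.
proteins = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
mols_aligned = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
mols_swapped = mols_aligned[::-1]

loss_aligned = info_nce(mols_aligned, proteins, tau_m2p)
loss_swapped = info_nce(mols_swapped, proteins, tau_m2p)
```

Correctly paired embeddings give a lower loss than swapped ones, which is the behavior the contrastive terms are meant to reward.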

Ligand-Based Loss ( $\mathcal{L}_{\mathrm{LB}}$ ):

For bioactivity labels (active/inactive) in the absence of structure, a cross-entropy loss is applied over sigmoid-transformed protein–molecule cosine similarities, focusing model capacity on functional associations.
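As a rough illustration, the ligand-based term can be written as a binary cross-entropy over sigmoid-transformed cosine similarities. The `scale` factor applied before the sigmoid, the function name `ligand_based_loss`, and the toy vectors are assumptions made for this sketch; the description above does not specify any scaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ligand_based_loss(protein_emb, mol_embs, labels, scale=1.0):
    """Mean binary cross-entropy with p = sigmoid(scale * cosine(protein, molecule)).
    `scale` is a hypothetical knob, not a documented hyperparameter."""
    total = 0.0
    for m, y in zip(mol_embs, labels):
        z = scale * cosine(protein_emb, m)
        # -log(sigmoid(z)) = log(1 + exp(-z));  -log(1 - sigmoid(z)) = log(1 + exp(z))
        total += y * math.log1p(math.exp(-z)) + (1 - y) * math.log1p(math.exp(z))
    return total / len(labels)

protein = [1.0, 0.0]
mols = [[1.0, 0.1], [-1.0, 0.2]]  # first points toward the protein, second points away

loss_consistent = ligand_based_loss(protein, mols, labels=[1, 0])
loss_inverted = ligand_based_loss(protein, mols, labels=[0, 1])
```

Labels that agree with the embedding geometry yield the smaller loss, so minimizing this term pulls actives toward their protein and pushes inactives away.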

Integration with UniCorn’s Paradigm:

The loss structure draws upon UniCorn’s tripartite objectives (fragment masking, torsion denoising, and cross-modal contrastive learning), allowing pharmacophore-level masking, denoising against structural perturbations, and multi-modal alignment to be carried over into the drug-design context (Feng et al., 2024).

4. Pocket Prediction and Inference Strategies

ConGLUDe supports several inference modes without requiring annotated pocket grids:

Zero-shot Virtual Screening: Ligands are ranked by global protein–ligand similarity, enabling target-agnostic screening.
Target Fishing: Ranking protein targets for a given ligand across large panels by similarity.
Unconditioned Pocket Prediction: Pocket candidates are ranked by geometric salience, using confidence outputs from the protein encoder.
Ligand-Conditioned Pocket Selection: For a query protein–ligand pair, all candidate pockets are scored by cosine similarity between their embeddings and the ligand’s pocket-matching subspace ( $s(\mathbf{b}_k,\mathbf{m}_\mathrm{b})$ ). This realizes rapid ligand-specific pocket identification at inference speeds orders of magnitude faster than docking-based approaches (Schneckenreiter et al., 14 Jan 2026).
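Because these inference modes reduce to similarity lookups, they can be sketched in a few lines. The example below assumes that screening compares the global protein embedding with each ligand's protein-matching subspace; the function names `select_pocket` and `rank_ligands` and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_pocket(pocket_embs, m_b):
    """Ligand-conditioned pocket selection: score each candidate pocket b_k by
    s(b_k, m_b) and return the index of the best one together with all scores."""
    scores = [cosine(b, m_b) for b in pocket_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def rank_ligands(p, ligand_m_p):
    """Virtual screening: ligand indices sorted by descending similarity to the
    global protein embedding p (use of the m_p subspace is an assumption)."""
    scores = [cosine(p, m) for m in ligand_m_p]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy embeddings, not model outputs.
pockets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best_pocket, pocket_scores = select_pocket(pockets, m_b=[0.1, 0.9, 0.2])

ranking = rank_ligands([1.0, 0.0, 0.0],
                       [[0.2, 1.0, 0.0], [1.0, 0.1, 0.0], [-1.0, 0.0, 0.0]])
```

Target fishing follows the same pattern with the roles exchanged: one ligand embedding is compared against a panel of protein embeddings and the proteins are sorted by score.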

A key operational advantage is that pockets are represented by implicit, learned embeddings, so scoring reduces to similarity calculations; this allows thousands of pocket assessments per second and facilitates high-throughput screening and target deconvolution.

5. Training Regimes and Data Sources

The joint optimization scheme alternates between structure-labeled and ligand-labeled batches:

Structure-based training:
- PDBbind v2.0 (~25k protein–ligand complexes) and scPDB (>14k complexes) form the primary sources for virtual screening and pocket selection benchmarking.
- Binding pocket and site annotations are derived for geometric loss terms and contrastive alignment.
Ligand-based training:
- The MERGED set aggregates PubChem, BindingDB, and ChEMBL for a total of ∼56 million bioactivities across ≥3,526 proteins, filtered for sequence homology.
- Validation and zero-shot testing partitions are strictly decoupled.
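The alternation between structure-labeled and ligand-labeled batches could be scheduled along the following lines. This is only a sketch: the strict 1:1 alternation, the cycling of the smaller source, and the name `alternating_batches` are assumptions, and the actual mixing ratio used in training is not given above.

```python
from itertools import cycle

def alternating_batches(sb_batches, lb_batches, n_steps):
    """Yield (kind, batch) pairs, alternating structure-based (SB) and
    ligand-based (LB) batches; each source restarts when exhausted."""
    sb, lb = cycle(sb_batches), cycle(lb_batches)
    for step in range(n_steps):
        if step % 2 == 0:
            yield "SB", next(sb)   # would be trained with the structure-based loss
        else:
            yield "LB", next(lb)   # would be trained with the ligand-based loss

# Placeholder batch labels standing in for real data loaders.
schedule = list(alternating_batches(["complex_batch_1", "complex_batch_2"],
                                    ["assay_batch_1", "assay_batch_2", "assay_batch_3"],
                                    n_steps=6))
```

In a real training loop, each yielded pair would dispatch to the corresponding loss while updating the same shared encoder weights.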

Joint weight sharing between structure- and ligand-based encoders ensures translatability across both paradigms, a central characteristic distinguishing ConGLUDe from previous single-paradigm models (Schneckenreiter et al., 14 Jan 2026).

In proposed future extensions (as per UniCorn’s adaptable blueprint), joint training may also incorporate property-specific heads for affinity regression (e.g., $pK_d$ , $IC_{50}$ ), ADMET prediction, and activity cliff mining, as well as dataset augmentation through matched molecular pairs and pharmacophore-driven masking (Feng et al., 2024).

6. Evaluation Metrics and Quantitative Performance

Performance is assessed across a spectrum of industry-standard and large-scale benchmarks:

Task	Benchmark	ConGLUDe Result	Baseline Comparison
Virtual Screening (zero-shot, pocket-free)	DUD-E	AUROC 81.29, EF@1% 31.76	Matches DrugCLIP (with pocket input), SPRINT lower (Schneckenreiter et al., 14 Jan 2026)
Virtual Screening (assay-measured)	LIT-PCBA	AUROC 64.06, EF@1% 11.03	Outperforms pocket-agnostic baselines (Schneckenreiter et al., 14 Jan 2026)
Target Fishing	Kinobeads/MS	AUROC 65.6, EF@1% 9.9	Outperforms SPRINT (42.5), DiffDock (58.9)
Binding Site Prediction	COACH420, HOLO4K	0.602, 0.525 (Top-1 DCC@4Å)	Matches VN-EGNN
Ligand-Conditioned Pocket Selection	PDBbind Time, ASD	0.47, 0.29	Best baseline: 0.45, 0.33
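For readers less familiar with the metrics in the table, the sketch below shows standard definitions of the enrichment factor at a top fraction (EF@1%), AUROC, and a DCC-style success criterion at a 4 Å threshold. The data are synthetic, the function names are hypothetical, and AUROC is returned on a 0 to 1 scale, whereas the table appears to report it as a percentage.

```python
import math

def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate among the top-scored `fraction` of compounds divided by the overall hit rate."""
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    order = sorted(range(n), key=lambda i: -scores[i])
    hits_top = sum(labels[i] for i in order[:n_top])
    return (hits_top / n_top) / (sum(labels) / n)

def auroc(scores, labels):
    """Probability that a random active outscores a random inactive (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def dcc_success(pred_center, true_center, threshold=4.0):
    """True if the predicted pocket center lies within `threshold` Å of the true site center."""
    return math.dist(pred_center, true_center) <= threshold

# Synthetic screen: 200 compounds with strictly decreasing scores, 4 actives.
scores = [200 - i for i in range(200)]
labels = [1 if i in (0, 1, 50, 150) else 0 for i in range(200)]

ef1 = enrichment_factor(scores, labels)  # top 1% = 2 compounds, both active -> 50.0
auc = auroc(scores, labels)
hit = dcc_success((0.0, 0.0, 0.0), (1.0, 2.0, 2.0))  # centers 3 Å apart
```

An EF@1% of 50 here means the top 1% of the ranking is fifty times richer in actives than a random draw from the library.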

Ablation studies demonstrate that removal of structure- or ligand-based batches, geometric supervision, or individual contrastive objectives each result in statistically significant performance degradation across tasks, confirming the multi-modal synergies realized by unified contrastive geometric learning.

This suggests that future improvements may lie in refining contrastive alignment strategies and extending property label coverage.

7. Implications, Extensions, and Limitations

By bridging SBDD and LBDD, ConGLUDe enables foundation-model functionality across drug discovery modalities:

Unified foundation encoder for pocket identification, screening, and target fishing.
High inference throughput, suitable for large-library screens and polypharmacology profiling.
No reliance on hand-designed pocket grids; supports “blind” (unannotated) structure inputs.
Transferability to downstream tasks, including affinity and ADMET prediction, via addition of property heads or generative modules.

Current limitations include uncertain performance on de novo predicted protein structures (e.g., AlphaFold models lacking fine-grained side-chain accuracy) and the absence of explicit ligand pose prediction. Extension to generative modeling or integration with diffusion-based pose samplers is tractable within the architecture. The demonstrated alignment with foundational multi-view representation learning approaches (e.g., UniCorn) suggests continuing opportunities for model fusion and transfer learning across chemical and structural drug-design modalities (Feng et al., 2024, Schneckenreiter et al., 14 Jan 2026).


References

Feng et al. (2024). UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning.

Schneckenreiter et al. (2026). Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design.
