SPINDR Dataset for Ligand-Protein Complexes

Updated 23 September 2025

The SPINDR dataset is a rigorously curated benchmark of 35,666 ligand–protein pocket complexes derived from crystallographic data, offering detailed atomic coordinates and interaction fingerprints.
It addresses limitations of earlier datasets by filtering out artifacts and standardizing protonation states through a data quality pipeline built on tools such as the Schrödinger Protein Preparation Wizard and ProLIF.
Integrated with the FLOWR framework, SPINDR facilitates efficient structure-aware ligand generation and evaluation using metrics such as PoseBusters-validity, strain energy, and docking scores.

The SPINDR dataset is a rigorously curated benchmark for ligand–protein pocket complexes, specifically designed for structure-based ligand generation, interaction recovery, and model evaluation in computational drug design. Developed alongside the FLOWR framework, SPINDR addresses data quality limitations found in predecessor datasets and supplies refined atomic coordinates and comprehensive interaction profiles critical for training and assessing advanced generative models (Cremer et al., 14 Apr 2025).

1. Dataset Composition and Organization

SPINDR (Small molecule Protein Interaction Dataset, Refined) comprises 35,666 high-quality ligand–pocket co-crystal complexes derived directly from crystallographic data sources. Each entry contains full 3D structural information, including the atomic coordinates of both the protein binding pocket and the bound ligand. The dataset preserves data splits identical to the original Plinder set, mitigating redundancy and preventing leakage between training, validation, and testing subsets.

SPINDR records not only atom and bond types but also refined protonation states. The interaction fingerprint between protein and ligand atoms is encoded as a binary tensor across 13 distinct interaction types, calculated using the ProLIF tool. These interaction types cover hydrogen bonding, π–π stacking, cation–π contacts, and other specific binding modalities. Each ligand atom is annotated with its chemical environment and interaction status, producing a detailed tensor $I \in \{0,1\}^{N_{\text{prot}} \times N_{\text{lig}} \times 13}$.

2. Data Quality Pipeline

Existing ligand–pocket datasets such as PDBBind and CrossDocked2020 are subject to systematic issues: covalently bound ligands, missing atoms, incorrect protonation, and steric clashes are prevalent. SPINDR employs an extensive filtering and refinement protocol:

Multi-ligand complexes and entries flagged as “oligo”, “ion”, “fragment”, “cofactor”, or “artifact” are removed in initial processing.
Protein structures are prepared with the Schrödinger Protein Preparation Wizard (OPLS 2005 force field), restoring missing atoms, converting non-standard residues, assigning correct protonation states, and locally minimizing energies.
The ProLIF interaction extractor calculates atom-level interaction fingerprints, stored as an $N_{\text{prot}} \times N_{\text{lig}} \times 13$ tensor.
Data splits retain the structure of the Plinder partitioning, so that training, validation, and test subsets stay separated as in the original benchmark.

This methodology assures reliability and consistency in pocket geometry, interaction mapping, and chemical accuracy, directly addressing common pitfalls in large-scale molecular datasets.
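The initial filtering step listed above can be illustrated with a minimal sketch. The `ComplexRecord` class and its field names are hypothetical and chosen only for illustration; they are not the SPINDR or Plinder schema. Only the five flag names and the single-ligand requirement come from the text.

```python
from dataclasses import dataclass, field

# Flags named in the text; an entry carrying any of them is dropped.
EXCLUDED_FLAGS = {"oligo", "ion", "fragment", "cofactor", "artifact"}


@dataclass
class ComplexRecord:
    """Hypothetical record; field names are assumptions, not the SPINDR schema."""
    system_id: str
    n_ligands: int
    ligand_flags: set = field(default_factory=set)


def passes_initial_filter(rec: ComplexRecord) -> bool:
    """Keep single-ligand entries whose ligand carries no excluded flag."""
    return rec.n_ligands == 1 and not (rec.ligand_flags & EXCLUDED_FLAGS)


records = [
    ComplexRecord("sys_a", 1),                          # kept
    ComplexRecord("sys_b", 2),                          # multi-ligand: dropped
    ComplexRecord("sys_c", 1, {"ion"}),                 # flagged: dropped
    ComplexRecord("sys_d", 1, {"cofactor", "artifact"}),  # flagged: dropped
]

kept = [r.system_id for r in records if passes_initial_filter(r)]
print(kept)
```

Structure preparation and fingerprint extraction would follow this step; they depend on external tools and are not sketched here.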

3. Technical Data Modalities

SPINDR adopts a multimodal data structure:

Continuous features: Atomic coordinates for protein and ligand atoms, $\mathbf{r} \in \mathbb{R}^{3 \times (N_{\text{prot}} + N_{\text{lig}})}$.
Categorical features: Discrete atom types, formal charges, and bond orders.
Interaction fingerprints: For protein atom $j$ and ligand atom $i$, $I_{j,i,k}$ signifies the presence ($=1$) or absence ($=0$) of interaction type $k$. This supports atomic-level partitioning of the ligand, as illustrated by the mask $M_i = \mathbb{1}\left\{ \sum_{j=1}^{N_{\text{prot}}} \sum_{k=1}^{13} I_{j,i,k} > 0 \right\}$ for $i=1,\ldots,N_{\text{lig}}$, which marks the ligand atoms that take part in at least one interaction.
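The mask $M_i$ can be computed directly from the fingerprint tensor. The following is a minimal NumPy sketch on a toy tensor; the sizes and the entries set to 1 are made up for illustration, and the axis order (protein, ligand, interaction type) follows the indexing $I_{j,i,k}$ used above.

```python
import numpy as np

# Toy sizes for illustration only; real pockets and ligands are larger.
N_PROT, N_LIG, N_TYPES = 5, 4, 13

# Binary fingerprint I[j, i, k]: protein atom j, ligand atom i, interaction type k.
I = np.zeros((N_PROT, N_LIG, N_TYPES), dtype=np.uint8)
I[0, 1, 2] = 1  # ligand atom 1 interacts with protein atom 0 (type 2)
I[3, 1, 7] = 1  # ligand atom 1 also interacts with protein atom 3 (type 7)
I[4, 3, 0] = 1  # ligand atom 3 interacts with protein atom 4 (type 0)


def interaction_mask(fingerprint: np.ndarray) -> np.ndarray:
    """M_i is True when ligand atom i has any interaction with any protein atom."""
    # Sum over protein atoms (axis 0) and interaction types (axis 2).
    return fingerprint.sum(axis=(0, 2)) > 0


M = interaction_mask(I)
print(M.astype(int))  # [0 1 0 1]
```

Ligand atoms with `M[i] == True` are the interacting part of the ligand; the complement is the part with no recorded contacts.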

4. Integration with FLOWR Framework

SPINDR is central to the FLOWR (Flow Matching for Structure-Aware De Novo, Interaction- and Fragment-Based Ligand Generation) framework. The model:

Conditions generative flow matching processes jointly on pocket geometry, ligand structure, and optional interaction features.
Learns a mapping from pocket and interaction profile to realistic, chemically valid ligand conformations via joint coordinate denoising (MSE loss) and categorical prediction (cross-entropy loss).
Benefits from SPINDR's exhaustive interaction annotations, enabling conditioned sampling for fragment-based design and interaction recovery without retraining.

The inclusion of fine-grained interaction data facilitates the generation of ligands optimized for PoseBusters-validity, pose accuracy, reduced strain energies, and enhanced docking performance.
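The two training terms named above, an MSE loss on coordinates and a cross-entropy loss on categorical features, can be combined as in the sketch below. This is an illustration only: the function names, the weights `w_coord` and `w_type`, and the equal default weighting are assumptions, not details taken from FLOWR.

```python
import numpy as np


def coordinate_mse(pred_coords: np.ndarray, true_coords: np.ndarray) -> float:
    """Mean squared error over all predicted ligand coordinates."""
    return float(np.mean((pred_coords - true_coords) ** 2))


def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy for per-atom categorical predictions (e.g. atom types)."""
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def joint_loss(pred_coords, true_coords, type_logits, true_types,
               w_coord: float = 1.0, w_type: float = 1.0) -> float:
    """Weighted sum of both terms; the weights are hypothetical."""
    return (w_coord * coordinate_mse(pred_coords, true_coords)
            + w_type * cross_entropy(type_logits, true_types))


# Toy ligand with 3 atoms and 4 possible atom types.
true_coords = np.zeros((3, 3))
pred_coords = np.full((3, 3), 0.1)   # every coordinate is off by 0.1
type_logits = np.zeros((3, 4))       # uniform prediction over 4 types
true_types = np.array([0, 2, 1])

loss = joint_loss(pred_coords, true_coords, type_logits, true_types)
print(loss)
```

With a uniform prediction over four classes the cross-entropy term equals $\ln 4$, so the toy loss is $0.01 + \ln 4$.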

5. Evaluation Protocols and Metrics

Evaluation on SPINDR uses the following metrics:

PoseBusters-validity: Quantifies the physical plausibility of generated ligand poses (as per PoseBusters criteria).
Strain energy: Assessed via GenBench3D and relaxation energies (using GFN2-xTB, ALPB solvation).
Docking metrics: Raw and minimized AutoDock Vina scores serve as proxies for pose quality and estimated binding affinity.
Geometric fidelity: Wasserstein distances between generated and SPINDR reference distributions for bond angles and lengths.
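The geometric fidelity metric can be illustrated with a one-dimensional Wasserstein distance between two sets of bond lengths. The sketch below assumes equal sample counts, in which case the distance reduces to the mean absolute difference of the sorted samples. The bond lengths are made-up values, and this is not the evaluation code used for FLOWR. For unequal sample sizes, `scipy.stats.wasserstein_distance` handles the general case.

```python
import numpy as np


def wasserstein_1d(samples_a, samples_b) -> float:
    """1D Wasserstein-1 distance between two equal-size empirical samples."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample counts")
    # With equal counts, the optimal transport plan pairs sorted samples.
    return float(np.mean(np.abs(a - b)))


# Made-up bond lengths in angstroms, for illustration only.
reference_lengths = [1.52, 1.53, 1.54, 1.40]
generated_lengths = [1.50, 1.55, 1.56, 1.43]

distance = wasserstein_1d(reference_lengths, generated_lengths)
print(distance)
```

A smaller distance means the generated bond-length distribution lies closer to the reference distribution; the same function applies to bond angles.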

FLOWR achieves up to a 70-fold speedup in inference relative to prior methods (Cremer et al., 14 Apr 2025), a gain in computational efficiency that is reported alongside the quality metrics above.

6. Applications and Broader Significance

SPINDR serves as a robust benchmark for:

De novo ligand design conditioned on precise pocket and interaction information.
Scaffold hopping and fragment-based drug development, leveraging interaction-aware conditional models (e.g., FLOWR.multi).
Model evaluation in AI-driven drug discovery, providing reference-standard complexes for pose prediction, interaction recovery, and chemical property assessment.

The meticulously curated nature and rich annotation structure position SPINDR as a foundational resource for researchers seeking rigorous, high-fidelity datasets for both generative and predictive molecular modeling tasks.

7. Future Directions and Community Impact

Public release of SPINDR aims to promote reproducibility, benchmarking consistency, and methodological innovation. By rectifying common shortcomings in molecular datasets and supplying atomic-level interaction data within a uniform framework, SPINDR enables the development, evaluation, and deployment of state-of-the-art structure-aware ligand generation models. This supports both methodological advances and translational objectives in rational drug design, fragment-based strategies, and interaction-driven compound optimization (Cremer et al., 14 Apr 2025).

References

Cremer et al. (2025). FLOWR: Flow Matching for Structure-Aware De Novo, Interaction- and Fragment-Based Ligand Generation.
