TerraBind: Diffusion-Free Protein-Ligand Prediction
- TerraBind is a diffusion-free model that uses coarse-grained representations to accurately predict protein–ligand poses and binding affinities.
- It integrates frozen encoders (COATI-3 and ESM-2) with a transformer-based pairformer to directly learn distograms at the ligand–protein interface.
- The approach achieves a 26.6× speedup over diffusion methods, enabling scalable virtual screening with state-of-the-art accuracy and calibrated uncertainty.
TerraBind is a foundation model for protein–ligand structure and binding-affinity prediction. It employs a coarse structural representation that enables diffusion-free molecular pose generation and accurate, calibrated binding-affinity prediction at substantially higher computational throughput than all-atom diffusion-based methods. Its core hypothesis is that structure-based affinity prediction does not require atomic resolution of protein side chains or all-atom diffusion of the complex; state-of-the-art accuracy can instead be achieved with a compact pocket representation and optimized multimodal neural architectures (Rossi et al., 8 Feb 2026).
1. Central Hypothesis and Structural Representation
TerraBind’s design is predicated on the hypothesis that “full all-atom diffusion is not required for accurate binding-pose or binding-affinity prediction.” The model employs a principled coarse-grained representation in which proteins are encoded by their residue Cβ atoms (with Cα used for glycine and no explicit side-chain atoms) and ligands by their heavy atoms only (hydrogens omitted). This approach is motivated by theoretical and empirical evidence that such coarse-graining retains sufficient interfacial geometry for binding prediction, while obviating the computational bottlenecks associated with all-atom 3D diffusion approaches such as Boltz-2.
By directly learning distance distributions (“distograms”) at the ligand–protein pocket interface, TerraBind avoids full coordinate generation during training and inference, focusing on predictive geometric features essential to binding without extraneous atomic detail.
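The coarse-grained tokenization described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the residue/atom data layout (dicts mapping atom names to coordinates) is a hypothetical convenience format.

```python
# Sketch of TerraBind-style coarse-grained tokens: one Cβ coordinate per
# residue (Cα for glycine, no side-chain atoms), heavy atoms only for ligands.

def protein_tokens(residues):
    """One (x, y, z) token per residue: Cβ, falling back to Cα for glycine."""
    coords = []
    for res in residues:
        atom = "CA" if res["name"] == "GLY" else "CB"
        coords.append(res["atoms"][atom])
    return coords

def ligand_tokens(atoms):
    """Heavy-atom tokens only: hydrogens are dropped."""
    return [(elem, xyz) for elem, xyz in atoms if elem != "H"]
```

Dropping side chains and hydrogens in this way is what keeps the token count, and hence the cubic pairformer cost, small.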
2. Model Architecture and Component Modules
The TerraBind architecture consists of four primary modules:
- Frozen Encoders:
- COATI-3 provides ligand embeddings via three-modal contrastive pretraining across SMILES, 2D molecular-graph, and 3D point-cloud modalities, outputting per-atom features and a global ligand vector.
- ESM-2 supplies per-residue protein embeddings derived from masked language modeling over sequences.
- Structure-pairformer Trunk:
- Comprising 48 layers of triangle attention and triangle multiplication operating over all protein Cβ and ligand heavy-atom tokens, producing joint pairwise representations.
- The distogram head projects each pairwise representation to logits over 64 distance bins.
- Diffusion-free Pose Module:
- Consumes the expected pocket–ligand pairwise distances and solves for 3D coordinates by minimizing the squared deviation between realized and predicted pairwise distances: $\min_{X}\sum_{i<j}\big(\lVert x_i - x_j\rVert - \hat{d}_{ij}\big)^2$, where $\hat{d}_{ij}$ is the expected distance from the distogram.
- The optimization proceeds via gradient descent (Adam, learning rate 1.0), initialized from Gaussian noise, repeated for 10 seeds with the lowest final loss selected.
- Affinity Likelihood Module:
- The 6-layer affinity pairformer receives frozen pairwise latents, distogram probabilities, atom/residue embeddings, and the global ligand vector. Mean-pooling yields a complex-level vector.
- Prediction heads compute both a binary binding probability and an affinity regression (in pIC₅₀ units), with an epistemic neural network (“epinet”) generating a Gaussian approximate posterior for calibrated uncertainty.
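To make the distogram head concrete, the following sketch converts per-pair logits over 64 distance bins into an expected distance. The bin range (2–22 Å) and uniform bin width are illustrative assumptions, not confirmed settings.

```python
import math

# Illustrative: distogram logits -> expected pairwise distance.
# Bin layout (64 uniform bins over 2-22 Angstroms) is an assumption.
NUM_BINS = 64
BIN_CENTERS = [2.0 + (22.0 - 2.0) * (k + 0.5) / NUM_BINS for k in range(NUM_BINS)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_distance(logits):
    """E[d] under the per-pair distogram distribution."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, BIN_CENTERS))
```

These expected distances are exactly what the diffusion-free pose module consumes in place of generated coordinates.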
3. Diffusion-Free Pose Recovery and Optimization
TerraBind’s pose prediction is entirely diffusion-free. Instead of generating all-atom coordinates via iterative Langevin dynamics or score-based diffusion, it reconstructs plausible 3D pocket–ligand geometry by fitting expected distance constraints via coordinate optimization.
The process entails:
- Extraction of expected distance matrices from pairformer logits.
- Random initialization of pocket and ligand atom positions.
- Numerical minimization of the squared deviation between candidate and expected distances with Adam for up to 5,000 steps per sample.
- Generation of multiple candidate poses via random seeds, with the best (lowest-loss) solution retained.
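The procedure above can be sketched in pure Python. This is a minimal stand-in: plain gradient descent replaces Adam, and the step count, learning rate, and seed count are illustrative rather than the paper's settings.

```python
import math
import random

# Sketch of diffusion-free pose recovery: fit 3D coordinates to a target
# pairwise-distance matrix by gradient descent from random initializations,
# keeping the lowest-loss solution across seeds.

def fit_coordinates(dist, steps=5000, lr=0.01, seeds=5):
    n = len(dist)
    best_loss, best_x = float("inf"), None
    for seed in range(seeds):
        rng = random.Random(seed)
        x = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
        for _ in range(steps):
            grad = [[0.0, 0.0, 0.0] for _ in range(n)]
            for i in range(n):
                for j in range(i + 1, n):
                    diff = [x[i][k] - x[j][k] for k in range(3)]
                    r = math.sqrt(sum(d * d for d in diff)) + 1e-9
                    coef = 2.0 * (r - dist[i][j]) / r  # d/dx of (r - d_ij)^2
                    for k in range(3):
                        grad[i][k] += coef * diff[k]
                        grad[j][k] -= coef * diff[k]
            for i in range(n):
                for k in range(3):
                    x[i][k] -= lr * grad[i][k]
        loss = sum((math.dist(x[i], x[j]) - dist[i][j]) ** 2
                   for i in range(n) for j in range(i + 1, n))
        if loss < best_loss:
            best_loss, best_x = loss, x
    return best_x, best_loss
```

Because the loss depends only on pairwise distances, the recovered pose is determined up to a rigid motion (and reflection), which is sufficient for pocket-frame geometry.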
This regime yields convergence within 0.2 seconds for 10 samples (N ≈ 200 tokens), compared to over 25 s for diffusion-based methods (on a single NVIDIA A6000 GPU, 196 tokens), resulting in a measured 26.6× inference speedup for pocket-level models.
4. Affinity Prediction, Uncertainty Quantification, and Continual Learning
Losses and Calibration
The affinity likelihood module is trained with a combined loss: a binary term (focal loss on the binder/non-binder label) and a regression term (Huber loss on affinity, optionally including intra-assay relative affinities). Epinet training introduces a learned residual atop the regression head, sampling a Gaussian latent $z \sim \mathcal{N}(0, I)$ to produce posterior draws $\hat{y}(x, z) = \mu_\theta(x) + g_\eta(\phi(x), z)$, whose spread over $z$ enables marginal and joint uncertainty quantification.
This enables TerraBind to provide well-calibrated predictions, with >90% empirical success (within ±1 pIC₅₀) for low-uncertainty predictions and ≪50% for high-uncertainty predictions, as quantified via the interquartile range (IQR) of epinet samples.
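The IQR-based confidence bucketing can be sketched as below. The Gaussian stub stands in for actual epinet latent draws, and the bucket thresholds are illustrative, not the paper's values.

```python
import random
from statistics import quantiles

# Illustrative: per-compound uncertainty via the interquartile range (IQR)
# of posterior affinity samples. A Gaussian stub replaces epinet draws.

def posterior_samples(mu, sigma, k=256, seed=0):
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(k)]

def iqr(samples):
    q1, _, q3 = quantiles(samples, n=4)
    return q3 - q1

def confidence_bucket(samples, low=0.5, high=1.5):
    """Bucket a prediction by IQR width (thresholds in pIC50 units, illustrative)."""
    width = iqr(samples)
    if width < low:
        return "low-uncertainty"
    if width > high:
        return "high-uncertainty"
    return "medium"
```

Predictions landing in the low-uncertainty bucket are the ones reported to fall within ±1 pIC₅₀ more than 90% of the time.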
Continual Learning and Batch Acquisition
TerraBind maintains a continual learning framework leveraging its epistemic uncertainty estimates. Upon updating with new observed affinities, candidate predictions are efficiently updated via the Matheron update rule applied to the Gaussian process–like ensemble, conditioning on new assay results.
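A single-observation version of the Matheron update can be sketched as follows: a posterior function sample is the prior sample plus a kernel-weighted correction toward the new assay result, with no retraining. The RBF kernel, noise level, and function representation here are illustrative stand-ins for TerraBind's epinet ensemble.

```python
import math

# Sketch of Matheron's rule for one new observation (x_new, y_new):
#   f_post(x) = f_prior(x) + k(x, x_new) * (k(x_new, x_new) + noise)^-1
#               * (y_new - f_prior(x_new) - eps)

def rbf(x, y, length=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * length ** 2))

def matheron_update(f_prior, x_new, y_new, noise_var=0.01, eps=0.0):
    """Condition a prior function sample on one observation pathwise.

    eps is the noise draw paired with the prior sample (0.0 gives the mean path).
    """
    gain = 1.0 / (rbf(x_new, x_new) + noise_var)
    residual = y_new - f_prior(x_new) - eps
    return lambda x: f_prior(x) + rbf(x, x_new) * gain * residual
```

The appeal in a screening loop is that every cached posterior sample can be conditioned on a new assay result in closed form, so candidate rankings refresh immediately.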
For batched ligand selection, TerraBind introduces EMAX (expected maximum affinity) acquisition, explicitly selecting the batch $B$ that maximizes $\mathrm{EMAX}(B) = \mathbb{E}\big[\max_{i \in B} y_i\big]$, with the expectation taken over the joint posterior of candidate affinities.
This approach hedges against correlated prediction errors by discouraging redundant selection of similar molecules. In simulation, continual learning with EMAX demonstrated 6× greater maximum pIC₅₀ improvement over standard greedy acquisition in iterative drug design cycles on a held-out target.
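A Monte-Carlo version of EMAX over joint posterior samples can be sketched as below. The greedy batch construction is a common approximation for set acquisition, not necessarily the paper's exact optimizer.

```python
# Illustrative EMAX batch acquisition. `samples` is a matrix of joint
# posterior draws: rows are draws, columns are candidate ligands.

def emax(samples, batch):
    """Monte-Carlo estimate of E[max_{i in batch} y_i] over joint draws."""
    return sum(max(row[i] for i in batch) for row in samples) / len(samples)

def greedy_emax_batch(samples, batch_size):
    n = len(samples[0])
    batch = []
    for _ in range(batch_size):
        remaining = [i for i in range(n) if i not in batch]
        batch.append(max(remaining, key=lambda i: emax(samples, batch + [i])))
    return batch
```

Because the max is taken inside each joint draw, two perfectly correlated candidates add almost nothing to EMAX beyond either one alone, which is exactly the hedge against redundant selection described above.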
5. Empirical Benchmarks and Performance Metrics
Comprehensive benchmarking was conducted over FoldBench (n=556), PoseBusters (n=307), Runs N’ Poses (n=2687), and proprietary datasets. Key results include:
Pose accuracy (ligand RMSD < 2 Å):
| Method | RMSD < 2 Å | RMSD < 2 Å & LDDT-PLI > 0.8 |
|---|---|---|
| Boltz-1 (diffusion) | 65.0% | 50.1% |
| Boltz-1 Trunk + opt. | 61.3% | 42.2% |
| TerraBind (full) | 67.4% | 54.3% |
| TerraBind Pocket | 65.9% | 52.7% |
Affinity prediction (Pearson r):
| Dataset | Boltz-2 | TerraBind | Δr |
|---|---|---|---|
| CASP16 (2 targets) | 0.60 | 0.70 | +0.10 |
| Proprietary (18 targets) | 0.50 | 0.60 | +0.10 |
| Aggregate | 0.55 | 0.66 | +0.11 |
Inference time (NVIDIA A6000, 196 tokens, 10 samples):
- Boltz-2: 27.8 s
- TerraBind Pocket: 1.045 s
TerraBind’s RMSD and LDDT-PLI pose success rates marginally exceed diffusion-based baselines, with Pearson r improvements of ≈ 20% for affinity prediction. RMSE for affinity is also improved by ≈10–15%. The typical end-to-end speedup is 26.6× relative to diffusion methods, with screening capability approaching 10⁶–10⁷ compounds per GPU-day.
6. Practical Considerations, Limitations, and Prospects
TerraBind’s coarse yet expressive representation enables high-throughput virtual screening at scale, with pocket contexts typically under 150 tokens, controlling cubic pairformer scaling and GPU memory utilization. This computational efficiency makes billion-scale screening plausible with current hardware, enabling practical industrial application.
Notable limitations include unsuitability for pipelines demanding all-atom side chains for downstream physical refinement, reduced reliability for highly out-of-distribution ligands (which may induce high uncertainty but still be active—“activity cliffs”), and the epinet’s Gaussian marginals, which can underrepresent rare or skewed affinity error modes.
Ongoing and future directions include the incorporation of synthetic/physics-driven data to expand the ligand chemical space, development of assay-conditioned affinity prediction heads to adapt to diverse experimental contexts, and extension of epinet priors to accommodate non-Gaussian distributions (e.g., bounded or skewed marginals).
In summary, TerraBind demonstrates that carefully co-designed coarse representations, multimodal encoders (COATI-3, ESM-2), transformer-based structural integration, diffusion-free pose reconstruction, and calibrated epinet-based uncertainty allow matching or exceeding the accuracy of diffusion models at ≤4% of their computational cost for both pose and affinity prediction, removing key obstacles for industrial-scale virtual screening (Rossi et al., 8 Feb 2026).