Distillation-Based Binding Affinity Predictor
- The paper introduces a framework that distills structural details from a teacher network to enhance the accuracy of a sequence-based binding affinity predictor.
- It employs both output-level and feature-level distillation losses, integrating them with supervised MSE loss to map fixed-length representations to binding affinities.
- The approach enables proteome-wide prediction in data-limited scenarios and is validated via LOCO cross-validation and external dataset evaluation.
A distillation-based binding affinity predictor is a supervised regression framework that employs knowledge distillation from a structure-informed teacher neural network to a sequence-only student model for protein–protein binding affinity prediction. This approach leverages abundant sequence data for inference while transferring "privileged" structural knowledge—available only during model training—to improve prediction accuracy, thereby bridging the applicability gap between sequence-based and structure-based machine learning models (Abbasi et al., 7 Jan 2026).
1. Model Architecture and Input Encoding
The framework integrates two fully connected feed-forward neural regressors: a high-capacity teacher network (structure-informed) and a lightweight student network (sequence-only), both mapping fixed-length input representations to a scalar binding affinity output $\hat{y}$. Input vectors for both networks are standardized to zero mean and unit variance across features.
- Structure-Based Descriptors (teacher; input dimension varies):
- Dias et al. (26-D), Moal et al. (200-D), NIRP (211-D), Interface BLOSUM (20-D).
- Sequence-Based Descriptors (student; input dimension varies):
- k-mer (400-D), grouped k-mer (7), BLOSUM-62 (20-D), ProPy (1537-D), PSSM (20-D), ProtParam (7-D).
- Input Aggregation: Chain-level vectors are averaged over all chains of a complex, then ligand and receptor representations are concatenated.
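The aggregation step can be sketched in a few lines of numpy (a minimal illustration; the function name and toy descriptor dimensions are assumptions, not from the paper):

```python
import numpy as np

def aggregate_complex(ligand_chains, receptor_chains):
    """Average chain-level descriptor vectors within each binding partner,
    then concatenate the ligand and receptor representations."""
    ligand_vec = np.mean(np.stack(ligand_chains), axis=0)
    receptor_vec = np.mean(np.stack(receptor_chains), axis=0)
    return np.concatenate([ligand_vec, receptor_vec])

# Toy example: two ligand chains and one receptor chain, 3-D descriptors each
lig = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]
rec = [np.array([0.0, 1.0, 0.0])]
x = aggregate_complex(lig, rec)  # shape (6,): averaged ligand ++ receptor
```

Averaging over chains keeps the representation fixed-length regardless of how many chains a complex contains, which is what allows a single feed-forward backbone to serve all complexes.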
The shared network backbone for both teacher and student is:
- Input layer (dimension set by the chosen descriptor encoding)
- Hidden Layer 1: fully connected, ReLU activation, dropout 0.3
- Hidden Layer 2: fully connected, ReLU activation, dropout 0.2
- Distillation interface layer: size 16, ReLU, yielding the intermediate embedding $h$
- Output: linear layer (predicts the scalar affinity $\hat{y}$)
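A minimal numpy sketch of the backbone's forward pass (inference only; the hidden widths 128 and 64 are illustrative assumptions, since only the 16-D interface layer is specified above, and dropout is omitted at inference):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def init_layer(n_in, n_out):
    # Kaiming-style initialization, matching the training protocol below
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)

def backbone_forward(x, params):
    """Forward pass through the shared backbone; returns the 16-D
    distillation embedding h and the scalar prediction y_hat."""
    (W1, b1), (W2, b2), (W3, b3), (W4, b4) = params
    a1 = relu(x @ W1 + b1)       # hidden layer 1 (dropout omitted at inference)
    a2 = relu(a1 @ W2 + b2)      # hidden layer 2
    h = relu(a2 @ W3 + b3)       # 16-D distillation interface layer
    y_hat = h @ W4 + b4          # linear output
    return h, float(y_hat.squeeze())

# Illustrative input dimension (e.g. a 400-D k-mer student encoding)
d_in = 400
params = [init_layer(d_in, 128), init_layer(128, 64),
          init_layer(64, 16), init_layer(16, 1)]
h, y_hat = backbone_forward(rng.normal(size=d_in), params)
```

The embedding $h$ from the 16-D interface layer is what the feature-level distillation loss in Section 2 compares between student and teacher.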
2. Knowledge Distillation Formulation
The supervised distillation paradigm employs both output-level and intermediate (feature-level) guidance:
Given a batch $\{(x_i^{S}, x_i^{T}, y_i)\}_{i=1}^{B}$, with student (sequence-only) inputs $x_i^{S}$ and teacher (structure-based) inputs $x_i^{T}$:
- Student forward pass: $(h_i^{S}, \hat{y}_i^{S}) = f_{S}(x_i^{S})$
- Teacher forward pass: $(h_i^{T}, \hat{y}_i^{T}) = f_{T}(x_i^{T})$
- True affinity label: $y_i$ (binding free energy, kcal/mol)
Loss functions:
- Teacher supervised loss: $\mathcal{L}_{\text{sup}}^{T} = \frac{1}{B}\sum_{i=1}^{B}\left(\hat{y}_i^{T} - y_i\right)^2$
- Student supervised loss: $\mathcal{L}_{\text{sup}}^{S} = \frac{1}{B}\sum_{i=1}^{B}\left(\hat{y}_i^{S} - y_i\right)^2$
- Output-level distillation (prediction mimicry): $\mathcal{L}_{\text{out}} = \frac{1}{B}\sum_{i=1}^{B}\left(\hat{y}_i^{S} - \hat{y}_i^{T}\right)^2$
- Feature-level distillation (embedding mimicry): $\mathcal{L}_{\text{feat}} = \frac{1}{B}\sum_{i=1}^{B}\left\lVert h_i^{S} - h_i^{T}\right\rVert_2^2$
- Composite student loss: $\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{sup}}^{S} + \alpha_{\text{out}}\,\mathcal{L}_{\text{out}} + \alpha_{\text{feat}}\,\mathcal{L}_{\text{feat}}$
The weights $\alpha_{\text{out}}$ and $\alpha_{\text{feat}}$ balance the two mimicry terms against the supervised loss; teacher predictions and embeddings are detached so that distillation gradients flow only into the student.
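The composite objective can be evaluated directly on batch outputs; a numpy sketch (the weight values 0.5/0.5 are illustrative assumptions, and the teacher tensors are plain arrays, i.e. already "detached"):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def student_loss(y_hat_S, y_hat_T, h_S, h_T, y, alpha_out=0.5, alpha_feat=0.5):
    """Composite student loss: supervised MSE plus output-level and
    feature-level distillation terms (teacher outputs treated as constants)."""
    L_sup = mse(y_hat_S, y)                                     # ground truth
    L_out = mse(y_hat_S, y_hat_T)                               # prediction mimicry
    L_feat = float(np.mean(np.sum((h_S - h_T) ** 2, axis=1)))   # embedding mimicry
    return L_sup + alpha_out * L_out + alpha_feat * L_feat

# Toy batch of two complexes with 16-D interface embeddings
y       = np.array([-9.2, -11.5])   # experimental affinities (kcal/mol)
y_hat_T = np.array([-9.0, -11.0])   # teacher predictions
y_hat_S = np.array([-8.5, -10.5])   # student predictions
h_T = np.zeros((2, 16))
h_S = np.full((2, 16), 0.1)
L = student_loss(y_hat_S, y_hat_T, h_S, h_T, y)  # = 0.745 + 0.5*0.25 + 0.5*0.16
```

Because the teacher terms carry no gradient, minimizing this loss pulls the student toward both the labels and the teacher's behavior without perturbing the teacher.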
Pseudocode summary:
```
for epoch in 1...E:
    for batch (xS, xT, y):
        # Forward pass
        hT, yT = teacher(xT)
        hS, yS = student(xS)

        # Loss computation
        L_sup_T = MSE(yT, y)
        L_sup_S = MSE(yS, y)
        L_out   = MSE(yS, yT.detach())
        L_feat  = MSE(hS, hT.detach())
        L_student = L_sup_S + alpha_out * L_out + alpha_feat * L_feat

        # Backpropagation: teacher and student are updated separately;
        # the detached teacher outputs carry no gradient into the teacher
        teacher.zero_grad()
        L_sup_T.backward()
        teacher.step()

        student.zero_grad()
        L_student.backward()
        student.step()
```
3. Training and Evaluation Protocol
- Primary dataset: Protein Binding Affinity Benchmark v2.0 (Kastritis et al. 2011), filtered to 128 non-redundant heterodimers.
- External validation: 39 complexes from Chen et al. (2013) with stringent length/chain criteria.
- Feature normalization: z-scored.
- Cross-validation: Leave-One-Complex-Out (LOCO). For each of 128 complexes, train on all others, test on the held-out example. Reported metrics are averaged over all folds, with 3 LOCO repetitions for robust estimation.
- Optimization: Adam (lr=1e-3, weight decay=1e-4), batch size 32, up to 100 epochs (with early stopping), Kaiming initialization, MSE as loss for all components.
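The LOCO protocol with train-fold-only standardization can be sketched generically (helper names `loco_splits` and `run_loco` are illustrative; the placeholder mean predictor stands in for the actual networks):

```python
import numpy as np

def loco_splits(n_complexes):
    """Leave-One-Complex-Out: each complex serves as the test set exactly once."""
    for i in range(n_complexes):
        yield [j for j in range(n_complexes) if j != i], [i]

def run_loco(X, y, fit_fn, predict_fn):
    """Generic LOCO loop; z-scoring is fit on the training folds only,
    so no statistics leak from the held-out complex."""
    preds = np.empty_like(y)
    for train, test in loco_splits(len(y)):
        mu = X[train].mean(axis=0)
        sigma = X[train].std(axis=0) + 1e-8
        model = fit_fn((X[train] - mu) / sigma, y[train])
        preds[test] = predict_fn(model, (X[test] - mu) / sigma)
    return preds

# Toy check with a constant mean predictor in place of the neural networks
X = np.random.default_rng(1).normal(size=(8, 4))
y = np.arange(8, dtype=float)
preds = run_loco(X, y,
                 fit_fn=lambda Xtr, ytr: ytr.mean(),
                 predict_fn=lambda m, Xte: np.full(len(Xte), m))
```

With 128 complexes this loop trains 128 models per repetition; averaging metrics over all folds (and over the 3 repetitions) yields the reported scores.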
4. Quantitative Performance and Analytical Results
Performance Metrics
- Pearson correlation coefficient ($r$) between predicted and experimental affinities
- RMSE (kcal/mol): $\mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$
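Both metrics are standard and can be computed with numpy (toy affinity values below are illustrative only):

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation between predicted and experimental affinities."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the labels (kcal/mol here)."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

y_true = np.array([-9.0, -11.0, -7.5, -10.2])
y_pred = np.array([-8.6, -10.4, -8.1, -9.8])
r = pearson_r(y_true, y_pred)
err = rmse(y_true, y_pred)
```

Note that $r$ rewards rank/linear agreement while RMSE penalizes absolute error, so the two can move independently (as in the tables below).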
Cross-Validation Results (LOCO, average over 128 folds)
| Model | Features | Pearson $r$ | RMSE (kcal/mol) |
|---|---|---|---|
| Sequence-only baseline | ProPy (1537-D) | 0.375 | 2.712 |
| Structure-based teacher | Moal (200-D) | 0.512 | 2.445 |
| Distilled student | Moal (teacher) → ProPy (student) | 0.481 | 2.488 |
External validation (39 complexes):
- Sequence-only baseline: RMSE = 2.218 kcal/mol
- Distilled student: RMSE = 2.025 kcal/mol
Analytical observations:
- Distillation substantially improves sequence-based predictor accuracy, raising the Pearson correlation to roughly 94% of the structure-based teacher's ($0.481/0.512 \approx 0.94$) while requiring only sequence inputs at prediction time.
- Error analyses show the distilled student's predictions clustering closer to the identity line, with tighter Bland–Altman limits of agreement, lower bias, and less heteroscedasticity than the sequence-only baseline. All reported correlations are statistically significant.
5. Methodological Contributions and Practical Implications
- Demonstrates, for the first time, effective distillation of structural binding-affinity knowledge into a sequence-only regression model.
- Introduces a composite student objective combining ground-truth MSE, output mimicry ($\mathcal{L}_{\text{out}}$), and representation mimicry ($\mathcal{L}_{\text{feat}}$).
- Provides a reproducible LOCO protocol and independent validation pipeline.
- Publicly releases source code and pretrained weights for inference: https://github.com/wajidarshad/ProteinAffinityKD (Abbasi et al., 7 Jan 2026).
This approach enables proteome-wide prediction in domains where experimentally determined or high-quality predicted structures are unavailable, by capturing and transferring structural "privileged information" into models requiring only sequence data at deployment.
6. Current Limitations and Prospective Extensions
- Training set size (128 complexes) constrains transfer potential; experimental noise and limited structural diversity impose an upper bound on attainable performance.
- Larger training corpora (e.g., PPB-Affinity), incorporation of multi-task learning, and advanced geometric or graph neural network encoders represent immediate frontiers for improvement.
- Leveraging pretrained sequence models (protein LLMs) as student backbones and expansion to other binding systems (e.g., small-molecule, antibody–antigen, or multi-component complexes) are practical future directions.
A plausible implication is that as protein structure prediction data and high-quality structural benchmarks become more abundant, the transfer of structural knowledge via distillation will further close the performance gap between structure-based and sequence-based predictors, making proteome-scale applications increasingly tractable (Abbasi et al., 7 Jan 2026).