
Distillation-Based Binding Affinity Predictor

Updated 8 January 2026
  • The paper introduces a framework that distills structural details from a teacher network to enhance the accuracy of a sequence-based binding affinity predictor.
  • It employs both output-level and feature-level distillation losses, integrating them with supervised MSE loss to map fixed-length representations to binding affinities.
  • The approach enables proteome-wide prediction in data-limited scenarios and is validated via LOCO cross-validation and external dataset evaluation.

A distillation-based binding affinity predictor is a supervised regression framework that employs knowledge distillation from a structure-informed teacher neural network to a sequence-only student model for protein–protein binding affinity prediction. This approach leverages abundant sequence data for inference while transferring "privileged" structural knowledge—available only during model training—to improve prediction accuracy, thereby bridging the applicability gap between sequence-based and structure-based machine learning models (Abbasi et al., 7 Jan 2026).

1. Model Architecture and Input Encoding

The framework integrates two fully-connected feed-forward neural regressors: a high-capacity teacher network (structure-informed) and a lightweight student network (sequence-only), both mapping fixed-length input representations to a scalar affinity output $\hat{y}$. Input vectors for both networks are standardized to zero mean and unit variance across features.

  • Structure-Based Descriptors (teacher; input dimension $d_T$ varies):
    • Dias et al. (26-D), Moal et al. (200-D), NIRP (211-D), Interface BLOSUM (20-D).
  • Sequence-Based Descriptors (student; input dimension $d_S$ varies):
    • k-mer (400-D), grouped k-mer ($7^k$-D), BLOSUM-62 (20-D), ProPy (1537-D), PSSM (20-D), ProtParam (7-D).
  • Input Aggregation: Chain-level vectors are averaged over all chains of a complex, then ligand and receptor representations are concatenated.
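
The aggregation step can be sketched as follows (a minimal illustration; the function name and example dimensions are assumptions, not taken from the paper):

```python
import numpy as np

def complex_vector(ligand_chains, receptor_chains):
    """Aggregate per-chain descriptor vectors into one fixed-length input.

    ligand_chains / receptor_chains: lists of 1-D numpy arrays, one
    descriptor vector per chain. Chains are averaged within each side,
    then ligand and receptor representations are concatenated.
    """
    ligand = np.mean(ligand_chains, axis=0)      # average over ligand chains
    receptor = np.mean(receptor_chains, axis=0)  # average over receptor chains
    return np.concatenate([ligand, receptor])    # final input: 2 * d_chain

# e.g. two 4-D ligand chains and one 4-D receptor chain -> 8-D input
x = complex_vector([np.ones(4), np.zeros(4)], [np.full(4, 2.0)])
```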

The shared network backbone for both teacher and student is:

  • Input layer (size $d$)
  • Hidden Layer-1: $h_1 = \min(512, \max(64, \lfloor d/8 \rfloor))$, activation: ReLU, dropout 0.3
  • Hidden Layer-2: $h_2 = \min(128, \lfloor h_1/2 \rfloor)$, activation: ReLU, dropout 0.2
  • Distillation interface layer: size 16 (ReLU), yielding intermediate embedding $h$
  • Output: linear $16 \to 1$ (predicts $\hat{y}$)
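
The backbone above can be sketched in PyTorch as follows (layer sizes follow the stated formulas; the class name and other implementation details are assumptions):

```python
import torch
import torch.nn as nn

class AffinityRegressor(nn.Module):
    """Shared teacher/student backbone: two ReLU hidden layers with
    dropout, a 16-D distillation interface layer, and a linear head."""

    def __init__(self, d: int):
        super().__init__()
        h1 = min(512, max(64, d // 8))
        h2 = min(128, h1 // 2)
        self.body = nn.Sequential(
            nn.Linear(d, h1), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(h1, h2), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(h2, 16), nn.ReLU(),   # distillation interface layer
        )
        self.head = nn.Linear(16, 1)        # scalar affinity output

    def forward(self, x):
        h = self.body(x)                    # 16-D embedding used for L_feat
        return h, self.head(h).squeeze(-1)  # (embedding, y_hat)
```

For the Moal descriptors ($d = 200$), this gives $h_1 = 64$ and $h_2 = 32$.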

2. Knowledge Distillation Formulation

The supervised distillation paradigm employs both output-level and intermediate (feature-level) guidance:

Given a batch $\{(x^S_i, x^T_i, y_i)\}_{i=1}^N$, with student (sequence-only) and teacher (structure-based) inputs:

  • $h^S_i = U_s(x^S_i)$, $\hat{y}^S_i = f^S(x^S_i)$ (student)
  • $h^T_i = U_t(x^T_i)$, $\hat{y}^T_i = f^T(x^T_i)$ (teacher)
  • True affinity label: $y_i$

Loss functions:

  • Teacher supervised loss:

$$\mathcal{L}_{\mathrm{sup}^T} = \frac{1}{N} \sum_{i=1}^N (\hat{y}^T_i - y_i)^2$$

  • Student supervised loss:

$$\mathcal{L}_{\mathrm{sup}^S} = \frac{1}{N} \sum_{i=1}^N (\hat{y}^S_i - y_i)^2$$

  • Output-level distillation (prediction mimicry):

$$\mathcal{L}_{\mathrm{out}} = \frac{1}{N} \sum_{i=1}^N (\hat{y}^S_i - \hat{y}^T_i)^2$$

  • Feature-level distillation (embedding mimicry):

$$\mathcal{L}_{\mathrm{feat}} = \frac{1}{N} \sum_{i=1}^N \left\| h^S_i - h^T_i \right\|_2^2$$

  • Composite student loss:

$$\mathcal{L}_{\mathrm{student}} = \alpha_{\mathrm{sup}} \mathcal{L}_{\mathrm{sup}^S} + \alpha_{\mathrm{out}} \mathcal{L}_{\mathrm{out}} + \alpha_{\mathrm{feat}} \mathcal{L}_{\mathrm{feat}}$$

Typical weights: $\alpha_{\mathrm{sup}} = 1.0$, $\alpha_{\mathrm{out}} = 0.6$, $\alpha_{\mathrm{feat}} = 0.5$.
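
The composite objective can be written directly as a function (a sketch using `torch.nn.functional.mse_loss`; the function name is an assumption). Detaching the teacher tensors ensures the distillation terms never propagate gradients into the teacher:

```python
import torch
import torch.nn.functional as F

def student_loss(y_hat_S, y_hat_T, h_S, h_T, y,
                 a_sup=1.0, a_out=0.6, a_feat=0.5):
    """Composite student objective: supervised MSE plus output-level
    and feature-level distillation, with teacher tensors detached."""
    L_sup = F.mse_loss(y_hat_S, y)                 # ground-truth MSE
    L_out = F.mse_loss(y_hat_S, y_hat_T.detach())  # prediction mimicry
    L_feat = F.mse_loss(h_S, h_T.detach())         # embedding mimicry
    return a_sup * L_sup + a_out * L_out + a_feat * L_feat
```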

Pseudocode summary:

for epoch in range(E):
    for xS, xT, y in loader:
        # Forward pass
        hT, yT = teacher(xT)
        hS, yS = student(xS)

        # Teacher update: supervised loss only
        L_sup_T = MSE(yT, y)
        opt_teacher.zero_grad()
        L_sup_T.backward()
        opt_teacher.step()

        # Student update: supervised + distillation terms
        # (teacher tensors detached, so no gradients reach the teacher)
        L_sup_S   = MSE(yS, y)
        L_out     = MSE(yS, yT.detach())
        L_feat    = MSE(hS, hT.detach())
        L_student = α_sup * L_sup_S + α_out * L_out + α_feat * L_feat
        opt_student.zero_grad()
        L_student.backward()
        opt_student.step()
Gradients from $\mathcal{L}_{\mathrm{feat}}$ and $\mathcal{L}_{\mathrm{out}}$ are blocked from propagating into the teacher. Both networks are updated jointly (the teacher is not fixed during training).

3. Training and Evaluation Protocol

  • Primary dataset: Protein Binding Affinity Benchmark v2.0 (Kastritis et al. 2011), filtered to 128 non-redundant heterodimers.
  • External validation: 39 complexes from Chen et al. (2013) with stringent length/chain criteria.
  • Feature normalization: z-scored.
  • Cross-validation: Leave-One-Complex-Out (LOCO). For each of 128 complexes, train on all others, test on the held-out example. Reported metrics are averaged over all folds, with 3 LOCO repetitions for robust estimation.
  • Optimization: Adam (lr=1e-3, weight decay=1e-4), batch size 32, up to 100 epochs (with early stopping), Kaiming initialization, MSE as loss for all components.
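
The LOCO split generation can be sketched as a simple fold iterator (an illustration of the protocol above; the function name is an assumption):

```python
import numpy as np

def loco_splits(n_complexes: int):
    """Leave-One-Complex-Out: yield (train_idx, test_idx) pairs,
    one fold per complex, training on all other complexes."""
    idx = np.arange(n_complexes)
    for i in range(n_complexes):
        yield idx[idx != i], idx[i:i + 1]

folds = list(loco_splits(128))  # 128 folds for the benchmark set
```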

4. Quantitative Performance and Analytical Results

Performance Metrics

  • Pearson correlation coefficient ($P_r$):

$$P_r = \frac{\sum_i (y_i-\bar{y})(\hat{y}_i-\overline{\hat{y}})}{\sqrt{\sum_i (y_i-\bar{y})^2}\,\sqrt{\sum_i (\hat{y}_i-\overline{\hat{y}})^2}}$$

  • RMSE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2}$$
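
Both metrics are straightforward to compute from predictions (a NumPy sketch; the function names are assumptions):

```python
import numpy as np

def pearson_r(y, y_hat):
    """Pearson correlation between labels and predictions."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return (yc * pc).sum() / np.sqrt((yc ** 2).sum() * (pc ** 2).sum())

def rmse(y, y_hat):
    """Root-mean-square error in the units of the labels (kcal/mol here)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y_hat - y) ** 2))
```

Note that $P_r$ is invariant to a constant offset in the predictions while RMSE is not, which is why both are reported.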

Cross-Validation Results (LOCO, average over 128 folds)

| Model | Features | $P_r$ | RMSE (kcal/mol) |
| --- | --- | --- | --- |
| Sequence-only baseline | ProPy (1537-D) | 0.375 | 2.712 |
| Structure-based teacher | Moal (200-D) | 0.512 | 2.445 |
| Distilled student | Moal → ProPy | 0.481 | 2.488 |

External validation (39 complexes):

  • Sequence-only: $P_r = 0.317$, RMSE = 2.218 kcal/mol
  • Distilled: $P_r = 0.429$, RMSE = 2.025 kcal/mol

Analytical observations:

  • Distillation substantially improves sequence-based predictor accuracy, narrowing the $P_r$ gap to 94% of the structure-based teacher ($0.481/0.512$), with only sequence inputs at prediction time.
  • Error analyses demonstrate closer clustering to the identity line, tighter Bland–Altman intervals, lower bias, and less heteroscedasticity for the distilled student. All reported $P$-values for correlations are $<0.01$.

5. Methodological Contributions and Practical Implications

  • Demonstrates, for the first time, effective distillation of structural binding-affinity knowledge into a sequence-only regression model.
  • Introduces a composite student objective combining ground-truth MSE, output mimicry ($\mathcal{L}_{\mathrm{out}}$), and representation mimicry ($\mathcal{L}_{\mathrm{feat}}$).
  • Provides a reproducible LOCO protocol and independent validation pipeline.
  • Publicly releases source code and pretrained weights for inference: https://github.com/wajidarshad/ProteinAffinityKD (Abbasi et al., 7 Jan 2026).

This approach enables proteome-wide prediction in domains where experimentally determined or high-quality predicted structures are unavailable, by capturing and transferring structural "privileged information" into models requiring only sequence data at deployment.

6. Current Limitations and Prospective Extensions

  • Training set size ($\sim$128 complexes) constrains transfer potential; experimental noise and lack of structural diversity contribute to an upper bound on attainable performance.
  • Larger training corpora (e.g., PPB-Affinity), incorporation of multi-task learning, and advanced geometric or graph neural network encoders represent immediate frontiers for improvement.
  • Leveraging pretrained sequence models (protein LLMs) as student backbones and expansion to other binding systems (e.g., small-molecule, antibody–antigen, or multi-component complexes) are practical future directions.

A plausible implication is that as protein structure prediction data and high-quality structural benchmarks become more abundant, the transfer of structural knowledge via distillation will further close the performance gap between structure-based and sequence-based predictors, making proteome-scale applications increasingly tractable (Abbasi et al., 7 Jan 2026).
