- The paper introduces a calibrated abstention framework combining dual encoders, temperature scaling, and conformal prediction for robust TCR–pMHC binding prediction under epitope shift.
- Under the epitope-held-out split, the method reaches an AUROC of 0.813 and, at 80% coverage, reduces the error rate from 18.7% to 10.9%.
- It emphasizes selective prediction to mitigate overconfident mispredictions, reducing expected calibration error by 69.7% relative to an uncalibrated baseline and supporting risk management in experimental validation.
Calibrated Abstention for Reliable TCR–pMHC Binding Prediction under Epitope Shift
Introduction
Accurate prediction of binding between T-cell receptors (TCRs) and peptide-MHC (pMHC) complexes is fundamental in computational immunology, with applications such as vaccine design and T-cell-based therapies. Despite progress from advanced neural architectures and protein language models, practical deployments face a major challenge under epitope shift, where test-time targets (peptides/epitopes) are not represented in the training data. The resulting distributional shift causes classifier miscalibration and unchecked overconfidence, which is costly because predictions feed into expensive downstream experimental validation.
This work reframes TCR–pMHC binding prediction as a selective prediction task. The key innovation is a calibrated abstention pipeline (CAP) that combines dual encoders built on a protein language model, post-hoc calibration, and conformal prediction to deliver robust coverage–risk trade-offs under distributional shift. The approach is evaluated under clinically relevant, shift-aware regimes, with methodological and practical implications for the broader computational immunology community.
Methodological Framework
Dual-Encoder Architecture
The model employs a dual-encoder design in which the TCR sequences (CDR3α and CDR3β) and the peptide sequence are embedded independently using ESM-2, a protein language model with 650M parameters. Each sequence embedding is obtained by mean-pooling the contextualized per-residue outputs into a fixed-size vector. These embeddings are concatenated and passed through a two-layer multi-layer perceptron (MLP) with GELU activations, layer normalization, and dropout, and a sigmoid output layer produces the binding probability.
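The data flow described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: it assumes the per-residue embeddings have already been produced by the encoder (ESM-2 is not called here), uses toy dimensions and untrained random weights, omits dropout because it is inactive at inference, and places layer normalization before the GELU, an ordering the summary does not specify. All function names are hypothetical.

```python
import math
import random

def mean_pool(token_embeddings):
    """Average per-residue vectors (a list of equal-length lists) into one vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def linear(x, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """Exact GELU via the Gaussian CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_head(in_dim, hidden_dim, seed=0):
    """Random weights for illustration only; a real head would be trained."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]]
    return {"w1": w1, "b1": [0.0] * hidden_dim, "w2": w2, "b2": [0.0]}

def binding_probability(token_embeddings_per_sequence, head):
    """Pool each sequence, concatenate the pooled vectors, apply the MLP head."""
    joint = []
    for tokens in token_embeddings_per_sequence:
        joint.extend(mean_pool(tokens))
    hidden = [gelu(v) for v in layer_norm(linear(joint, head["w1"], head["b1"]))]
    logit = linear(hidden, head["w2"], head["b2"])[0]
    return sigmoid(logit)

# Toy stand-ins for encoder outputs: (sequence length) x (embedding dim 8).
rng = random.Random(1)
cdr3a = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(12)]
cdr3b = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(14)]
peptide = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(9)]

head = init_head(in_dim=3 * 8, hidden_dim=4)
p = binding_probability([cdr3a, cdr3b, peptide], head)
```

Because each sequence is pooled to a fixed-size vector before concatenation, the head's input width depends only on the embedding dimension and the number of sequences, not on sequence length.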
Loss Function and Training Strategy
Due to a pronounced class imbalance (positive rate approximately 4%), training uses class-weighted binary cross-entropy loss, with the weights inversely proportional to class frequencies. This configuration is aligned with the detection-oriented nature of TCR–pMHC tasks, where discovery of rare positive binders is operationally critical.
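A small sketch of class-weighted binary cross-entropy, under stated assumptions: the normalization `n / (2 * n_c)` is one common convention for inverse-frequency weights and is not confirmed by the summary, and the function names are hypothetical.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights, scaled so both classes carry equal total
    weight: w_c = n / (2 * n_c). This scaling is an assumed convention."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {1: n / (2.0 * n_pos), 0: n / (2.0 * n_neg)}

def weighted_bce(probs, labels, weights, eps=1e-12):
    """Mean class-weighted binary cross-entropy over all examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
    return total / len(labels)

labels = [1] * 4 + [0] * 96      # toy set with a 4% positive rate
weights = class_weights(labels)  # positives weighted 24x more than negatives
probs = [0.5] * 100              # an uninformative predictor
loss = weighted_bce(probs, labels, weights)
```

With a 4% positive rate, each positive example is weighted 24 times more heavily than each negative, so the four positives contribute as much to the loss as the 96 negatives combined.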
Calibration via Temperature Scaling
Raw sigmoid outputs from neural networks are often poorly calibrated, particularly under distribution shift. The method applies post-hoc temperature scaling on a matched calibration set, fitting a single temperature parameter by minimizing negative log-likelihood. Calibration is quantified by the Expected Calibration Error (ECE), and improvements are reported across all evaluation regimes.
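Temperature scaling for a binary classifier can be sketched as a one-dimensional search. The example below is illustrative only: the golden-section optimizer, the search bounds, and the synthetic overconfident logits are assumptions, and the summary does not say which optimizer the authors used.

```python
import math
import random

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        # Stable softplus; the per-example loss equals softplus(s) - y * s.
        softplus = max(s, 0.0) + math.log1p(math.exp(-abs(s)))
        total += softplus - y * s
    return total / len(logits)

def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log T. The NLL is convex in 1/T, hence
    unimodal in T, so a bracketing search is sufficient."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = nll(logits, labels, math.exp(c))
    fd = nll(logits, labels, math.exp(d))
    for _ in range(iters):
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = nll(logits, labels, math.exp(c))
        else:        # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = nll(logits, labels, math.exp(d))
    return math.exp((a + b) / 2.0)

# Synthetic calibration set: labels follow sigmoid(t), but the "model"
# reports logits 3x too large, so the fitted temperature should land near 3.
rng = random.Random(0)
true_logits = [rng.gauss(0.0, 1.5) for _ in range(2000)]
labels = [1 if rng.random() < 1.0 / (1.0 + math.exp(-t)) else 0 for t in true_logits]
logits = [3.0 * t for t in true_logits]

temperature = fit_temperature(logits, labels)
calibrated = [1.0 / (1.0 + math.exp(-z / temperature)) for z in logits]
```

Dividing every logit by the same positive temperature preserves the ranking of predictions, which is why temperature scaling changes calibration without changing AUROC.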
To handle deployment scenarios where overconfident mispredictions are damaging, the approach adds a conformal prediction-based abstention rule. Nonconformity scores are computed on the calibration set, and a quantile-based threshold is chosen for a specified error tolerance ε so that the error rate on retained predictions stays at or below ε, up to finite-sample variation and under the assumption that calibration and test examples are exchangeable. At test time, the model abstains on predictions whose nonconformity exceeds this threshold, so the fraction of retained predictions can be matched to the available screening budget.
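The summary does not specify the nonconformity score or exactly how the threshold maps to abstention, so the sketch below substitutes the standard split conformal construction. It scores each label by one minus its predicted probability, takes the finite-sample-corrected quantile of calibration scores, and abstains whenever the resulting prediction set is not a single label. The score choice, the function names, and the toy calibration data are all assumptions.

```python
import math

def nonconformity(p_bind, label):
    """One minus the probability the model assigns to the given label."""
    return 1.0 - (p_bind if label == 1 else 1.0 - p_bind)

def conformal_threshold(cal_probs, cal_labels, epsilon):
    """Finite-sample-corrected (1 - epsilon) quantile of calibration scores,
    computed on the true labels."""
    scores = sorted(nonconformity(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - epsilon))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this epsilon
    return scores[rank - 1]

def prediction_set(p_bind, threshold):
    """All labels whose nonconformity does not exceed the threshold."""
    return [y for y in (0, 1) if nonconformity(p_bind, y) <= threshold]

def predict_or_abstain(p_bind, threshold):
    """Return a label when exactly one is plausible; otherwise None (abstain)."""
    labels = prediction_set(p_bind, threshold)
    return labels[0] if len(labels) == 1 else None

# Toy calibration set of (calibrated probability, true label) pairs.
cal_probs = [0.95, 0.9, 0.85, 0.8, 0.7, 0.3, 0.2, 0.15, 0.1, 0.05,
             0.6, 0.4, 0.9, 0.1, 0.75, 0.25, 0.35, 0.65, 0.08, 0.92]
cal_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
              0, 1, 1, 0, 1, 0, 0, 1, 0, 1]

threshold = conformal_threshold(cal_probs, cal_labels, epsilon=0.2)
decisions = [predict_or_abstain(p, threshold) for p in (0.9, 0.5, 0.05)]
```

Under exchangeability, the prediction set contains the true label with probability at least 1 − ε on average over test points. This is a marginal guarantee, and a shift between calibration and test data can weaken it.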
Evaluation Protocols
Beyond random splits, two shift-aware splits are used:
- Epitope-held-out (EHO): All data involving selected test epitopes are held out, explicitly probing cross-epitope generalization.
- Distance-aware (DA): Test examples contain only TCRs with low sequence similarity to the training set, targeting receptor-level novelty.
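The two splits above can be sketched as follows. This is an assumption-laden illustration: the record fields, the use of Levenshtein distance on CDR3β as the similarity measure, and the distance cutoff are hypothetical, since the summary does not state how sequence similarity is measured. The sequences and labels are made-up toy values.

```python
def epitope_held_out_split(records, test_epitopes):
    """Send every record involving a held-out epitope to the test set."""
    held_out = set(test_epitopes)
    train = [r for r in records if r["epitope"] not in held_out]
    test = [r for r in records if r["epitope"] in held_out]
    return train, test

def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def keep_distant_test_records(train, test, min_distance):
    """Keep only test records whose CDR3-beta is at least min_distance edits
    away from every training CDR3-beta (an assumed notion of low similarity)."""
    train_seqs = {r["cdr3b"] for r in train}
    return [
        r for r in test
        if all(edit_distance(r["cdr3b"], s) >= min_distance for s in train_seqs)
    ]

# Toy records; sequences, epitope names, and labels are illustrative only.
records = [
    {"cdr3b": "CASSLGQAYF", "epitope": "EPI_A", "label": 1},
    {"cdr3b": "CASSLGQSYF", "epitope": "EPI_A", "label": 1},
    {"cdr3b": "CASSPDRGNF", "epitope": "EPI_B", "label": 1},
    {"cdr3b": "CASSLGQAYF", "epitope": "EPI_B", "label": 0},
    {"cdr3b": "CSARDRTGNF", "epitope": "EPI_C", "label": 1},
    {"cdr3b": "CASSLGQAYW", "epitope": "EPI_C", "label": 0},
]

train, test = epitope_held_out_split(records, test_epitopes=["EPI_C"])
test_distant = keep_distant_test_records(train, test, min_distance=3)
```

Splitting by epitope rather than by row guarantees that no test epitope appears in training. The distance filter then removes test receptors that are near-duplicates of training receptors.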
Metrics include AUROC, AUPRC, ECE, Brier score, and coverage–risk curves, enabling multifaceted analysis of both discrimination and calibration performance.
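For reference, the two calibration-oriented metrics can be computed as in the sketch below. The equal-width binning of the positive-class probability and the choice of 10 bins are assumptions, since the summary does not state the ECE variant or bin count used.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by probability, then average the gap between mean
    predicted probability and observed positive rate, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_p - frac_pos)
    return ece

# Toy predictions: the 0.9 bin is 75% positive, the 0.1 bin is about 17% positive.
probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

ece = expected_calibration_error(probs, labels)
brier = brier_score(probs, labels)
```

ECE isolates calibration, while the Brier score mixes calibration and discrimination, which is one reason to report both.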
Experimental Findings
Main Numerical Results
Under the EHO split, CAP achieves an AUROC of 0.813 and an ECE of 0.043, and at 80% coverage it reduces the error rate from 18.7% to 10.9%. Relative to an uncalibrated baseline, ECE is reduced by 69.7%. These gains hold under both the EHO and DA splits, whereas random splits overestimate real-world discrimination (e.g., the baseline reaches AUROC 0.871 on random splits versus 0.782 under epitope shift).
Importantly, temperature scaling alone consistently reduces ECE with minor or negligible impact on discrimination metrics, separating the axes of calibration and discriminatory power. The addition of conformal abstention further improves metrics on the retained set (AUPRC, lower ECE), allowing actionable control over coverage–risk trade-offs.
Coverage–Risk Trade-off
Sweeping the abstention threshold shows that the error rate among retained predictions decreases smoothly as coverage is reduced. Specifically, moving from 100% to 80% coverage under EHO lowers the error rate from 18.7% to 10.9%, a 41.7% relative reduction, alongside a decrease in ECE, indicating that the uncertainty estimates are meaningful and well-ordered. This directly supports prioritization under finite experimental budgets.
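A coverage–risk curve of this kind can be traced by ranking predictions by confidence and measuring the error rate among the top fraction retained. The sketch below assumes confidence is max(p, 1 − p) and a 0.5 decision threshold, neither of which the summary confirms. The toy data are illustrative.

```python
def coverage_risk_curve(probs, labels):
    """Return (coverage, error_rate) pairs, retaining the most confident
    predictions first. The order among tied confidences is arbitrary."""
    scored = []
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        confidence = max(p, 1.0 - p)
        scored.append((confidence, int(pred != y)))
    scored.sort(key=lambda item: item[0], reverse=True)
    curve, errors = [], 0
    for k, (_, is_error) in enumerate(scored, start=1):
        errors += is_error
        curve.append((k / len(scored), errors / k))
    return curve

def risk_at_coverage(curve, target):
    """Error rate at the largest coverage that does not exceed target."""
    eligible = [risk for cov, risk in curve if cov <= target + 1e-12]
    return eligible[-1] if eligible else None

# Toy predictions in which the three least confident cases are the errors.
probs = [0.95, 0.05, 0.9, 0.2, 0.8, 0.7, 0.35, 0.6, 0.45, 0.55]
labels = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]

curve = coverage_risk_curve(probs, labels)
risk_full = risk_at_coverage(curve, 1.0)  # no abstention
risk_80 = risk_at_coverage(curve, 0.8)    # abstain on the least confident 20%
```

If confidence is well-ordered, risk falls as coverage shrinks. A flat or rising curve would indicate that the confidence scores carry little information about correctness.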
Ablation and Sensitivity Analysis
Ablation studies identify the contribution of each pipeline component. Omission of temperature scaling leaves discrimination unchanged but reintroduces miscalibration (ECE increases). Removing class weighting materially degrades precision-oriented metrics (AUPRC), underscoring the importance of balancing for rare positives. Further, decreasing calibration set size below 2000 samples causes degradation of both nominal coverage and ECE, informing practical guidance on calibration resource allocation.
Discussion and Implications
The results substantiate several important claims:
- Discrimination evaluated on random splits fails to reflect real-world utility, with distribution-aware splits highlighting performance degradation under epitope shift. Standard metrics mask the operational risks inherent in deployment to novel targets.
- Calibration, as a model property, is not inherently tied to discrimination performance. Explicit calibration post-processing (temperature scaling) and conformal abstention together provide fine-grained coverage-risk control, more closely aligning algorithmic outputs with the requirements of wet-lab validation.
- The selective prediction framework enables principled risk management. By abstaining on high-uncertainty predictions, practitioners can decrease false positives among prioritized candidates, thereby aligning computational outputs with real-world resource constraints.
These findings have direct implications for benchmark design, method development, and reporting standards in AI for immunology. All future studies aiming at practical deployment should include shift-aware splits and report coverage–risk curves alongside aggregate discrimination and calibration metrics.
Limitations and Future Directions
Negative labels are constructed synthetically, which may introduce bias due to potential false negatives. The conformal coverage guarantee is marginal and not conditional per subgroup, suggesting robustness could be improved with calibration stratified by key covariates, such as HLA alleles. Label noise and incomplete experimental assays remain confounders.
There is a clear opportunity for further extension to richer, multi-allelic MHC contexts, more nuanced abstention strategies (e.g., hierarchical or budgeted selection), and incorporation of unsupervised or semi-supervised uncertainty quantification mechanisms. Prospective wet-lab validation of abstention-influenced predictions is a logical next step.
Conclusion
This work establishes that calibrated selective abstention, leveraging deep sequence encoders, temperature scaling, and conformal prediction, delivers reliable and actionable TCR–pMHC binding predictions under challenging epitope shift. The introduction of shift-aware benchmarks, holistic calibration reporting, and coverage–risk trade-off analysis sets a rigorous standard for future development and deployment-oriented evaluation in computational immunology.