Attribution of performance gains in ProtLM.TCR to HLA information versus data size

Determine whether the marginal (~2–4%) improvement in TCR–HLA class I epitope-binding prediction reported for the ProtLM.TCR model when HLA information is added as a categorical variable is attributable to the HLA features themselves, or is confounded by the accompanying reduction in total training data size.

Background

The ProtLM.TCR model was trained on TCRβ CDR3 sequences and evaluated on peptide binding tasks, with an additional experiment incorporating HLA class I information as a categorical feature. The inclusion of HLA data yielded a small performance gain.

However, the authors explicitly note uncertainty about whether the observed improvement stems from the added HLA information or from the change in dataset size, motivating a controlled comparison, e.g. retraining without HLA features on a dataset downsampled to the same size, to isolate the effect of the HLA features from that of data volume.
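Such a size-matched ablation can be sketched in code. The following is a minimal, purely illustrative sketch, not the authors' actual protocol: synthetic features stand in for CDR3 embeddings, a plain logistic regression stands in for ProtLM.TCR, and the function name `size_matched_ablation` and all data dimensions are hypothetical. The point is the experimental structure: condition B (subset, no HLA) differs from condition C (same subset, with HLA) only in the HLA feature, while A vs B isolates the effect of data size.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=400):
    """Minimal gradient-descent logistic regression (stand-in for the real model)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(w, b, X, y):
    return float(((X @ w + b > 0).astype(int) == y).mean())

def size_matched_ablation(seed=0, n_total=3000, n_hla=1200, d=8, n_alleles=4):
    """Hypothetical controlled comparison of three training conditions:
       A) full data, sequence features only
       B) HLA-annotated subset, sequence features only (size-matched control)
       C) HLA-annotated subset, sequence + one-hot HLA features
    Comparing B vs C isolates the HLA feature; A vs B isolates data size."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_total, d))           # stand-in for CDR3 embeddings
    hla = rng.integers(0, n_alleles, n_total)   # stand-in HLA allele labels
    beta = rng.normal(size=d)
    gamma = rng.normal(size=n_alleles)          # allele-specific binding shift
    logits = X @ beta + gamma[hla] + rng.normal(scale=0.5, size=n_total)
    y = (logits > 0).astype(int)

    H = np.eye(n_alleles)[hla]                  # one-hot HLA feature block
    test = slice(0, 500)                        # held-out evaluation split
    full = slice(500, n_total)                  # condition A training split
    sub = slice(500, 500 + n_hla)               # size-matched subset (B, C)

    results = {}
    results["A_full_no_hla"] = accuracy(*train_logreg(X[full], y[full]),
                                        X[test], y[test])
    results["B_sub_no_hla"] = accuracy(*train_logreg(X[sub], y[sub]),
                                       X[test], y[test])
    Xh = np.hstack([X, H])
    results["C_sub_with_hla"] = accuracy(*train_logreg(Xh[sub], y[sub]),
                                         Xh[test], y[test])
    return results
```

Under this design, a gain of C over B that persists across seeds would support attributing the improvement to the HLA features, while a gap between A and B would quantify how much performance the smaller dataset costs on its own.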

References

"Interestingly, the authors also provided additional HLA information as a categorical variable that marginally improved binding predictions (~2–4%), although it remains unclear if this increase is related to the corresponding decrease in total data size."

Dounas et al., "Learning immune receptor representations with protein language models" (arXiv:2402.03823, 6 Feb 2024), section on TCR-specific protein language models.