Effects of immune receptor corpus design on learned representations and performance

Determine how pre-training corpus design choices affect the learned representations and downstream performance of protein language models applied to adaptive immune receptor tasks. The choices in question are: general proteins versus exclusively adaptive immune receptor sequences, full-length receptor sequences versus CDR3-only segments, whether receptor–antigen interactions are included, and whether the corpus draws on a single species or individual versus several.

Background

The review by Dounas et al. (2024) leaves open how corpus composition influences model behavior and outcomes in adaptive immunity applications. It names four corpus factors that may affect learned representations and prediction quality: domain scope, sequence granularity, interaction inclusion, and species or individual diversity.

Systematic studies dissecting these factors are needed to establish best practices for corpus construction and to guide model development and deployment in immunology.
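
One way to make such a systematic study concrete is to treat each corpus factor as an axis of a factorial design and compare only corpora that differ in a single factor. The sketch below is illustrative and is not taken from the review: the class name CorpusDesign, the level names, and the encoding of species/individual diversity as a single three-level factor are all assumptions. Some combinations (for example, general proteins with CDR3-only segments) may not be meaningful in practice and would need to be filtered out.

```python
from dataclasses import dataclass, fields
from itertools import combinations, product


@dataclass(frozen=True)
class CorpusDesign:
    """One hypothetical pre-training corpus configuration (names are assumptions)."""
    domain_scope: str           # general proteins vs. immune receptors only
    granularity: str            # full-length sequences vs. CDR3-only segments
    antigen_interactions: bool  # whether receptor-antigen pairs are included
    diversity: str              # assumed three-level encoding of species/individual diversity


# Assumed factor levels; the review names the factors but does not fix these levels.
FACTORS = {
    "domain_scope": ["general_proteins", "immune_receptors_only"],
    "granularity": ["full_length", "cdr3_only"],
    "antigen_interactions": [False, True],
    "diversity": ["single_individual", "multiple_individuals", "multiple_species"],
}


def enumerate_designs():
    """Return the full factorial grid of corpus designs."""
    names = list(FACTORS)
    return [
        CorpusDesign(**dict(zip(names, levels)))
        for levels in product(*FACTORS.values())
    ]


def single_factor_pairs(designs):
    """Return (factor, design_a, design_b) for pairs differing in exactly one factor.

    Comparing models pre-trained on such a pair attributes any performance
    difference to that one factor, holding the others fixed.
    """
    pairs = []
    for a, b in combinations(designs, 2):
        changed = [
            f.name for f in fields(CorpusDesign)
            if getattr(a, f.name) != getattr(b, f.name)
        ]
        if len(changed) == 1:
            pairs.append((changed[0], a, b))
    return pairs


designs = enumerate_designs()
pairs = single_factor_pairs(designs)
```

Under these assumed levels the grid contains 24 designs and 60 single-factor comparisons. Pre-training one model per design is expensive, so a real study would likely subsample the grid; the enumeration only makes explicit which comparisons isolate which factor.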

References

Many open questions remain regarding how the nature of the immune receptor corpus influences learned representations and model performance. For example, factors such as whether pre-training should be performed on general proteins or exclusively on immune receptors, full-length sequences versus CDR3s, receptor-antigen interactions are included, or multiple species or even individuals will influence downstream conclusions and predictions.

Dounas et al., "Learning immune receptor representations with protein language models," arXiv:2402.03823, 6 Feb 2024, section "Challenges and future perspectives".