Contrastive losses as generalized models of global epistasis

Published 4 May 2023 in q-bio.PE and cs.LG | (2305.03136v4)

Abstract: Fitness functions map large combinatorial spaces of biological sequences to properties of interest. Inferring these multimodal functions from experimental data is a central task in modern protein engineering. Global epistasis models are an effective and physically-grounded class of models for estimating fitness functions from observed data. These models assume that a sparse latent function is transformed by a monotonic nonlinearity to emit measurable fitness. Here we demonstrate that minimizing supervised contrastive loss functions, such as the Bradley-Terry loss, is a simple and flexible technique for extracting the sparse latent function implied by global epistasis. We argue by way of a fitness-epistasis uncertainty principle that the nonlinearities in global epistasis models can produce observed fitness functions that do not admit sparse representations, and thus may be inefficient to learn from observations when using a Mean Squared Error (MSE) loss (a common practice). We show that contrastive losses are able to accurately estimate a ranking function from limited data even in regimes where MSE is ineffective and validate the practical utility of this insight by demonstrating that contrastive loss functions result in consistently improved performance on benchmark tasks.
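The contrast between the two losses in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the standard pairwise form of the Bradley-Terry loss, -log sigmoid(s_i - s_j) averaged over all pairs where sequence i has higher observed fitness than sequence j. The function names, the random latent values, and the two nonlinearities (`exp` and `tanh`) are hypothetical choices made only for illustration.

```python
import numpy as np

def bradley_terry_loss(scores, fitness):
    """Mean pairwise Bradley-Terry loss over all pairs (i, j) with fitness_i > fitness_j."""
    scores = np.asarray(scores, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    # diff[i, j] = s_i - s_j for every ordered pair of sequences
    diff = scores[:, None] - scores[None, :]
    # mask[i, j] is True when sequence i was measured as fitter than sequence j
    mask = fitness[:, None] > fitness[None, :]
    # -log(sigmoid(d)) = log(1 + exp(-d)), evaluated stably via logaddexp
    return float(np.logaddexp(0.0, -diff[mask]).mean())

def mse_loss(scores, fitness):
    """Mean squared error between model scores and observed fitness."""
    scores = np.asarray(scores, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    return float(np.mean((scores - fitness) ** 2))

# Hypothetical data: a latent function observed through two different
# strictly increasing nonlinearities (both are illustrative assumptions).
rng = np.random.default_rng(0)
latent = rng.normal(size=50)
observed_exp = np.exp(3.0 * latent)
observed_tanh = np.tanh(latent)

# The Bradley-Terry loss depends on the observations only through their ordering.
bt_exp = bradley_terry_loss(latent, observed_exp)
bt_tanh = bradley_terry_loss(latent, observed_tanh)

# MSE depends on the observed values themselves, so it changes with the nonlinearity.
mse_exp = mse_loss(latent, observed_exp)
mse_tanh = mse_loss(latent, observed_tanh)
```

Because the Bradley-Terry loss only uses the ordering of the observed fitness values, `bt_exp` and `bt_tanh` coincide, while the two MSE values differ. This is the sense in which a contrastive loss can target the latent function directly rather than its nonlinearly transformed observations.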
