Impact of phylogeny on the inference of functional sectors from protein sequence data (2405.04920v2)
Abstract: Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from phylogeny, i.e., from the mere fact that homologous sequences share common ancestry. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use these data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is the most robust to phylogeny, and that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which experimental deep mutational scan data are available. We show that in these natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods rest on different premises, since they focus on conservation and on correlations, respectively. Thus, their joint use can reveal complementary information.