A Systematic Study of Joint Representation Learning on Protein Sequences and Structures (2303.06275v2)
Abstract: Learning effective protein representations is critical for a variety of tasks in biology such as predicting protein function. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel at sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods leverage 3D structural information with graph neural networks, and geometric pre-training methods show potential in function prediction tasks, but they still suffer from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting a new state of the art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.
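To make the idea of fusing a sequence encoder with a structure encoder concrete, here is a toy NumPy sketch of three common fusion patterns (feeding sequence embeddings into the structure encoder, concatenating the two encoders' outputs, and cross-attention between them). This is a conceptual illustration, not the paper's implementation: the specific fusion names, the mean-aggregation graph layer standing in for GVP/GearNet/CDConv, and all tensors and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, dim = 8, 16                                  # residues, embedding dim
seq_emb = rng.normal(size=(n_res, dim))             # stand-in for per-residue PLM embeddings
adj = (rng.random((n_res, n_res)) < 0.3).astype(float)  # toy contact-map adjacency
np.fill_diagonal(adj, 1.0)                          # self-loops so every node has a neighbor

def gnn_layer(x, adj, w):
    """One mean-aggregation message-passing layer with ReLU
    (a toy stand-in for a geometric structure encoder)."""
    deg = adj.sum(axis=1, keepdims=True)
    return np.maximum(((adj @ x) / deg) @ w, 0.0)

W = rng.normal(size=(dim, dim)) * 0.1

# 1) Serial fusion: sequence embeddings become the structure encoder's input node features.
serial = gnn_layer(seq_emb, adj, W)

# 2) Parallel fusion: encode structure from separate geometric features, then concatenate.
struct_feats = rng.normal(size=(n_res, dim))        # hypothetical geometric node features
struct_out = gnn_layer(struct_feats, adj, W)
parallel = np.concatenate([seq_emb, struct_out], axis=1)

# 3) Cross fusion: structure representations attend over sequence embeddings
#    (single-head scaled dot-product attention).
scores = struct_out @ seq_emb.T / np.sqrt(dim)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)       # row-wise softmax
cross = attn @ seq_emb

print(serial.shape, parallel.shape, cross.shape)    # (8, 16) (8, 32) (8, 16)
```

Note how the choice of fusion point changes the output dimensionality: parallel fusion doubles the feature size by concatenation, while serial and cross fusion preserve it.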
- Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12): 1315–1322.
- Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557): 871–876.
- The Protein Data Bank. Nucleic Acids Research, 28(1): 235–242.
- In Advances in Neural Information Processing Systems.
- xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein. bioRxiv, 2023–07.
- Structure-aware protein self-supervised learning. Bioinformatics, 39(4): btad189.
- A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607. PMLR.
- DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. International Conference on Learning Representations (ICLR).
- Robust deep learning–based protein sequence design using ProteinMPNN. Science, 378(6615): 49–56.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.
- Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. bioRxiv, 2023–01.
- ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10): 7112–7127.
- Continuous-Discrete Convolution for Geometry-Sequence Modeling in Proteins. In The Eleventh International Conference on Learning Representations.
- Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods, 17(2): 184–192.
- PiFold: Toward effective and efficient protein inverse folding. In The Eleventh International Conference on Learning Representations.
- Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1): 1–14.
- Self-Supervised Pre-training for Protein Embeddings Using Tertiary Structures. In AAAI.
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model. arXiv preprint arXiv:2110.15527.
- ProstT5: Bilingual Language Model for Protein Sequence and Structure. bioRxiv, 2023–07.
- Contrastive representation learning for 3d protein structures. arXiv preprint arXiv:2205.15675.
- Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures. International Conference on Learning Representations.
- RITA: A study on scaling up generative protein sequence models. In 2022 ICML Workshop on Computational Biology.
- Learning inverse folding from millions of predicted structures. ICML.
- Data-Efficient Protein 3D Geometric Pretraining via Refinement of Diffused Protein Structure Decoy. arXiv preprint arXiv:2302.10888.
- Learning from Protein Structure with Geometric Vector Perceptrons. In International Conference on Learning Representations.
- Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873): 583–589.
- Critical assessment of methods of protein structure prediction (CASP)—Round XIII. Proteins: Structure, Function, and Bioinformatics, 87(12): 1011–1020.
- Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637): 1123–1130.
- Self-supervised contrastive learning of protein representations by mutual information maximization. bioRxiv.
- Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 1–8.
- Language models enable zero-shot prediction of the effects of mutations on protein function. In Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems.
- Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning, 16990–17017. PMLR.
- The natural history of protein domains. Annual review of biophysics and biomolecular structure, 31: 45–71.
- A large-scale evaluation of computational protein function prediction. Nature methods, 10(3): 221–227.
- Evaluating Protein Transfer Learning with TAPE. In Advances in Neural Information Processing Systems.
- Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15).
- Is transfer learning necessary for protein landscape prediction? arXiv preprint arXiv:2011.03443.
- Fast end-to-end learning on protein surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15272–15281.
- ATOM3D: Tasks on Molecules in Three Dimensions. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem. In The Eleventh International Conference on Learning Representations.
- AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic acids research.
- Attention is all you need. In Advances in neural information processing systems, 5998–6008.
- LM-GVP: An extensible sequence and structure informed deep learning framework for protein property prediction. Scientific Reports, 12(1): 6832.
- Multi-level Protein Structure Pre-training via Prompt Learning. In The Eleventh International Conference on Learning Representations.
- Enzyme nomenclature.
- Pre-training of Deep Protein Models with Molecular Dynamics Simulations for Drug Binding. arXiv preprint arXiv:2204.08663.
- Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611.
- EurNet: Efficient Multi-Range Relational Modeling of Protein Structure. In ICLR 2023 - Machine Learning for Drug Discovery workshop.
- ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. In Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; and Scarlett, J., eds., Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 38749–38767. PMLR.
- PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- OntoProtein: Protein Pretraining With Gene Ontology Embedding. In International Conference on Learning Representations.
- E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking. In The Eleventh International Conference on Learning Representations.
- Protein Representation Learning by Geometric Structure Pretraining. In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022.
- Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction. In Advances in Neural Information Processing Systems.
- TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery. arXiv preprint arXiv:2202.08320.
Authors: Zuobai Zhang, Chuanrui Wang, Minghao Xu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang