A Systematic Study of Joint Representation Learning on Protein Sequences and Structures (2303.06275v2)

Published 11 Mar 2023 in q-bio.QM and cs.LG

Abstract: Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods leverage 3D structural information with graph neural networks, and geometric pre-training methods show potential in function prediction tasks but still suffer from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting a new state of the art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.
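
The abstract does not spell out the three fusion strategies, so the following is only a rough sketch of two generic ways a PLM and a structure encoder can be combined (these forms are assumptions for illustration, not necessarily the paper's exact designs). Writing $s$ for the amino-acid sequence, $\mathcal{G}$ for the residue-level structure graph, and $\mathrm{PLM}(s)_i$ for the ESM-2 embedding of residue $i$, a serial fusion feeds sequence embeddings into the structure encoder as node features,

  $h_i^{\text{serial}} = \mathrm{GNN}\bigl(\mathrm{PLM}(s)_i,\ \mathcal{G}\bigr),$

while a parallel fusion runs both encoders independently and concatenates their per-residue outputs,

  $h_i^{\text{parallel}} = \bigl[\mathrm{PLM}(s)_i \,\Vert\, \mathrm{GNN}(x_i, \mathcal{G})\bigr],$

where $x_i$ denotes the raw structural features of residue $i$.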

Authors (7)
  1. Zuobai Zhang (24 papers)
  2. Chuanrui Wang (4 papers)
  3. Minghao Xu (25 papers)
  4. Vijil Chenthamarakshan (36 papers)
  5. Aurélie Lozano (20 papers)
  6. Payel Das (104 papers)
  7. Jian Tang (327 papers)
Citations (20)

Summary

Formatting Guidelines for Anonymous Submissions to AAAI Proceedings

The document under review is a comprehensive guide for authors preparing anonymous submissions to AAAI (Association for the Advancement of Artificial Intelligence) Press proceedings, technical reports, and other publications. It standardizes the formatting requirements needed to ensure uniformity and professionalism in AAAI submissions prepared with LaTeX. Below, we summarize the key components of the paper along with its implications for authors.

Overview of Submission and Formatting Criteria

This paper details essential formatting guidelines authors must follow to prepare their documents for anonymous submission. The guidelines are structured to maintain the integrity and consistency of presented work, ensuring all submissions adhere to a standardized appearance. Key requirements include:

  • Anonymity: Because anonymous submissions require concealing author identity, authors are instructed to list "Anonymous Submission" as the author and to clear identifying metadata from their PDF files (a minimal preamble sketch follows this list).
  • Conformance with Style File: Authors must use the AAAI style file, which handles document layout, fonts, and size constraints automatically. They must not modify the style file or use commands that override its layout decisions, such as spacing or heading formats.
  • Technical Specifications: The guide itemizes paper size, column widths, and margins, and prohibits certain style files and packages that may cause inadvertent formatting changes or layout disturbances.
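
As a minimal sketch of such a setup (the style-file name aaai24.sty and the metadata field are assumptions that vary by year; check the official author kit), an anonymized submission might begin:

  % Anonymized AAAI-style preamble sketch (assumed file and field names).
  \documentclass[letterpaper]{article}
  \usepackage{aaai24}       % AAAI style file; must not be modified
  \usepackage{times}        % Type 1 text fonts required by the guidelines
  \usepackage{helvet}
  \usepackage{courier}
  \usepackage{graphicx}
  \frenchspacing
  \pdfinfo{/TemplateVersion (2024.1)}  % keep PDF metadata free of author identity

  \title{A Systematic Study of Joint Representation Learning on Protein Sequences and Structures}
  \author{Anonymous Submission}        % conceal author identity for review

  \begin{document}
  \maketitle
  % paper body here
  \end{document}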

Detailed Sections and Instructions

The paper breaks down the formatting into several detailed sections, encompassing every aspect of paper preparation, including sections on copyrights, illustrations, and acceptable font choices. Notably:

  1. Electronic File Requirements: Submissions must be on US letter-size paper, formatted in two columns, with all fonts embedded, including fonts used in figures.
  2. Metadata and Bibliography: The document recommends the BibTeX system with a specified bibliography style for references, ensuring consistency in citation formatting (see the sketch after this list).
  3. Illustration Specifications: It elaborates on image and figure standards, disallowing certain file types and mandating sufficiently high resolutions so figures remain readable in the published article.
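
As an illustrative sketch (the bibliography style name aaai24 is an assumption; the actual name comes from the author kit), the bibliography would be wired up roughly as follows:

  % In the .tex file:
  \bibliographystyle{aaai24}     % assumed AAAI bibliography style name
  \bibliography{references}      % entries live in references.bib

  % A hypothetical entry in references.bib, for illustration only:
  @article{doe2023example,
    author  = {Doe, Jane and Roe, Richard},
    title   = {An Example Article Title},
    journal = {Example Journal},
    volume  = {1},
    pages   = {1--10},
    year    = {2023}
  }

Entries are then cited in the text with \cite{doe2023example}, and BibTeX formats them consistently according to the chosen style.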

Technological Considerations and Constraints

There is an emphasis on compatibility issues and technical constraints authors might encounter when using LaTeX. For example, the paper warns against Type 3 (bitmapped) fonts due to compatibility limitations, advising the use of more robust Type 1 fonts instead. Authors are also forewarned about illegibility and device-compatibility problems with images, and are advised to prepare figures with proper graphics programs outside LaTeX.
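
A small sketch along these lines (figure1.pdf is a hypothetical file name; whether the AAAI kit already loads these packages is something to verify against the author kit):

  % T1 font encoding selects outline (Type 1) fonts rather than
  % bitmapped Type 3 fonts.
  \usepackage[T1]{fontenc}

  % Include an externally prepared vector figure at column width so it
  % stays legible in the two-column layout (requires \usepackage{graphicx},
  % as in the preamble sketch above).
  \includegraphics[width=\columnwidth]{figure1.pdf}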

Implications and Future Considerations

The guidelines presented in this document are critical for authors seeking to meet the required standardization in AAAI publications; they ease the production process and maintain uniformity across publications. This standardization not only improves efficiency within AAAI but also aids the recognition and readability of AAAI papers more broadly.

As academic publishing moves towards more automated submission and formatting checks, this document indicates how authors can adapt their practices to meet future technological demands and policies in AI publishing. Although LaTeX remains the primary tool, authors can anticipate evolving standards that encompass more advanced formatting software and, potentially, automated correction tools that support or replace the manual proofing currently required before publication.

In summary, this paper sets the benchmark for preparing AAAI submissions, with an eye towards efficient, readable, and consistent documentation practices across AI research dissemination platforms.