Learning the Language of Protein Structure (2405.15840v1)
Abstract: Representation learning and de novo generation of proteins are pivotal computational biology tasks. Whilst NLP techniques have proven highly effective for protein sequence modelling, structure modelling presents a complex challenge, primarily due to its continuous and three-dimensional nature. Motivated by this discrepancy, we introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations. This method transforms the continuous, complex space of protein structures into a manageable, discrete format with a codebook ranging from 4096 to 64000 tokens, achieving high-fidelity reconstructions with backbone root mean square deviations (RMSD) of approximately 1-5 Å. To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can generate novel, diverse, and designable protein structures. Our approach not only provides representations of protein structure, but also mitigates the challenges of disparate modal representations and sets a foundation for seamless, multi-modal integration, enhancing the capabilities of computational methods in protein design.
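To make the two quantities in the abstract concrete, the following is a minimal sketch of (a) generic nearest-neighbour vector quantization, which maps continuous vectors to discrete token indices, and (b) an RMSD computation between two coordinate sets. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy 4-entry codebook, and the toy coordinates are all hypothetical, the abstract does not specify the quantizer used, and the RMSD here assumes the two structures are already superposed (no alignment step is performed).

```python
import math


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        # Squared Euclidean distance from v to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Replace each token index with its codebook vector."""
    return [codebook[t] for t in tokens]


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized coordinate sets.

    Assumption: the sets are already superposed; no optimal
    rotation or translation is computed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Toy codebook with 4 entries (hypothetical; the abstract reports 4096 to 64000).
codebook = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Toy continuous vectors standing in for encoder outputs.
latents = [(0.9, 0.1, 0.0), (0.1, 0.0, 0.8), (0.0, 0.2, 0.1)]
tokens = quantize(latents, codebook)
quantized = dequantize(tokens, codebook)

# Toy "backbone" coordinates in angstroms: the prediction is shifted by 1 along z.
backbone_true = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
backbone_pred = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
backbone_rmsd = rmsd(backbone_true, backbone_pred)

print(tokens, backbone_rmsd)
```

In this sketch the sequence of integers in `tokens` plays the role of the discrete representation on which a language model could be trained, while `rmsd` measures how far a decoded structure lies from the original. A practical evaluation would normally superpose the two structures first (for example with the Kabsch algorithm) before computing the deviation.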
Authors:
- Benoit Gaujac
- Liviu Copoiu
- Timothy Atkinson
- Thomas Pierrot
- Thomas D. Barrett
- Jérémie Donà