FoldToken: Learning Protein Language via Vector Quantization and Beyond (2403.09673v2)
Abstract: Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge because their modeling paradigm contrasts with that of discrete sequences. We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols. This approach projects residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We refer to the learned discrete symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the created protein language to general backbone inpainting and antibody design tasks, building the first GPT-style model (FoldGPT) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (SoftCVQ).
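As a rough illustration of what "projecting into a discrete space, guided by a reconstruction loss" can mean, the sketch below implements plain nearest-neighbor vector quantization on toy 2D vectors. It is not SoftCVQ, whose details the abstract does not give. The function names, the codebook values, and the embeddings are all hypothetical, chosen only to make the example self-contained.

```python
# Minimal sketch of vanilla vector quantization (NOT the paper's SoftCVQ).
# All names and values here are illustrative assumptions.

def quantize(embeddings, codebook):
    """Map each continuous embedding to the index of its nearest codebook vector."""
    tokens = []
    for vec in embeddings:
        # Squared Euclidean distance from this embedding to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def reconstruct(tokens, codebook):
    """Look up the codebook vector for each discrete token."""
    return [codebook[t] for t in tokens]

def reconstruction_loss(embeddings, recon):
    """Mean squared error per embedding between originals and reconstructions."""
    total = sum(
        (a - b) ** 2
        for vec, rec in zip(embeddings, recon)
        for a, b in zip(vec, rec)
    )
    return total / len(embeddings)

# Hypothetical toy codebook with three entries and four per-residue embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
embeddings = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

tokens = quantize(embeddings, codebook)   # discrete symbols, one per embedding
recon = reconstruct(tokens, codebook)     # continuous vectors recovered from tokens
loss = reconstruction_loss(embeddings, recon)

print(tokens)  # [0, 1, 2, 0]
print(loss)
```

In a learned tokenizer the codebook and the encoder producing the embeddings would be trained so that this reconstruction error stays small. Here the codebook is fixed, so the sketch shows only the lookup and the loss computation.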