FoldToken: Learning Protein Language via Vector Quantization and Beyond (2403.09673v2)

Published 4 Feb 2024 in q-bio.BM, cs.AI, and cs.LG

Abstract: Is there a foreign language that describes protein sequences and structures simultaneously? Protein structures, represented as continuous 3D coordinates, have long been difficult to model alongside sequences because the two follow contrasting modeling paradigms: continuous geometry versus discrete symbols. We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols: residue types and structures are projected into a discrete space, guided by a reconstruction loss that preserves information. We call the learned discrete symbols FoldTokens; a sequence of FoldTokens serves as a new protein language that unifies sequence and structure into a single modality. We apply this protein language to general backbone inpainting and antibody design, building the first GPT-style model (FoldGPT) for sequence-structure co-generation, with promising results. Key to this success is a substantially enhanced vector quantization module, Soft Conditional Vector Quantization (SoftCVQ).
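The abstract describes the core mechanism only at a high level: continuous sequence-structure features are quantized against a learned codebook, a soft (softmax-weighted) assignment keeps the mapping differentiable, and a reconstruction loss preserves information. The minimal PyTorch sketch below illustrates that general soft-quantization pattern; it is not the paper's implementation, and the class name, codebook size, temperature, and the MSE reconstruction stand-in are all assumptions made for illustration.

```python
# Minimal sketch of soft vector quantization over a learned codebook, in the
# spirit of the SoftCVQ idea summarized in the abstract. Not the authors' code:
# names, sizes, and the toy reconstruction objective are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftVectorQuantizer(nn.Module):
    """Map continuous features to a soft mixture of discrete code vectors."""

    def __init__(self, codebook_size: int = 1024, dim: int = 128, temperature: float = 1.0):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)  # learnable code vectors
        self.temperature = temperature

    def forward(self, z: torch.Tensor):
        # z: (batch, length, dim) continuous per-residue features from an encoder.
        logits = z @ self.codebook.weight.t()             # similarity to every code
        probs = F.softmax(logits / self.temperature, dim=-1)
        z_q = probs @ self.codebook.weight                # soft, differentiable quantization
        tokens = logits.argmax(dim=-1)                    # hard token ids: the "protein language"
        return z_q, tokens


if __name__ == "__main__":
    # Toy usage: 8 residues with hypothetical 128-d fused sequence-structure features.
    vq = SoftVectorQuantizer()
    feats = torch.randn(2, 8, 128)
    z_q, tokens = vq(feats)
    recon_loss = F.mse_loss(z_q, feats)  # stand-in for a real reconstruction loss
    print(tokens.shape, recon_loss.item())
```

In this kind of scheme, the discrete token ids are what a downstream autoregressive model (FoldGPT in the paper) would be trained on, while the soft assignment is what allows the tokenizer itself to be trained end-to-end with a reconstruction objective.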
