VQPL: Vector Quantized Protein Language (2310.04985v1)
Abstract: Is there a foreign language that describes protein sequences and structures simultaneously? Protein structures, represented as continuous 3D points, have long been difficult to model together with sequences, because sequences are discrete and the two call for contrasting modeling paradigms. To represent protein sequence-structure as discrete symbols, we propose VQProteinformer, which projects residue types and structures into a discrete space and is supervised by a reconstruction loss to ensure information preservation. The sequential latent codes of the residues form a new quantized protein language, transforming protein sequence-structure into a unified modality. We demonstrate the potential of this protein language on predictive and generative tasks, which may not only advance protein research but also connect the protein-related and NLP-related fields. The proposed method will be continually improved to unify more protein modalities, including text and point clouds.
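
To make the idea of projecting residues into a discrete space concrete, the following is a minimal sketch of generic nearest-neighbor vector quantization. It is not the VQProteinformer itself, whose architecture the abstract does not specify. The sketch assumes that some encoder has already produced one continuous embedding per residue and that a small codebook is given; the names `quantize`, `dequantize`, `reconstruction_error`, `codebook`, and `residue_embeddings`, as well as the toy 2D values, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each embedding to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in embeddings
    ]


def dequantize(codes: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each discrete code with its codebook vector."""
    return [list(codebook[k]) for k in codes]


def reconstruction_error(embeddings: Sequence[Sequence[float]],
                         codebook: Sequence[Sequence[float]]) -> float:
    """Mean squared distance between embeddings and their quantized versions."""
    recon = dequantize(quantize(embeddings, codebook), codebook)
    total = sum(squared_distance(z, r) for z, r in zip(embeddings, recon))
    return total / len(embeddings)


# Hypothetical toy data: a 4-entry codebook and three per-residue embeddings.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
residue_embeddings = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

# One discrete token per residue: the "sentence" in the quantized language.
codes = quantize(residue_embeddings, codebook)
error = reconstruction_error(residue_embeddings, codebook)
```

In this sketch the list `codes` plays the role of the sequential latent codes, and `reconstruction_error` stands in, very loosely, for a reconstruction loss measured in embedding space. A learned model would typically train the encoder, codebook, and decoder jointly and measure reconstruction on the residue types and structures themselves rather than on fixed embeddings.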