ProtVec: A Continuous Distributed Representation of Biological Sequences (1503.05140v2)

Published 17 Mar 2015 in q-bio.QM, cs.AI, cs.LG, and q-bio.GN

Abstract: We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93% ± 0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined.

Summary

The paper introduces ProtVec, an NLP-inspired model that transforms protein sequences into numerical embeddings capturing key biophysical properties.
The methodology uses a Skip-gram model to embed protein sequences, achieving 93% accuracy in classifying over 7,000 protein families.
The study demonstrates practical benefits for deep proteomics and genomics by providing a versatile, pre-trained tool for bioinformatics analysis.

Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

The paper "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics" by Ehsaneddin Asgari and Mohammad R. K. Mofrad presents a novel approach to biological sequence representation. The authors introduce bio-vectors (BioVec), comprising protein-vectors (ProtVec) and gene-vectors (GeneVec), which apply methods from natural language processing (NLP) to model biological sequences. The work aims to improve feature extraction and representation for biological data, with a view to broadening the use of deep learning in proteomics and genomics.

Conceptual Contributions

The core contribution is the adaptation of distributed representation strategies from NLP to biological sequences. The authors use the Skip-gram model, a neural-network-based technique, to embed sequences such as proteins into a dense n-dimensional space. Each protein sequence is transformed into a numerical representation that encapsulates biophysical and biochemical properties. As a result, computational models can work directly with sequence data rather than relying on hand-engineered features, such as hydrophobicity or polarity metrics, that were conventionally required in similar studies.
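
To make the Skip-gram idea concrete, here is a minimal sketch of how a protein sequence could be turned into "words" and (center, context) training pairs. It is an illustration under stated assumptions, not the authors' implementation: the use of 3-residue words read at three offsets, the window size, the toy sequence, and the function names `shifted_kmers` and `skipgram_pairs` are all choices made for this example and are not specified in the text above.

```python
from typing import List, Tuple


def shifted_kmers(sequence: str, k: int = 3) -> List[List[str]]:
    """Split a sequence into k lists of non-overlapping k-mers, one per offset.

    Assumption for illustration: each offset is treated as a separate
    "sentence" whose words are k consecutive residues.
    """
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQR"  # hypothetical sequence, not taken from the paper
    for sentence in shifted_kmers(toy_sequence):
        print(sentence, skipgram_pairs(sentence, window=1))
```

A Skip-gram model would then be trained to predict each context word from its center word, and the learned input weights would serve as the word embeddings. The training step itself is omitted here.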

Key Numerical Findings

The approach was validated empirically by applying ProtVec to protein family classification. Using 324,018 protein sequences from Swiss-Prot, the model achieved an average accuracy of 93% ± 0.06% across 7,027 protein families, which the authors report as outperforming existing family classification methods. The method also distinguished disordered proteins effectively: using support vector machine classifiers, it separated FG-Nup sequences from structured PDB proteins with 99.8% accuracy and separated disordered from ordered DisProt sequences with 100% accuracy. These results suggest that sequence-derived embeddings alone carry substantial information about protein structure.
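
The classification setup can be sketched roughly as follows, assuming a protein is represented by summing the vectors of its overlapping 3-residue words. Everything here is a stand-in for illustration: the embeddings are pseudo-random rather than trained, the nearest-neighbour rule by cosine similarity replaces the support vector machines mentioned in the abstract, and the sequences, family labels, and function names are hypothetical.

```python
import math
import random
from typing import Dict, List

DIM = 8  # toy dimensionality chosen for this example


def toy_embedding(kmer: str, dim: int = DIM) -> List[float]:
    """Deterministic pseudo-random vector per k-mer (stand-in for a trained embedding)."""
    rng = random.Random(kmer)  # seeding with the k-mer keeps results reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def protein_vector(sequence: str, k: int = 3, dim: int = DIM) -> List[float]:
    """Sum the vectors of all overlapping k-mers into one fixed-length vector."""
    total = [0.0] * dim
    for i in range(len(sequence) - k + 1):
        vec = toy_embedding(sequence[i:i + k], dim)
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_family(query: List[float], references: Dict[str, List[float]]) -> str:
    """Return the label whose reference vector is most similar to the query."""
    return max(references, key=lambda label: cosine(query, references[label]))


if __name__ == "__main__":
    # Hypothetical sequences and labels, for illustration only.
    references = {
        "family_A": protein_vector("MKTAYIAKQR"),
        "family_B": protein_vector("GGSFGFGAPA"),
    }
    print(nearest_family(protein_vector("MKTAYIAKQR"), references))
```

The point of the sketch is the shape of the pipeline: a variable-length sequence becomes a fixed-length vector, which any standard classifier can then consume.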

Implications and Future Directions

The work has both practical and theoretical implications. Practically, ProtVec offers a pre-trained, versatile tool for a range of bioinformatics applications, from structure prediction to protein-protein interaction prediction. Theoretically, the successful application of NLP-style embeddings to bioinformatics invites further investigation into analogous techniques that might be transferable across domains.

Future developments could include expanding this framework to incorporate richer biological context or combining it with other machine learning frameworks, potentially leading to a generalized platform for biomolecular data interpretation. Extending similar representation techniques to nucleic acid sequences, such as DNA or RNA, could further unify the data-driven approach in genomics.

Finally, the release of related data and web-based tools facilitates ongoing research by providing accessible resources for deploying and refining these models. This accessibility can accelerate community-driven improvements and adaptations of the ProtVec model for emergent biological questions and datasets.

In conclusion, Asgari and Mofrad's work is pivotal in illustrating the potential of NLP-inspired techniques in the bioinformatics landscape, paving the way for deeper integration of deep learning in biological sequence analysis.
