ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network (1710.07016v1)

Published 19 Oct 2017 in q-bio.QM and cs.LG

Abstract: With the development of next generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from protein sequences because of limitations of traditional biological experimental techniques. Protein function prediction has been a long standing challenge to fill the gap between the huge amount of protein sequences and the known function. In this paper, we propose a novel method to convert the protein function problem into a language translation problem by the new proposed protein sequence language "ProLan" to the protein function language "GOLan", and build a neural machine translation model based on recurrent neural networks to translate "ProLan" language to "GOLan" language. We blindly tested our method by attending the latest third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluate the performance of our methods on selected proteins whose function was released after CAFA competition. The good performance on the training and testing datasets demonstrates that our new proposed method is a promising direction for protein function prediction. In summary, we first time propose a method which converts the protein function prediction problem to a language translation problem and applies a neural machine translation model for protein function prediction.

Citations (167)

Summary

The paper introduces ProLanGO, a novel model that treats protein function prediction as a neural machine translation problem between protein sequences (ProLan) and function terms (GOLan).
ProLanGO uses recurrent neural networks to 'translate' protein sequences, represented as k-mers in ProLan, into predicted Gene Ontology (GO) terms structured in the GOLan language.
Quantitative evaluation shows ProLanGO performs well among sequence-based methods but less so than top homology-based approaches, highlighting potential for future refinement and scalability.

ProLanGO: A Novel Approach to Protein Function Prediction Utilizing Neural Machine Translation

This paper introduces ProLanGO, a model that repurposes neural machine translation (NMT) techniques for protein function prediction. The authors represent protein sequences in a constructed language, "ProLan," and protein functions in a second constructed language, "GOLan," so that function prediction can be treated as translation from one language to the other. The translation model is built on recurrent neural networks (RNNs).

Methodological Framework

Data Collection and Language Construction: The authors draw their data from the UniProtKB knowledge database, focusing on 523,990 protein sequences and 42,819 valid Gene Ontology (GO) terms. ProLanGO converts each protein sequence into a sequence of k-mers, which serve as the "words" of the ProLan language. For the GOLan language, GO terms, traditionally represented by numeric identifiers, are transformed into a concise, tree-structured alphabetical format that draws on the directed acyclic graph structure of the Gene Ontology.
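The two language-construction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy longest-match tokenization, the k-mer lengths of 3 to 5, the single-residue fallback, and the fixed-width base-26 code over term indices are all assumptions made for illustration, and the function names `tokenize_protein` and `index_to_word` are hypothetical. In particular, the base-26 code only shows one way to obtain a concise alphabetical word and does not reproduce the tree-structured ordering mentioned above.

```python
from string import ascii_uppercase


def tokenize_protein(seq, vocab, min_k=3, max_k=5):
    """Greedily split a protein sequence into known k-mer 'words'.

    Assumption: at each position the longest k-mer found in `vocab` wins;
    if none matches, the single residue is emitted as a fallback word.
    """
    words = []
    i = 0
    while i < len(seq):
        for k in range(max_k, min_k - 1, -1):
            frag = seq[i:i + k]
            if len(frag) == k and frag in vocab:
                words.append(frag)
                i += k
                break
        else:
            # No vocabulary k-mer starts here: fall back to one residue.
            words.append(seq[i])
            i += 1
    return words


def index_to_word(index, width=4):
    """Encode a non-negative term index as a fixed-width word over A-Z.

    Assumption: terms are numbered 0, 1, 2, ... and written in base 26.
    """
    if index < 0 or index >= 26 ** width:
        raise ValueError("index does not fit in the requested width")
    letters = []
    for _ in range(width):
        index, remainder = divmod(index, 26)
        letters.append(ascii_uppercase[remainder])
    return "".join(reversed(letters))


# Toy vocabulary and sequence, invented for this example.
toy_vocab = {"MKT", "MKTA", "YIA", "AYIAK"}
print(tokenize_protein("MKTAYIAKQ", toy_vocab))
print(index_to_word(27))
```

With four letters, a base-26 code can distinguish 26^4 = 456,976 indices, which is enough to give each of the 42,819 GO terms a distinct word.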

Model Architecture: ProLanGO uses RNNs to perform the translation. The model combines a conventional NMT configuration with an extended NMT configuration. The extended configuration broadens the set of predicted functions by adding descendants of the translated GO terms to the output, which mitigates the bounded output of the conventional configuration.
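The idea of extending a translation output with GO term descendants can be sketched as a graph traversal. This is only an illustration under stated assumptions: the summary does not say how descendants are selected or ranked, so the sketch simply appends every descendant in sorted order. The `children` mapping, the placeholder term labels, and the function names `descendants` and `extend_predictions` are hypothetical.

```python
from collections import deque


def descendants(term, children):
    """Return all terms reachable from `term` by following child edges."""
    seen = set()
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def extend_predictions(predicted, children):
    """Append descendants of each predicted term, keeping the original order first.

    Assumption: every descendant is added; a real system would likely
    filter or score them instead.
    """
    extended = list(predicted)
    for term in predicted:
        for d in sorted(descendants(term, children)):
            if d not in extended:
                extended.append(d)
    return extended


# Toy DAG with placeholder labels (not real GO identifiers).
toy_children = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"]}
print(extend_predictions(["T2"], toy_children))
```

Because the ontology is a directed acyclic graph rather than a tree, a term such as `T4` can be reached through more than one parent; the `seen` set keeps it from being visited twice.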

Performance Evaluation

To benchmark the model, the authors ran several evaluations, including blind testing through participation in the CAFA3 challenge. In these evaluations, ProLanGO was compared against existing methods such as SMISS, PANNZER, FANN-GO, and DeepGO.

Quantitative Findings: Measured by precision and recall over the top-n predictions, ProLanGO outperformed some sequence-based prediction models but fell short of the top-tier homology-based methods. Results are reported at several thresholds; they show the model's strength among sequence-only approaches and point to areas for future refinement.
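For readers unfamiliar with top-n evaluation, the following sketch shows how precision and recall can be computed for a single protein from a ranked list of predicted terms. It is a generic formulation, not the CAFA3 scoring code or the authors' exact protocol; the function name `precision_recall_at_n` and the placeholder term labels are hypothetical.

```python
def precision_recall_at_n(ranked_predictions, true_terms, n):
    """Compute precision and recall over the top-n predicted terms.

    precision = correct terms in the top n / number of terms in the top n
    recall    = correct terms in the top n / number of true terms
    """
    top = ranked_predictions[:n]
    hits = sum(1 for term in top if term in true_terms)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    return precision, recall


# Placeholder labels stand in for GO terms.
ranked = ["a", "b", "c", "d"]
truth = {"a", "c", "e"}
for n in (1, 2, 4):
    print(n, precision_recall_at_n(ranked, truth, n))
```

Raising n usually trades precision for recall, which is why results of this kind are reported at several thresholds rather than at a single cutoff.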

Theoretical and Practical Implications

Casting protein function prediction as an NMT problem shows how sequence information can be used without resorting to auxiliary database searches that depend on homology. The approach is potentially scalable to the vast amount of sequence information available, particularly for newly sequenced proteins.

Future Research Directions

The paper suggests several avenues for future research that could enhance the ProLanGO model's predictive power. Among these is the potential incorporation of biologically meaningful sequence fragments into the ProLan language and improved ranking methodologies for predicted GO terms. The possibility of increasing k-mer lengths past those currently in use is also noted as a method for enhancing linguistic richness within the ProLan framework. Additionally, alternative methods for bucketing and padding may offer improved efficiency and accuracy in handling variable-length sequence inputs.
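Bucketing and padding, mentioned above as an area for improvement, is a common way to batch variable-length inputs for an RNN. The sketch below shows the general technique only: the bucket sizes, the `<PAD>` token, the policy of setting aside sentences longer than the largest bucket, and the function name `bucket_and_pad` are assumptions, not details taken from the paper.

```python
def bucket_and_pad(sentences, bucket_sizes, pad="<PAD>"):
    """Place each sentence in the smallest bucket that fits and pad it to that size.

    Returns (buckets, skipped): `buckets` maps bucket size to padded sentences,
    and `skipped` holds sentences longer than the largest bucket.
    """
    sizes = sorted(bucket_sizes)
    buckets = {size: [] for size in sizes}
    skipped = []
    for words in sentences:
        for size in sizes:
            if len(words) <= size:
                buckets[size].append(list(words) + [pad] * (size - len(words)))
                break
        else:
            # Longer than every bucket: set aside rather than truncate.
            skipped.append(list(words))
    return buckets, skipped


toy_sentences = [["MKT"], ["MKT", "AYI", "AKQ"], ["A", "B", "C", "D", "E", "F"]]
buckets, skipped = bucket_and_pad(toy_sentences, [2, 4])
print(buckets)
print(skipped)
```

The choice of bucket sizes determines both how much padding is wasted and how many long sequences are excluded, which is one reason alternative schemes could affect efficiency and accuracy.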

In conclusion, ProLanGO is an early application of neural machine translation to protein function prediction, and the challenges and results detailed in the paper provide a starting point for further work in computational biology.