- The paper introduces ProLanGO, a novel model that treats protein function prediction as a neural machine translation problem between protein sequences (ProLan) and function terms (GOLan).
- ProLanGO uses recurrent neural networks to 'translate' protein sequences, represented as k-mers in ProLan, into predicted Gene Ontology (GO) terms structured in the GOLan language.
- Quantitative evaluation shows ProLanGO outperforms some sequence-based methods but trails the top homology-based approaches, leaving room for refinement of the sequence-only approach.
ProLanGO: A Novel Approach to Protein Function Prediction Utilizing Neural Machine Translation
This paper introduces ProLanGO, a model that repurposes neural machine translation (NMT) techniques for protein function prediction. The authors convert protein sequences into a conceptual language, dubbed "ProLan," and map protein functions to a second conceptual language, "GOLan," so that function prediction can be treated as a translation problem. Recurrent neural networks (RNNs) perform the translation between these two custom languages.
Methodological Framework
Data Collection and Language Construction: The authors draw their data from the UniProtKB knowledgebase, using 523,990 protein sequences and 42,819 valid Gene Ontology (GO) terms. ProLanGO splits each protein sequence into a sequence of k-mers, which serve as the "words" of the ProLan language. For the GOLan language, GO terms, traditionally represented by numeric identifiers, are transformed into a concise alphabetical format that follows the structure of the GO directed acyclic graph.
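The two language constructions described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the function names `split_into_kmers` and `index_to_word` are hypothetical, the k-mers are cut as consecutive non-overlapping windows, and each GO term is assumed to already have an integer index (for example, its position in some traversal of the ontology). The fixed width of four letters is only an arithmetic observation: 26^3 = 17,576 is smaller than 42,819, while 26^4 = 456,976 is large enough.

```python
from string import ascii_uppercase


def split_into_kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a protein sequence into consecutive, non-overlapping k-mers.

    Assumption for illustration: a trailing fragment shorter than k is
    kept as its own token rather than dropped.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


def index_to_word(index: int, width: int = 4) -> str:
    """Encode a non-negative integer as a fixed-width base-26 word (A-Z).

    Assumption for illustration: the integer is a hypothetical index
    assigned to a GO term, not the GO identifier itself.
    """
    if index < 0 or index >= 26 ** width:
        raise ValueError("index out of range for the chosen width")
    letters = []
    for _ in range(width):
        index, remainder = divmod(index, 26)
        letters.append(ascii_uppercase[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(letters))


if __name__ == "__main__":
    print(split_into_kmers("MKTAYIAK", k=3))
    print(index_to_word(27))
```

A fixed-width word keeps every GOLan token the same length, which makes the target vocabulary easy to enumerate. Whether the authors use exactly this encoding is not stated in this summary.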
Model Architecture: ProLanGO uses RNNs to perform the translation. The model combines a conventional NMT configuration with an extended one. The extended configuration widens the set of predicted functions by adding descendants of the translated GO terms to the output, which mitigates the limited output range of the conventional configuration.
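The idea of extending a translation output with GO term descendants can be sketched as a graph traversal. This is a generic illustration, not the authors' procedure: the function name `extend_with_descendants`, the `children` mapping, and the toy term labels are all hypothetical, and the sketch adds every reachable descendant without any ranking or cut-off.

```python
from collections import deque


def extend_with_descendants(
    predicted: list[str], children: dict[str, list[str]]
) -> list[str]:
    """Return the predicted terms followed by all of their descendants.

    `children` maps a term to its direct child terms in a directed
    acyclic graph. Terms are visited breadth-first and listed once.
    """
    extended = list(dict.fromkeys(predicted))  # drop duplicates, keep order
    seen = set(extended)
    queue = deque(extended)
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in seen:
                seen.add(child)
                extended.append(child)
                queue.append(child)
    return extended


if __name__ == "__main__":
    # Toy DAG with made-up labels: D has two parents, B and C.
    toy_children = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
    print(extend_with_descendants(["A"], toy_children))
```

Because the GO structure is a directed acyclic graph rather than a tree, a term can be reached through several parents. The `seen` set ensures such a term is added only once.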
Performance Evaluation
To benchmark the model, the authors ran several evaluations, including blind testing through participation in the CAFA3 challenge. In these evaluations, ProLanGO was compared against existing methods such as SMISS, PANNZER, FANN-GO, and DeepGO.
Quantitative Findings: Measured by precision and recall over the top-n predictions, ProLanGO outperformed some sequence-based prediction models but fell short of the top-tier homology-based methods. The results, reported at several thresholds, show that the translation approach is competitive among sequence-only methods while leaving clear room for refinement.
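For readers unfamiliar with top-n evaluation, the following sketch shows how precision and recall are commonly computed for a single protein from a ranked list of predicted terms. It uses the standard textbook definitions. The function name `precision_recall_at_n` and the toy labels are hypothetical, and the exact protocol used in the paper or in CAFA3 may differ (for example, in how scores are averaged across proteins).

```python
def precision_recall_at_n(
    ranked_predictions: list[str], true_terms: set[str], n: int
) -> tuple[float, float]:
    """Compute precision and recall over the top-n ranked predictions.

    precision = correct predictions in the top n / predictions in the top n
    recall    = correct predictions in the top n / number of true terms
    """
    top = ranked_predictions[:n]
    hits = sum(1 for term in top if term in true_terms)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    return precision, recall


if __name__ == "__main__":
    ranked = ["A", "B", "C", "D"]   # toy predictions, best first
    truth = {"A", "C", "E"}         # toy annotations
    for n in (1, 2, 4):
        print(n, precision_recall_at_n(ranked, truth, n))
```

Raising n can only keep recall the same or increase it, while precision may fall as lower-ranked predictions are included. This is why results are usually reported at several values of n.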
Theoretical and Practical Implications
Recasting protein function prediction as an NMT problem shows how sequence information can be used directly, without auxiliary database searches that depend on homology. This makes the approach potentially scalable to the large volume of available sequence data, and particularly relevant to newly sequenced proteins.
Future Research Directions
The paper suggests several avenues for future research that could improve ProLanGO's predictive power. Among these are the incorporation of biologically meaningful sequence fragments into the ProLan language and better ranking methods for predicted GO terms. Increasing k-mer lengths beyond those currently in use is also noted as a way to enrich the ProLan vocabulary. Additionally, alternative methods for bucketing and padding may improve efficiency and accuracy in handling variable-length sequence inputs.
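As background on the last point, bucketing and padding are a common way to batch variable-length token sequences for RNN training: each sequence is assigned to the smallest bucket that fits it and padded up to that bucket's length. The sketch below is a generic illustration under stated assumptions. The bucket sizes, the `<pad>` token, and the function names `pick_bucket` and `pad_to_bucket` are hypothetical and are not taken from the paper.

```python
from typing import Optional

PAD = "<pad>"  # hypothetical padding token


def pick_bucket(length: int, bucket_sizes: list[int]) -> Optional[int]:
    """Return the smallest bucket size that fits `length`, or None."""
    for size in sorted(bucket_sizes):
        if length <= size:
            return size
    return None


def pad_to_bucket(
    tokens: list[str], bucket_sizes: list[int]
) -> Optional[list[str]]:
    """Pad a token list to its bucket length.

    Returns None when the sequence is longer than the largest bucket,
    leaving the caller to decide whether to skip or truncate it.
    """
    size = pick_bucket(len(tokens), bucket_sizes)
    if size is None:
        return None
    return tokens + [PAD] * (size - len(tokens))


if __name__ == "__main__":
    print(pad_to_bucket(["MKT", "AYI", "AK"], [2, 4, 8]))
```

The trade-off is between wasted computation on padding (few, wide buckets) and smaller, less uniform batches (many, narrow buckets).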
In conclusion, ProLanGO is an early step in applying neural machine translation to protein function prediction. Both the limitations and the promising results reported in the paper provide a starting point for further work in computational biology.