ProDiGe: PRioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples (1106.0134v1)

Published 1 Jun 2011 in q-bio.QM and stat.ML

Abstract: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Here we propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which allows to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.


ProDiGe: Prioritization of Disease Genes Using Multitask Machine Learning

The paper introduces ProDiGe, a computational method for prioritizing candidate disease genes. It combines learning from positive and unlabeled examples with multitask learning across diseases, integrates several sources of gene data, and aims to reduce the cost and time of identifying disease-causing genes experimentally.

Overview of the Problem and Methodology

Identifying genes associated with human diseases is crucial for advancing diagnoses and therapies. Linkage analysis and high-throughput genomic and proteomic techniques typically produce lists of tens or hundreds of candidate genes, so computational strategies are needed to prioritize them. ProDiGe addresses this with a guilt-by-association strategy: candidates are ranked by their similarity to genes already known to cause disease. It extends that strategy by integrating heterogeneous data sources and by using multitask learning to share known disease genes across diseases.

ProDiGe frames gene prioritization as Learning from Positive and Unlabeled (PU) examples. Known disease genes are the positive examples, and all remaining genes form a much larger unlabeled set that may still contain undiscovered disease genes, so they cannot be treated as reliable negatives. The method uses Support Vector Machines (SVMs) in a multitask learning framework, which improves predictive accuracy across diseases by borrowing strength from their phenotypic similarities.
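To make the PU setting concrete, the sketch below shows one common PU strategy: repeatedly drawing a random subsample of the unlabeled set as provisional negatives, fitting a base scorer, and averaging the scores of the unlabeled items left out of each subsample. This is an illustration under stated assumptions, not the paper's implementation: the base scorer here is a simple centroid-distance rule standing in for the SVM, and the function names, subsample size, and number of rounds are all hypothetical.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Score each unlabeled item; a higher score means more positive-like.

    Each round treats a random subsample of the unlabeled set as provisional
    negatives and scores only the unlabeled items left out of that subsample.
    Scores are averaged over the rounds in which an item was left out.
    """
    rng = random.Random(seed)
    # Subsample size is an assumption: match the number of positives,
    # but always leave at least one unlabeled item out to be scored.
    k = max(1, min(len(positives), len(unlabeled) - 1))
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_center = centroid(positives)
    for _ in range(n_rounds):
        sampled = set(rng.sample(range(len(unlabeled)), k))
        neg_center = centroid([unlabeled[i] for i in sampled])
        for i, x in enumerate(unlabeled):
            if i in sampled:
                continue
            # Stand-in for an SVM decision value: positive when x is closer
            # to the positives than to this round's provisional negatives.
            totals[i] += sq_dist(x, neg_center) - sq_dist(x, pos_center)
            counts[i] += 1
    return [t / c if c else float("-inf") for t, c in zip(totals, counts)]


# Toy data: the first unlabeled item is a hidden positive.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabeled = [[1.1, 1.0], [-1.0, -1.0], [-1.2, -0.8],
             [-0.9, -1.1], [-1.1, -1.2], [-0.8, -0.9]]
scores = pu_bagging_scores(positives, unlabeled)
```

Sorting the unlabeled genes by these averaged scores gives a genome-wide ranking without ever requiring confirmed negative examples.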

Results and Comparison with Existing Methods

In several validation experiments, ProDiGe outperforms existing state-of-the-art methods such as Endeavour and PRINCE. On data from the Online Mendelian Inheritance in Man (OMIM) database, it ranks the correct disease gene in the top 5% of the candidate list for 69% of diseases with known causal genes and for 67% of orphan diseases. Performance is assessed with cumulative distribution functions of the ranks of the true genes, and the results show that incorporating disease phenotypic data significantly improves performance compared to generic multitask strategies.
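The rank-based evaluation described above can be sketched in a few lines. The helper names and the toy numbers are hypothetical and are not results from the paper; the sketch only shows how a "top 5%" figure and a cumulative distribution of ranks are computed from per-disease rankings.

```python
def rank_of_true_gene(scores, true_gene):
    """1-based rank of true_gene when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_gene) + 1


def fraction_in_top(ranks_and_sizes, top_fraction=0.05):
    """Share of trials whose true gene falls within the top fraction of its list."""
    hits = sum(1 for rank, size in ranks_and_sizes if rank <= top_fraction * size)
    return hits / len(ranks_and_sizes)


def rank_cdf(ranks_and_sizes, grid):
    """Cumulative distribution of relative ranks, evaluated on a grid of fractions."""
    return [fraction_in_top(ranks_and_sizes, f) for f in grid]


# Toy example: one candidate list, then four held-out trials as (rank, list size).
example_scores = {"g1": 0.2, "g2": 0.9, "g3": 0.5}
example_rank = rank_of_true_gene(example_scores, "g3")

trials = [(1, 100), (4, 100), (6, 100), (50, 100)]
top5 = fraction_in_top(trials)
cdf = rank_cdf(trials, [0.02, 0.05, 0.6, 1.0])
```

A curve that rises faster at small fractions means the true genes concentrate near the top of the lists, which is how two methods are compared on the same trials.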

ProDiGe's multitask approach is especially effective when sharing information between diseases, achieving better rankings than PRINCE without the need to rely solely on protein-protein interaction networks. Notably, when the method is adjusted to use only PPI networks, ProDiGe rivals the results obtained by PRINCE, highlighting its versatility and the efficacy of the machine learning approach, even when specific data sources are restricted.

Scarcity of positive training examples is usually the bottleneck in gene prioritization. In that setting ProDiGe remains robust because it draws on the known genes of phenotypically similar diseases, which shows its potential for discovering causal genes for orphan diseases.
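One way to picture how information flows between diseases is a similarity defined on (disease, gene) pairs as the product of a disease similarity and a gene similarity. The sketch below is an assumption-laden illustration of that idea, not the paper's exact formulation: the similarity tables, gene features, disease names, and the averaging scorer are all hypothetical, and a real kernel method would additionally require both similarities to be valid positive semidefinite kernels.

```python
# Hypothetical phenotypic similarities between distinct diseases.
PHENO_SIM = {
    frozenset(["orphan", "related"]): 0.8,
    frozenset(["orphan", "distant"]): 0.1,
    frozenset(["related", "distant"]): 0.1,
}

# Hypothetical gene feature vectors (for example, derived from expression data).
GENE_FEATURES = {"g1": [1.0, 0.0], "g2": [0.9, 0.1], "g3": [0.0, 1.0]}


def disease_sim(d1, d2):
    """Similarity between two diseases; a disease is fully similar to itself."""
    if d1 == d2:
        return 1.0
    return PHENO_SIM.get(frozenset([d1, d2]), 0.0)


def gene_sim(g1, g2):
    """Dot product of the two genes' feature vectors."""
    return sum(a * b for a, b in zip(GENE_FEATURES[g1], GENE_FEATURES[g2]))


def pair_sim(pair_a, pair_b):
    """Product similarity between two (disease, gene) pairs."""
    (d1, g1), (d2, g2) = pair_a, pair_b
    return disease_sim(d1, d2) * gene_sim(g1, g2)


def score(disease, gene, known_pairs):
    """Mean similarity of a candidate pair to the known (disease, gene) pairs."""
    return sum(pair_sim((disease, gene), p) for p in known_pairs) / len(known_pairs)


# The orphan disease has no known genes of its own.
known = [("related", "g1"), ("distant", "g3")]
score_g2 = score("orphan", "g2", known)
score_g3 = score("orphan", "g3", known)
```

Here the candidate g2 scores higher for the orphan disease because it resembles g1, a known gene of the phenotypically close disease, even though no gene has been linked to the orphan disease itself.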

Theoretical and Practical Implications

Framing gene prioritization as a PU learning problem is a contribution to both bioinformatics and machine learning. The results suggest that treating non-disease genes as unlabeled rather than negative can improve predictive models, an idea that applies to other biological prediction tasks beyond disease gene prioritization.

Practically, ProDiGe's ability to integrate multiple data sources while efficiently prioritizing disease genes presents a comprehensive tool for researchers. The method's validation on numerous diseases, leveraging phenotypic similarities across them, indicates substantial utility for both well-characterized and less understood disorders.

Future Directions

The results endorse further exploration of multitask learning and data integration strategies in complex biological datasets. Future research could enhance phenotype descriptors, incorporate broader genomic contexts, and evaluate additional machine learning algorithms to expand these approaches' applicability and precision. The extension of PU learning frameworks might also catalyze novel insights in other domains requiring prioritization among vast candidate lists under uncertain conditions.

In conclusion, ProDiGe exemplifies an advancement in computational genetics, effectively combining diverse datasets with powerful machine learning strategies to address longstanding challenges in disease gene identification.


Authors (2)

Fantine Mordelet
Jean-Philippe Vert
