- The paper introduces OpenBioLink, a comprehensive benchmarking framework designed to rigorously evaluate biomedical link prediction algorithms.
- It addresses biases by filtering out trivially inferable links, ensuring valid separation between training and testing datasets.
- Baseline tests with TransE and TransR achieved a hits@10 of 7.5%, highlighting significant room for improvement in prediction methods.
OpenBioLink: A Benchmarking Framework for Biomedical Link Prediction
The paper introduces OpenBioLink, a comprehensive benchmarking framework designed explicitly for evaluating link prediction algorithms on large-scale biomedical knowledge graphs. The work underscores the need for a bespoke benchmarking suite that reflects the unique characteristics and demands of the biomedical domain, where general-purpose benchmarks such as FB15K, WN18, and UMLS fall short. OpenBioLink aims to fill this gap with a robust, transparent, and adaptable evaluation platform.
The difficulty of adapting existing biomedical knowledge graphs for effective benchmarking is addressed through careful design choices in OpenBioLink. Graphs such as Bio2RDF contain metadata relations that can bias the measured performance of link prediction algorithms, so trivially inferable links must be excluded when constructing meaningful test sets. By avoiding information leakage between training and test datasets, a common pitfall in prior work, OpenBioLink sets a precedent for domain-specific benchmarks and enables more valid assessments of predictive capability.
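The leakage-avoidance idea can be sketched as follows. This is a minimal illustration, not OpenBioLink's actual pipeline: the triples, relation names, and the inverse-relation map are hypothetical, and the real framework applies additional filters.

```python
# Sketch: when selecting test triples, discard any candidate whose
# inverse edge already appears in the training set, since a model
# could trivially infer it. Illustrative data, not OpenBioLink's schema.

def split_without_leakage(triples, test_candidates, inverse_of):
    """Return (train, test) with trivially inferable test triples removed."""
    train = set(triples) - set(test_candidates)
    clean_test = []
    for head, rel, tail in test_candidates:
        inv = inverse_of.get(rel)
        if inv is not None and (tail, inv, head) in train:
            continue  # inverse edge in training leaks the answer; drop it
        clean_test.append((head, rel, tail))
    return sorted(train), clean_test

triples = [
    ("drugA", "treats", "disease1"),
    ("disease1", "treated_by", "drugA"),  # inverse of the edge above
    ("geneX", "associated_with", "disease2"),
]
inverse_of = {"treats": "treated_by", "treated_by": "treats"}
train, test = split_without_leakage(
    triples, [("drugA", "treats", "disease1")], inverse_of
)
# the candidate is dropped: its inverse remains in the training set
```

The same principle generalizes to other "trivially inferable" patterns, such as symmetric relations or edges duplicated across source databases.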
Methodological Advances and Baseline Results
The paper establishes initial baselines with traditional graph embedding models, TransE and TransR. Despite their simplicity, both models were tuned via hyperparameter search and then evaluated on the OpenBioLink dataset, yielding a hits@10 of 7.5%. This demonstrates the benchmark's feasibility while leaving significant room for improvement; future evaluations with more sophisticated approaches, such as meta-path-based and scalable rule-learning methods, are envisaged.
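TransE's core idea is that a valid triple (h, r, t) should satisfy an approximate translation h + r ≈ t in embedding space. A minimal sketch, with randomly generated vectors standing in for trained embeddings:

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """TransE plausibility: smaller distance ||h + r - t|| means more plausible."""
    return -np.linalg.norm(h + r - t, ord=norm)

rng = np.random.default_rng(0)
dim = 32
h, r = rng.normal(size=dim), rng.normal(size=dim)
t_true = h + r + rng.normal(scale=0.01, size=dim)  # near-perfect translation
t_rand = rng.normal(size=dim)                      # unrelated entity
assert transe_score(h, r, t_true) > transe_score(h, r, t_rand)
```

TransR follows the same translational principle but first projects entities into a relation-specific space, which is why it can capture relations that TransE's single shared space cannot.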
OpenBioLink's datasets and benchmarks are available through an open-source platform, enhancing reproducibility and accessibility. Data is provided under multiple quality-filter settings to accommodate the needs of different prediction methods, and the framework supports several evaluation metrics, including hits@k, MRR, ROC AUC, and PR AUC, for comprehensive performance assessment across models.
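Two of the ranking metrics mentioned above are simple to state precisely. Given the rank of the true entity among all candidates for each test triple, hits@k is the fraction of queries ranked in the top k, and MRR averages the reciprocal ranks. A sketch with hypothetical ranks (not results from the paper):

```python
def hits_at_k(ranks, k=10):
    """Fraction of test queries whose true entity ranks within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the true entity for four test triples.
ranks = [1, 4, 12, 50]
print(hits_at_k(ranks, k=10))       # 0.5 (two of four ranked in the top 10)
print(mean_reciprocal_rank(ranks))  # about 0.338
```

ROC AUC and PR AUC, by contrast, treat link prediction as binary classification over scored positive and negative triples; PR AUC is typically the more informative of the two when negatives vastly outnumber positives, as they do in sparse biomedical graphs.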
Significance and Future Directions
OpenBioLink gives researchers a solid foundation for evaluating and developing link prediction algorithms tailored to the intricacies of biomedical knowledge graphs. This is crucial because biomedical databases often encompass richly structured ontological hierarchies and complex interaction networks that challenge traditional link prediction approaches.
The establishment of OpenBioLink as a baseline facilitates the objective measurement of algorithmic progress in biomedical link prediction. By organizing annual benchmarking events, the framework encourages continuous community involvement and refinement. This collective effort aims to bolster the development of novel algorithms that can more effectively leverage biomedical knowledge bases for hypothesis generation and potentially expedite biomedicine's progress through improved decision support systems.
In summary, OpenBioLink is not only a much-needed benchmarking tool but also a catalyst for collaborative research, ultimately enhancing the ability to perform meaningful link prediction within the biomedical domain. Future iterations of the benchmark are likely to integrate additional resources, such as Hetionet, broadening the framework's applicability and appeal. Such developments hold the promise of driving substantial advances in both the theoretical and practical landscapes of AI in biomedicine.