Learning from positive and unlabeled data: a survey (1811.04820v3)

Published 12 Nov 2018 in cs.LG and stat.ML

Abstract: Learning from positive and unlabeled data or PU learning is the setting where a learner only has access to positive examples and unlabeled data. The assumption is that the unlabeled data can contain both positive and negative examples. This setting has attracted increasing interest within the machine learning literature as this type of data naturally arises in applications such as medical diagnosis and knowledge base completion. This article provides a survey of the current state of the art in PU learning. It proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.

Citations (504)

Summary

  • The paper reviews state-of-the-art PU learning methodologies to tackle binary classification with incomplete labels.
  • It details three main approaches—two-step techniques, biased learning, and class prior incorporation—highlighting their assumptions and practical applications.
  • The survey evaluates challenges in model assessment and class prior estimation, offering actionable insights for advancing PU learning research.

Learning From Positive and Unlabeled Data: A Survey

The paper "Learning From Positive and Unlabeled Data" by Jessa Bekker and Jesse Davis serves as an extensive survey on the PU learning paradigm. PU learning is significant in scenarios where only positive and unlabeled data are available, and this data may also contain negative examples. The survey encapsulates state-of-the-art methodologies and research questions, providing a structured perspective on handling PU data.

Core Concepts in PU Learning

The authors introduce PU learning as a variant of binary classification in which only positive and unlabeled examples are accessible for training; the unlabeled data may harbor both positive and negative instances. This form of data is prevalent in domains such as medical diagnosis and knowledge base completion, necessitating machine learning approaches that are robust to the complete absence of labeled negative examples.

Key Research Questions

The paper organizes the field around seven key research questions, examining topics such as problem formalization, typical assumptions about the data, and frameworks for model training and evaluation. The survey addresses each question systematically, highlighting both practical and theoretical implications across applications.

Learning Methods

Central to the discussion are three main categories of PU learning approaches:

  1. Two-Step Techniques: These methods start with identifying reliable negative instances from the unlabeled data, followed by applying standard supervised or semi-supervised learning techniques using the identified negatives and the known positives. The techniques often rely on separability and smoothness assumptions about the data.
  2. Biased Learning: Here, the entire unlabeled set is treated as negatively labeled, with the understanding that some of those labels are noisy. Learning models assign different penalties to misclassified positives and negatives to account for this label uncertainty (a minimal sketch of the idea follows this list).
  3. Class Prior Incorporation: These methods utilize the class prior, which informs the probability distribution of classes, to adapt traditional learning algorithms for PU contexts. This includes modifying datasets to emulate fully labeled ones or altering algorithms to consider expected class distributions.
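
To ground the biased-learning idea, here is a minimal sketch that treats all unlabeled examples as negatives while penalizing misclassified positives more heavily. The logistic-regression base learner and the pos_weight value are illustrative choices, not prescribed by the survey; in practice the weight would be tuned, e.g. with a PU-compatible criterion like the one sketched under evaluation below.

```python
# Biased PU learning (sketch): treat unlabeled as negative, but make
# missing a known positive costlier than mislabeling an unlabeled point.
# The base learner and pos_weight are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def biased_pu_classifier(X_pos, X_unlabeled, pos_weight=10.0):
    """Fit positives vs. unlabeled-as-negative with asymmetric penalties."""
    X = np.vstack([X_pos, X_unlabeled])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unlabeled))])
    clf = LogisticRegression(class_weight={1: pos_weight, 0: 1.0},
                             max_iter=1000)
    clf.fit(X, y)
    return clf
```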

Evaluation of PU Models

Assessing model performance in a PU setting is non-trivial. The paper discusses methods to compute evaluation metrics that are typically challenging to derive from positive and unlabeled data alone. This includes leveraging the SCAR (Selected Completely At Random) assumption, among others, to estimate traditional metrics such as accuracy and the F1 score.
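
As an example of such an estimator, the sketch below computes the model-selection criterion r² / Pr(ŷ = 1) due to Lee and Liu (2003), which is computable from a PU validation set under SCAR: the labeled positives are then an unbiased sample of all positives, so recall can be estimated on them directly, and the criterion equals precision · recall / Pr(y = 1), rising and falling with the product of precision and recall much like the F1 score. The function signature is an illustrative assumption.

```python
# PU-estimable model-selection criterion r^2 / Pr(y_hat = 1)
# (Lee & Liu, 2003). Under SCAR the labeled positives are an unbiased
# sample of all positives, so recall is estimable on them; the ratio
# is proportional to precision * recall and tracks F1.
def pu_selection_criterion(clf, X_labeled_pos, X_all):
    """Estimate r^2 / Pr(y_hat = 1) from a PU validation set."""
    recall = clf.predict(X_labeled_pos).mean()   # r, on labeled positives
    pred_pos_rate = clf.predict(X_all).mean()    # Pr(y_hat = 1) on all data
    return 0.0 if pred_pos_rate == 0 else recall**2 / pred_pos_rate
```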

Assumptions in PU Learning

A significant portion of the survey is dedicated to assumptions that facilitate learning in PU scenarios. These include:

  • Label Mechanism Assumptions: Mainly the SCAR and SAR (Selected At Random) assumptions. Under SCAR, every positive example has the same constant probability of being labeled; under SAR, that probability may depend on the example's attributes (a short simulation of SCAR follows this list).
  • Data Assumptions: These cover concepts like separability, where the classes are assumed to be distinctly separable by a function, and smoothness, which implies that nearby instances have similar class probabilities.
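
The SCAR mechanism is easy to simulate, which also makes the PU setting itself concrete. The sketch below converts fully supervised labels into PU labels by keeping each positive's label with a constant probability c; the value c = 0.3 is an arbitrary illustrative choice. Making c depend on the features instead would yield SAR data.

```python
# SCAR labeling (sketch): each positive is labeled with the same
# constant probability c (the "label frequency"), independent of its
# features; c = 0.3 is an arbitrary illustrative value.
import numpy as np

rng = np.random.default_rng(0)

def scar_labels(y, c=0.3):
    """Turn supervised labels y in {0, 1} into PU labels s.

    s = 1 means "labeled positive"; s = 0 means "unlabeled", i.e. a
    mix of hidden positives and true negatives.
    """
    s = np.zeros_like(y)
    pos = (y == 1)
    s[pos] = rng.random(pos.sum()) < c   # keep each positive's label w.p. c
    return s
```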

Class Prior Estimation

Estimating the class prior directly from PU data is a complex task because, without further assumptions, the prior is not identifiable. The paper reviews multiple estimation methods, each valid under different assumptions, which in turn make it possible to leverage the class prior for effective learning in diverse applications.
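
One canonical estimator from this literature, due to Elkan and Noto (2008), is valid under SCAR: a "non-traditional" classifier g trained to predict the observed label s satisfies g(x) ≈ Pr(s = 1 | x) = c · Pr(y = 1 | x), so averaging g over held-out labeled positives estimates the label frequency c, and the class prior follows as Pr(y = 1) = Pr(s = 1) / c. The sketch below is a minimal rendering of that idea; the logistic-regression model and 70/30 split are illustrative choices.

```python
# Class prior estimation under SCAR (sketch of Elkan & Noto, 2008).
# The base model and split size are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def estimate_class_prior(X, s):
    """Estimate Pr(y = 1) from features X and PU labels s (1 = labeled)."""
    X_tr, X_val, s_tr, s_val = train_test_split(X, s, test_size=0.3,
                                                random_state=0)
    g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
    # Label frequency c = Pr(s = 1 | y = 1), estimated as the mean
    # predicted labeling probability over held-out labeled positives.
    c = g.predict_proba(X_val[s_val == 1])[:, 1].mean()
    return s.mean() / c   # Pr(y = 1) = Pr(s = 1) / c
```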

Applications and Future Directions

Applications cited range from medical diagnosis to automated knowledge base completion, highlighting the versatility of PU learning in real-world settings. For future research, the paper points to improving evaluation methods, better characterizing how PU data arises in practice, and exploring learning in relational domains as promising directions.

Conclusion

This survey is an essential resource for researchers interested in PU learning, comprehensively covering existing methods while pointing out gaps and opportunities for advancing the field. Through careful treatment of theoretical foundations and practical evaluation, the paper provides a compass for future work on learning from positive and unlabeled data.