A bagging SVM to learn from positive and unlabeled examples (1010.0772v1)

Published 5 Oct 2010 in stat.ML

Abstract: We consider the problem of learning a binary classifier from a training set of positive and unlabeled examples, both in the inductive and in the transductive setting. This problem, often referred to as PU learning, differs from the standard supervised classification problem by the lack of negative examples in the training set. It corresponds to a ubiquitous situation in many applications such as information retrieval or gene ranking, when we have identified a set of data of interest sharing a particular property, and we wish to automatically retrieve additional data sharing the same property among a large and easily available pool of unlabeled data. We propose a conceptually simple method, akin to bagging, to approach both inductive and transductive PU learning problems, by converting them into series of supervised binary classification problems discriminating the known positive examples from random subsamples of the unlabeled set. We empirically demonstrate the relevance of the method on simulated and real data, where it performs at least as well as existing methods while being faster.

Citations (270)

Summary

  • The paper introduces a bagging SVM method that transforms PU learning into a series of binary classification tasks.
  • It strategically subsamples unlabeled data to address contamination variability and enhance classifier stability.
  • Empirical results on text retrieval and gene inference demonstrate improved performance over biased SVM alternatives.

A Bagging SVM Approach for Positive-Unlabeled Learning

The paper introduces a novel approach to the problem of learning from positive and unlabeled (PU) data. Its core contribution is the adaptation of bagging, an ensemble method typically used in fully supervised settings, to the PU context. Unlike standard supervised classification, PU learning works from a training set containing only positive and unlabeled instances, so the central difficulty is the absence of explicit negative examples.

Conceptual Framework

The authors propose a conceptually simple method, akin to bagging, for both inductive and transductive PU learning. By recasting the problem as a series of supervised binary classification tasks, they reduce the challenge to discriminating the known positive examples from random subsamples of the unlabeled data. This reformulation directly confronts the variability of the contamination rate, that is, the fluctuating proportion of hidden positives in each unlabeled subsample, which significantly impacts classification performance.

The proposed method, termed "bagging SVM," aggregates classifiers, each trained to distinguish the known positives from a random subsample of the unlabeled pool. As in classical bagging, the approach deliberately exploits the instability of the base classifiers: subsamples with differing contamination rates yield diverse classifiers, an instability typical of PU settings, and aggregating them reduces variance.
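
A minimal sketch of this aggregation scheme in Python with scikit-learn is shown below. The function name bagging_svm_scores, the out-of-bag scoring rule, the linear kernel, and the default K = |P| are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.svm import SVC

def bagging_svm_scores(X_pos, X_unl, T=100, K=None, rng=None):
    """Score unlabeled points by aggregating T SVMs, each trained to
    separate the known positives from a random size-K subsample of the
    unlabeled pool (assumes K < len(X_unl))."""
    rng = np.random.default_rng(rng)
    K = K if K is not None else len(X_pos)   # default K = |P| (an assumption)
    n_unl = len(X_unl)
    scores = np.zeros(n_unl)
    counts = np.zeros(n_unl)                 # times each point was out-of-bag
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(K)])
    for _ in range(T):
        idx = rng.choice(n_unl, size=K, replace=False)
        X_train = np.vstack([X_pos, X_unl[idx]])
        clf = SVC(kernel="linear").fit(X_train, y)
        # Score only the unlabeled points left out of this training round.
        oob = np.setdiff1d(np.arange(n_unl), idx)
        scores[oob] += clf.decision_function(X_unl[oob])
        counts[oob] += 1
    return scores / np.maximum(counts, 1)    # average out-of-bag decision values
```

Scoring each unlabeled point only with classifiers that did not train on it mirrors the out-of-bag idea from bagging and suits the transductive setting, where the goal is to rank the unlabeled pool itself.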

Methodology

Like the biased SVM and weighted logistic regression baselines, bagging SVM assigns different weights to false negatives and false positives, accounting for the asymmetry inherent in PU learning: errors on the few known positives matter more than errors on unlabeled points treated as tentative negatives. The novelty, however, lies in the strategic subsampling of the unlabeled data.
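
As a hedged illustration of such asymmetric weighting, the snippet below uses scikit-learn's class_weight parameter; the toy data and the 10:1 cost ratio are arbitrary choices for demonstration, not values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                     # toy feature matrix
y = np.concatenate([np.ones(20), -np.ones(40)])  # 1 = positive, -1 = unlabeled

# Biased-SVM-style asymmetric costs: misclassifying a known positive is
# penalized 10x more than misclassifying an unlabeled point treated as a
# tentative negative (the 10:1 ratio is purely illustrative).
clf = SVC(kernel="linear", class_weight={1: 10.0, -1: 1.0}).fit(X, y)
```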

The key parameter of the framework is the size K of each subsample drawn from the unlabeled set, which governs the trade-off between the robustness and the variance of the base classifiers. Smaller values of K increase variance, because the proportion of hidden positives contaminating each subsample fluctuates more from draw to draw; this added diversity is precisely what the aggregation step averages away.
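
As a usage example building on the bagging_svm_scores sketch above, one could probe this trade-off by sweeping K on synthetic data; all values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(30, 5))   # synthetic positive examples
X_unl = rng.normal(0.0, 1.0, size=(300, 5))  # synthetic unlabeled pool

# Smaller K -> more variable contamination per subsample -> more diverse,
# higher-variance base classifiers; larger K -> more stable base classifiers.
for K in (10, 30, 100):
    s = bagging_svm_scores(X_pos, X_unl, T=50, K=K, rng=1)
    print(f"K={K}: mean aggregated score {s.mean():+.3f}")
```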

Empirical Evaluation

Empirical validation on both simulated and real-world datasets, namely text retrieval on the 20 Newsgroups corpus and gene regulatory network inference in Escherichia coli, illustrates the practical relevance of bagging SVM. On simulated data it outperforms the biased SVM by achieving a better bias-variance trade-off through an appropriate choice of K. On the real-world tasks it matches or exceeds the performance of standard PU methods while being considerably faster, since each base classifier is trained on a much smaller set. Bagging SVM proved especially robust in scenarios with a high ratio of unlabeled to positive examples, where its reduced training-set requirements translate into substantial computational savings.

Implications and Future Prospects

The method holds significant potential for applications where labeled negative examples are infeasible or costly to obtain, such as information retrieval and biological data classification. The paper also advances the understanding of how the stability of a learning algorithm, as affected by contamination variability, can be deliberately managed and exploited.

Future research could extend the bagging idea beyond SVMs to other base classifiers that might benefit from subsample aggregation, or optimize the selection of parameters such as K through adaptive strategies. Applying the methodology to large-scale datasets, or integrating it with deep learning frameworks, could further broaden its impact across domains.

In summary, the paper makes a significant contribution to the field of PU learning by introducing a computationally efficient and theoretically grounded approach that addresses the unique challenges, and exploits the opportunities, presented by such datasets.