
Prediction-Oriented Bayesian Active Learning

Published 17 Apr 2023 in cs.LG and stat.ML | (2304.08151v1)

Abstract: Information-theoretic approaches to active learning have traditionally focused on maximising the information gathered about the model parameters, most commonly by optimising the BALD score. We highlight that this can be suboptimal from the perspective of predictive performance. For example, BALD lacks a notion of an input distribution and so is prone to prioritise data of limited relevance. To address this we propose the expected predictive information gain (EPIG), an acquisition function that measures information gain in the space of predictions rather than parameters. We find that using EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models, and thus provides an appealing drop-in replacement.


Summary

  • The paper introduces EPIG, a novel acquisition function that directly targets predictive uncertainty instead of parameter uncertainty like BALD.
  • It derives EPIG from Bayesian experimental design, incorporating the target input distribution to focus on relevant predictive tasks.
  • Extensive experiments on synthetic, UCI, and MNIST datasets demonstrate EPIG’s superior performance and robustness over BALD.


This paper introduces the expected predictive information gain (EPIG) as a novel acquisition function for Bayesian active learning, addressing the limitations of the commonly used Bayesian active learning by disagreement (BALD) score. It highlights how BALD, which maximizes information gain about model parameters, can be suboptimal for predictive performance due to its lack of consideration for the input distribution and potential irrelevance to the predictive task. The authors propose EPIG, which directly targets information gain in predictions, making it a more suitable replacement for BALD in various scenarios.

Background and Motivation

The paper begins by reviewing the principles of active learning and Bayesian experimental design, emphasizing the role of probabilistic generative models in Bayesian active learning. It notes that BALD, while widely used, maximizes information gain in the model parameters, which need not align with the goal of making accurate predictions on unseen inputs. The authors argue that BALD lacks awareness of the input distribution, leading it to prioritize irrelevant data, especially in real-world datasets where input relevance varies. In such cases, the paper claims, BALD can be actively counterproductive, selecting obscure inputs that do not contribute to predictive performance.

Figure 1: The expected predictive information gain (EPIG) can differ dramatically from the expected information gain in the model parameters (BALD).

The paper uses the example of a supervised-learning problem where $x, y \in \mathbb{R}$ with a Gaussian process prior to demonstrate that a high BALD score need not coincide with any reduction in the predictive uncertainty of interest, $\mathrm{EIG}_{\theta(x_*)}$.
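For concreteness, BALD is typically estimated from Monte Carlo predictive samples as the marginal predictive entropy minus the expected per-sample entropy. A minimal NumPy sketch for a classifier (the function name and array shapes are illustrative, not taken from the paper):

```python
import numpy as np

def bald_score(probs):
    """BALD score from Monte Carlo predictive samples.

    probs: array of shape (K, C) -- class probabilities p(y | x, theta_i)
    for K posterior parameter samples theta_i and C classes.
    Returns H[mean predictive] - mean over i of H[p(y | x, theta_i)].
    """
    eps = 1e-12  # guard against log(0)
    mean_probs = probs.mean(axis=0)  # p(y | x) ~= (1/K) sum_i p(y | x, theta_i)
    marginal_entropy = -np.sum(mean_probs * np.log(mean_probs + eps))
    conditional_entropy = -np.sum(probs * np.log(probs + eps), axis=1).mean()
    return marginal_entropy - conditional_entropy
```

When the posterior samples disagree maximally the score approaches the entropy of the label marginal, and when they all agree it is zero, so BALD measures parameter disagreement rather than relevance to any downstream prediction.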

Expected Predictive Information Gain (EPIG)

To overcome BALD's limitations, the authors derive EPIG from the Bayesian experimental design framework. Unlike BALD, which targets parameter uncertainty, EPIG directly targets predictive uncertainty on inputs of interest. This is achieved by introducing a target input distribution, $p_*(x_*)$, and defining the goal as confident prediction of the labels, $y_*$, associated with samples $x_* \sim p_*(x_*)$.

EPIG is defined as the expected reduction in predictive uncertainty at a randomly sampled target input, $x_*$. It admits an alternative interpretation as the mutual information between $(x_*, y_*)$ and $y$ given $x$, denoted $\mathrm{I}[(x_*, y_*); y \mid x]$. The paper also presents a frequentist perspective, showing that in classification settings EPIG is equivalent (up to a constant) to the negative expected generalization error under a cross-entropy loss.

Figure 2: Active learning typically loops over selecting a query, acquiring a label and updating the model parameters.
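The loop described in this caption can be sketched abstractly as follows; `fit`, `acquisition_score`, and `oracle_label` are hypothetical placeholders standing in for model training, acquisition scoring (e.g. BALD or EPIG), and the labelling oracle, not functions from the paper:

```python
def active_learning_loop(pool, labelled, fit, acquisition_score, oracle_label, budget):
    """Pool-based active learning: repeatedly query, label, and refit."""
    model = fit(labelled)
    for _ in range(budget):
        # Select the pool point with the highest acquisition score
        query = max(pool, key=lambda x: acquisition_score(model, x))
        pool.remove(query)
        labelled.append((query, oracle_label(query)))  # acquire a label
        model = fit(labelled)                          # update the model
    return model
```

The acquisition function is the only component the paper changes; swapping BALD for EPIG leaves the rest of the loop untouched, which is what makes it a drop-in replacement.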

Implementation Details

The practical implementation of EPIG involves estimating the expectation with respect to the target input distribution, $p_*(x_*)$, using Monte Carlo methods. The paper discusses different scenarios for sampling $x_*$: subsampling from a pool of unlabeled inputs, using samples from $p_*(x_*)$ when available, and approximating $p_*(x_*)$ using the model and the pool in classification problems.
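The first two of those scenarios could be sketched as below; `sample_targets` is a hypothetical helper, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_targets(pool, M, target_samples=None, rng=rng):
    """Return M target inputs x_*^j for the Monte Carlo EPIG estimate.

    pool: (N, D) array of unlabeled inputs.
    target_samples: optional (T, D) array of genuine draws from p_*(x_*).
    """
    if target_samples is not None:
        # Use samples from p_*(x_*) directly when they are available
        idx = rng.choice(len(target_samples), size=M, replace=True)
        return target_samples[idx]
    # Otherwise fall back to subsampling the unlabeled pool,
    # implicitly treating the pool as representative of p_*(x_*)
    idx = rng.choice(len(pool), size=M, replace=True)
    return pool[idx]
```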

The authors present two estimators for EPIG. The first, suited to classification, is

$$\mathrm{EPIG}(x) \approx \frac{1}{M} \sum_{j=1}^{M} \mathrm{KL}\!\left[\hat{p}_\phi(y, y_* \mid x, x_*^j) \,\middle\|\, \hat{p}_\phi(y \mid x)\, \hat{p}_\phi(y_* \mid x_*^j)\right],$$

where $x_*^j \sim p_*(x_*)$ and the $\hat{p}$ terms are Monte Carlo approximations of the predictive distributions. The second estimator, for cases where integration over $y$ and $y_*$ is not analytical, uses nested Monte Carlo estimation:

$$\mathrm{EPIG}(x) \approx \frac{1}{M} \sum_{j=1}^{M} \log \frac{K \sum_{i=1}^{K} p_\phi(y^j \mid x, \theta_i)\, p_\phi(y_*^j \mid x_*^j, \theta_i)}{\sum_{i=1}^{K} p_\phi(y^j \mid x, \theta_i) \sum_{i=1}^{K} p_\phi(y_*^j \mid x_*^j, \theta_i)},$$

where $x_*^j \sim p_*(x_*)$, $y^j, y_*^j \sim p_\phi(y, y_* \mid x, x_*^j)$ and $\theta_i \sim p_\phi(\theta)$.

The paper acknowledges that the EPIG estimators have a computational cost of $O(MK)$, comparable to BALD estimation for regression but potentially more expensive for classification, where the estimator cannot be collapsed into a non-nested Monte Carlo estimate.
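Assuming $K$ posterior samples whose per-class probabilities are available as arrays, the two estimators might be sketched as follows (function names and array shapes are illustrative, not from the paper's code):

```python
import numpy as np

def epig_full(probs_pool, probs_targets):
    """First estimator (classification): KL between the joint and the
    product-of-marginals Monte Carlo predictive distributions.

    probs_pool: (K, C) -- p(y | x, theta_i) for K posterior samples theta_i.
    probs_targets: (K, M, C) -- p(y_* | x_*^j, theta_i) for M targets x_*^j.
    """
    eps = 1e-12
    K = probs_pool.shape[0]
    # Joint predictive: (1/K) sum_i p(y | x, theta_i) p(y_* | x_*^j, theta_i)
    joint = np.einsum('kc,kmd->mcd', probs_pool, probs_targets) / K
    # Product of marginal predictives: p(y | x) p(y_* | x_*^j)
    indep = probs_pool.mean(0)[None, :, None] * probs_targets.mean(0)[:, None, :]
    kl = np.sum(joint * (np.log(joint + eps) - np.log(indep + eps)), axis=(1, 2))
    return kl.mean()  # average over the M target samples

def epig_nested_mc(probs_pool, probs_targets, rng):
    """Second estimator: nested Monte Carlo, sampling (y^j, y_*^j) from the
    joint predictive instead of summing over all label pairs."""
    K, M, C = probs_targets.shape
    total = 0.0
    for j in range(M):
        joint = probs_pool.T @ probs_targets[:, j, :] / K  # (C, C)
        idx = rng.choice(C * C, p=joint.ravel() / joint.sum())
        y, y_star = divmod(idx, C)                         # sampled (y^j, y_*^j)
        num = K * np.sum(probs_pool[:, y] * probs_targets[:, j, y_star])
        den = probs_pool[:, y].sum() * probs_targets[:, j, y_star].sum()
        total += np.log(num / den)
    return total / M
```

A useful sanity check on either sketch: when all $K$ posterior samples agree, the joint predictive factorizes into the product of marginals and both estimates collapse to zero, reflecting that no candidate label can then be informative about the targets.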

Experimental Results

The paper presents a comprehensive empirical evaluation of EPIG, comparing it against BALD across various datasets and models. The experiments include synthetic data, UCI datasets, and MNIST variations, demonstrating EPIG's superior or comparable performance in different scenarios.

The synthetic-data experiments visually illustrate BALD's tendency to acquire labels at the extrema of the input space, whereas EPIG focuses on regions relevant to the predictive task. The UCI experiments show EPIG outperforming BALD on several classification problems, validating its effectiveness in broader settings.

Figure 3: BALD can fail catastrophically on big pools.

The MNIST experiments include Curated MNIST, Unbalanced MNIST, and Redundant MNIST, designed to capture challenges in applying deep neural networks to high-dimensional inputs. EPIG consistently outperforms BALD, especially on Redundant MNIST, suggesting its usefulness when working with diverse pools.

Figure 4: In contrast with BALD, EPIG deals effectively with a big pool ($10^5$ unlabelled inputs).

Figure 5: EPIG outperforms or matches BALD across three standard classification problems from the UCI machine-learning repository (Magic, Satellite and Vowels) and two models (random forest and neural network).

Figure 6: EPIG outperforms BALD across three image-classification settings.

Figure 7: Even without knowledge of the target input distribution, $p_*(x_*)$, EPIG retains its strong performance on Curated MNIST, Unbalanced MNIST and Redundant MNIST.

Furthermore, the paper investigates the sensitivity of EPIG to knowledge of the target input distribution, $p_*(x_*)$. The results show that EPIG retains strong performance even when this knowledge is limited or absent, demonstrating its robustness in practical scenarios.

Figure 8: EPIG outperforms two acquisition functions popularly used as baselines in the active-learning literature.

The paper discusses related work in Bayesian experimental design and active learning, highlighting the historical focus on maximizing information gain in the model parameters. It acknowledges MacKay's (1992) introduction of the mean marginal information gain and discusses other prediction-oriented methods for active learning.

Conclusion

This paper makes a strong case for EPIG as a more effective acquisition function for Bayesian active learning compared to BALD, particularly in prediction-oriented settings. The authors demonstrate EPIG's ability to target information gain in predictions, leading to improved performance across various datasets and models. The paper concludes that EPIG can serve as a compelling drop-in replacement for BALD, with potential for significant performance gains when dealing with large, diverse pools of unlabeled data.
