
Yes–No Bundle with Negative Risk (YNB-NR)

Updated 12 November 2025
  • YNB-NR is a phenomenon in positive-unlabeled learning where negative empirical risks lead to overfitting in highly flexible models.
  • The non-negative risk estimator replaces the negative risk term with its non-negative counterpart, ensuring the empirical risk remains bounded and reliable.
  • Empirical results on deep models and large-scale datasets demonstrate that nnPU significantly improves classifier robustness compared to unbiased estimators.

A Yes–No Bundle with Negative Risk (abbreviated as YNB-NR, Editor's term) refers to the situation in positive-unlabeled (PU) learning where a binary classifier, trained to distinguish "yes" (positive) examples from both "no" (negative) and unlabeled examples, suffers from negative-valued empirical risk due to the structure of standard unbiased PU risk estimators. This negative-risk phenomenon is especially prevalent when employing highly flexible models (e.g., deep neural networks) and can lead to severe overfitting, undermining the reliability of the resulting classifier. The non-negative risk estimator (nnPU) was proposed to address the YNB-NR pathology by replacing the core risk term with its non-negative part, restoring robustness without loss in statistical consistency (Kiryo et al., 2017).

1. PU Classification and the Structure of Risk Estimators

In the PU learning framework, the objective is to learn a binary decision function $g: \mathbb{R}^d \to \mathbb{R}$ from positive ($P$) and unlabeled ($U$) data, given a class prior $p = \mathbb{P}(Y=+1)$ that is assumed known or estimated in advance. The standard risk for binary classification decomposes as:

$$R(g) = p\, R^+(g) + n\, R^-(g), \qquad n = 1 - p,$$

where $R^+(g) = \mathbb{E}_{X \sim p(x)}[\ell(g(X),+1)]$ and $R^-(g) = \mathbb{E}_{X \sim n(x)}[\ell(g(X),-1)]$, with $p(x)$ and $n(x)$ the positive and negative class-conditional densities and $\ell(\cdot,\cdot)$ the loss function.
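The unbiased estimator below rests on the standard PU identity for the unavailable negative-class term. As a short derivation in the notation above (writing $m(x) = p\, p(x) + n\, n(x)$ for the marginal density of $X$):

```latex
% Rewriting the negative-class risk via positive and unlabeled expectations,
% using the mixture identity m(x) = p * p(x) + n * n(x):
\begin{aligned}
n\, R^-(g)
  &= \int \ell(g(x), -1)\, n\, n(x)\, \mathrm{d}x \\
  &= \int \ell(g(x), -1)\, \bigl( m(x) - p\, p(x) \bigr)\, \mathrm{d}x \\
  &= \mathbb{E}_{X \sim m(x)}\!\bigl[ \ell(g(X), -1) \bigr]
     - p\, \mathbb{E}_{X \sim p(x)}\!\bigl[ \ell(g(X), -1) \bigr].
\end{aligned}
```

Both expectations on the last line are estimable from the unlabeled and positive samples respectively, which is what makes an unbiased PU risk possible in the first place.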

In practice, only $P$ and $U$ samples are available. The unbiased PU empirical risk is constructed as:

$$\widehat R_{\rm PU}(g) = p\, \widehat R^+ - p\, \widehat R^- + \widehat U^-,$$

with empirical averages $\widehat R^+$, $\widehat R^-$, $\widehat U^-$ computed from the positive and unlabeled samples: $\widehat R^+$ and $\widehat R^-$ average $\ell(\cdot,+1)$ and $\ell(\cdot,-1)$ over the positive sample, while $\widehat U^-$ averages $\ell(\cdot,-1)$ over the unlabeled sample. Crucially, the $-p\, \widehat R^-$ subtraction allows $\widehat R_{\rm PU}(g)$ to become negative and arbitrarily small, especially with expressive models and unbounded loss functions.
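A minimal NumPy sketch of the unbiased estimator, with the sigmoid surrogate loss. Function names and the toy data are illustrative (not from the paper's reference code); the "memorizer" classifier stands in for an over-flexible model that fits the positive sample exactly:

```python
import numpy as np

def sigmoid_loss(margin):
    # Sigmoid loss with margin = y * g(x); satisfies l(z, +1) + l(z, -1) = 1.
    return 1.0 / (1.0 + np.exp(margin))

def upu_risk(g, x_pos, x_unl, prior):
    """Unbiased PU risk: p*R^+ - p*R^- + U^- (all empirical averages)."""
    r_plus = sigmoid_loss(+g(x_pos)).mean()   # R^+: positive-label loss on P
    r_minus = sigmoid_loss(-g(x_pos)).mean()  # R^-: negative-label loss on P
    u_minus = sigmoid_loss(-g(x_unl)).mean()  # U^-: negative-label loss on U
    return prior * r_plus - prior * r_minus + u_minus

rng = np.random.default_rng(0)
x_p = rng.normal(+1.0, 1.0, 500)                     # positive sample
x_u = np.concatenate([rng.normal(+1.0, 1.0, 500),
                      rng.normal(-1.0, 1.0, 500)])   # unlabeled mixture, p = 0.5

# A "memorizer" that outputs +50 exactly on the training positives and -50
# elsewhere, mimicking an over-flexible model:
memorize = lambda x: np.where(np.isin(np.round(x, 6), np.round(x_p, 6)), 50.0, -50.0)

print(upu_risk(memorize, x_p, x_u, prior=0.5))       # well below zero
```

On this degenerate classifier $\widehat R^+ \approx 0$, $\widehat R^- \approx 1$, and $\widehat U^- \approx 0$, so the estimate collapses to roughly $-p$, illustrating how the subtraction term can be gamed.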

2. Negative Risk and Overfitting Pathology

The capacity of $\widehat R_{\rm PU}(g)$ to attain negative values (the negative-risk issue) is central to the YNB-NR pathology. Unlike the true risk, which satisfies $R(g) \geq 0$, the empirical risk is unbounded below for flexible models: the $-p\, \widehat R^-$ term rewards driving $\widehat R^-$, the negative-label loss on the positive data, arbitrarily high, sometimes at the expense of generalization. This leads to severe overfitting, wherein the classifier fits noise or outliers in the positive set, driving the empirical risk below the meaningful range.

Empirical results with deep models (e.g., multi-layer perceptrons with ReLU or Softsign activations, CNNs) demonstrate that, under the unbiased estimator, the training loss can decrease below zero while the test loss increases, exposing the overfitting induced by the negative empirical risk.

3. Non-Negative Risk Estimator

To address the negative-risk phenomenon, Kiryo et al. propose the non-negative risk estimator:

$$\widetilde R_{\rm PU}(g) = p\, \widehat R^+ + \max\left\{ 0,\; \widehat U^- - p\, \widehat R^- \right\}.$$

Here, the term $\widehat U^- - p\, \widehat R^-$ estimates the contribution of the unseen negative data; by thresholding at zero, the estimator ensures the empirical risk remains non-negative, matching the theoretical lower bound of the true risk. This modification both enforces non-negativity and regularizes against the excessive negative bias responsible for overfitting.

The intuition is that the negative part of the empirical risk does not contain meaningful information about classifier quality but rather reflects over-exploitation of finite-sample noise and model flexibility. The non-negative correction removes this spurious signal without sacrificing estimator consistency.
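A matching NumPy sketch of the clipped estimator (illustrative names, sigmoid surrogate loss). Evaluated on a degenerate classifier that memorizes the positive sample, the clipped term activates and the estimate stays non-negative:

```python
import numpy as np

def sigmoid_loss(margin):
    # Sigmoid loss with margin = y * g(x); l(z, +1) + l(z, -1) = 1.
    return 1.0 / (1.0 + np.exp(margin))

def nnpu_risk(g, x_pos, x_unl, prior):
    """Non-negative PU risk: p*R^+ + max(0, U^- - p*R^-)."""
    r_plus = sigmoid_loss(+g(x_pos)).mean()   # R^+: positive-label loss on P
    r_minus = sigmoid_loss(-g(x_pos)).mean()  # R^-: negative-label loss on P
    u_minus = sigmoid_loss(-g(x_unl)).mean()  # U^-: negative-label loss on U
    return prior * r_plus + max(0.0, u_minus - prior * r_minus)

rng = np.random.default_rng(0)
x_p = rng.normal(+1.0, 1.0, 500)                     # positive sample
x_u = np.concatenate([rng.normal(+1.0, 1.0, 500),
                      rng.normal(-1.0, 1.0, 500)])   # unlabeled mixture, p = 0.5

# Over-flexible "memorizer" that fits the positive sample exactly:
memorize = lambda x: np.where(np.isin(np.round(x, 6), np.round(x_p, 6)), 50.0, -50.0)

print(nnpu_risk(memorize, x_p, x_u, prior=0.5))      # clipped at zero, never negative
```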

4. Theoretical Guarantees

Bias and Consistency

Under mild conditions (bounded loss $\ell$, and the true negative-class term bounded away from zero, i.e., $n\, R^-(g) \geq \alpha > 0$ for some $\alpha$), the bias of the non-negative estimator decays exponentially with sample size:

$$0 \leq \mathbb{E}[\widetilde R_{\rm PU}(g)] - R(g) \leq C_\ell\, p\, \Delta_g,$$

where $C_\ell$ bounds the loss and $\Delta_g$ is exponentially small in the sample size. Moreover, the estimation error satisfies, with probability at least $1-\delta$,

$$|\widetilde R_{\rm PU}(g) - R(g)| \leq C_\ell \sqrt{\frac{\ln(2/\delta)}{2}} \left( \frac{2p}{\sqrt{n_+}} + \frac{1}{\sqrt{n_u}} \right) + C_\ell\, p\, \Delta_g,$$

where $n_+$ and $n_u$ are the numbers of positive and unlabeled samples.

Mean-Squared Error

The non-negative estimator achieves strictly lower mean-squared error (MSE) than the unbiased estimator when $\ell$ satisfies $\ell(t, +1) + \ell(t, -1) = 1$ (as the sigmoid loss does) and under other mild assumptions. Quantitatively,

$$\mathrm{MSE}(\widetilde R_{\rm PU}) < \mathrm{MSE}(\widehat R_{\rm PU}).$$

If a tolerance $\beta \geq 0$ is accepted, the reduction in MSE is bounded below by

$$\mathrm{MSE}(\widehat R_{\rm PU}) - \mathrm{MSE}(\widetilde R_{\rm PU}) \geq 3 \beta^2 \Pr\left\{ \widehat R_{\rm PU}(g) - \widetilde R_{\rm PU}(g) > \beta \right\}.$$

Generalization and Estimation Error

Let $g^*$ denote the true risk minimizer and $\widehat{g}$ the empirical minimizer of $\widetilde{R}_{\rm PU}$ over a hypothesis class $\mathcal{G}$. With standard complexity controls (Lipschitz loss, bounded Rademacher complexities), the following bound holds with high probability:

$$R(\widehat g) - R(g^*) \leq O(\mathfrak{R}_{n_+,p}(\mathcal{G})) + O(\mathfrak{R}_{n_u,p}(\mathcal{G})) + O(p\, \Delta_g) + O\left( \frac{p}{\sqrt{n_+}} + \frac{1}{\sqrt{n_u}} \right),$$

demonstrating that the learning rate is unaffected by the non-negative correction.

5. Algorithmic Implementation

In large-scale scenarios, the non-negativity enforcement is incorporated into minibatch-based stochastic optimization. The core procedure maintains the clipping behavior inside each optimization step, supporting efficient execution with deep nets.

PU-ERM Algorithm (Minibatch-based):

for each minibatch (P^i, U^i):
    r_i = U_minus(g; U^i) - p * R_minus(g; P^i)   # estimated negative-class part
    if r_i >= -β:                                 # clip inactive: ordinary gradient step
        grad = ∇θ [ p * R_plus(g; P^i) + r_i ]
        step_size = η
    else:                                         # clip active: defensive step back along r_i
        grad = ∇θ [ -r_i ]
        step_size = γ * η
    θ = optimizer_update(θ, grad, step_size)
Here, $\widehat R^+$, $\widehat R^-$, and $\widehat U^-$ denote per-minibatch empirical averages; $\beta$ is a tolerance parameter, and $\gamma$ discounts the learning rate when the estimated risk would fall below the threshold. Empirically, this operation is computationally negligible in modern frameworks.
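The loop above can be made concrete for a scalar linear model $g(x) = wx$ with analytic gradients. The following NumPy sketch is illustrative only (names and toy data are not from the paper's code; it uses the sigmoid loss, $\beta = 0$, $\gamma = 1$):

```python
import numpy as np

def loss_and_grad(w, x, y):
    # Sigmoid loss l = 1/(1 + exp(y*w*x)) and its gradient in w, averaged over x.
    m = np.clip(y * w * x, -60.0, 60.0)       # clip margins to avoid exp overflow
    l = 1.0 / (1.0 + np.exp(m))
    dl_dw = (-l * (1.0 - l)) * y * x          # chain rule through the margin
    return l.mean(), dl_dw.mean()

def nnpu_step(w, x_p, x_u, prior, eta=0.1, beta=0.0, gamma=1.0):
    """One minibatch update following the clipped-gradient scheme above."""
    _, g_plus = loss_and_grad(w, x_p, +1)          # R^+ on P
    r_minus, g_minus = loss_and_grad(w, x_p, -1)   # R^- on P
    u_minus, g_u = loss_and_grad(w, x_u, -1)       # U^- on U
    r_i = u_minus - prior * r_minus                # estimated negative-class part
    if r_i >= -beta:                               # clip inactive: ordinary gradient
        grad = prior * g_plus + (g_u - prior * g_minus)
        step = eta
    else:                                          # clip active: step back along r_i
        grad = -(g_u - prior * g_minus)
        step = gamma * eta
    return w - step * grad

rng = np.random.default_rng(1)
x_p = rng.normal(+1.0, 1.0, 500)                   # positive sample
x_u = np.concatenate([rng.normal(+1.0, 1.0, 500),
                      rng.normal(-1.0, 1.0, 500)]) # unlabeled mixture, p = 0.5
w = 0.0
for _ in range(200):
    w = nnpu_step(w, x_p, x_u, prior=0.5)
# The learned w ends up positive, i.e. g scores the positive class higher.
```

In a real deep-learning setting the same branch logic is applied per minibatch, with the framework's autograd supplying the two gradients.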

6. Experimental Evaluation and Empirical Findings

Evaluation was conducted on diverse datasets: MNIST (even vs. odd), epsilon (LIBSVM), 20Newsgroups (binary subset), and CIFAR-10 (vehicles vs. animals). The positive prior $p$ ranged from 0.4 to 0.5; models included 6-layer MLPs, embedding-based text networks, and deep CNNs with over 13 layers, using the sigmoid surrogate loss and $\ell_2$ regularization. Both Adam and AdaGrad optimizers were used, with typical minibatch sizes (e.g., 128).

Key empirical observations:

  • The unbiased PU estimator ("uPU") produces negative and unbounded empirical risks, resulting in pronounced overfitting as model flexibility increases.
  • The non-negative estimator ("nnPU") consistently maintains the training risk above zero, with no observed overfitting, even for deep networks or limited positive samples.
  • nnPU often yields lower test errors compared to both the unbiased PU estimator and traditional positive–negative (PN) learners, particularly when negative labels are scarce.
  • nnPU is robust to mild over-estimation of the class prior $p$, whereas under-estimation degrades performance more significantly.

7. Implications and Practical Significance

The introduction of the non-negative risk estimator fundamentally addresses the risk negativity inherent in standard PU learning workflows, especially in deep learning contexts. This enables the design of robust YNB classifiers in settings where negative-labeled data is unavailable or expensive to acquire, extending the practical applicability of PU learning. The consistent empirical improvements and preserved theoretical guarantees underline the efficacy and generalizability of this approach. A plausible implication is that the non-negativity principle may serve as a general regularization tool in other semi-supervised or weakly supervised risk estimation settings.

References

Kiryo, R., Niu, G., du Plessis, M. C., & Sugiyama, M. (2017). Positive-Unlabeled Learning with Non-Negative Risk Estimator. Advances in Neural Information Processing Systems 30 (NIPS 2017).