
RLNLC: Reinforcement Learning for Noisy Label Correction

Updated 2 December 2025
  • The paper introduces a novel framework that formulates noisy label correction as an MDP and demonstrates significant accuracy improvements (e.g., 95.8% on CIFAR10-IDN) over prior baselines.
  • RLNLC is a reinforcement learning method that integrates state and action space definitions with deterministic transitions, leveraging deep feature extraction to iteratively cleanse labels.
  • The method employs a composite reward design combining label consistency and noisy label alignment with a k-NN attention mechanism to robustly guide the actor-critic optimization process.

Reinforcement Learning for Noisy Label Correction (RLNLC) is a methodology that formulates the correction of noisy labels in supervised learning as an explicit Markov decision process (MDP). It architecturally aligns the label cleansing procedure with a reinforcement learning (RL) paradigm, leveraging an actor-critic framework. The approach is instantiated by defining state and action spaces over the full dataset and possible label correction operations, a deterministic transition model, and a composite reward function that robustly quantifies label quality. RLNLC deploys a deep feature representation-based policy, optimized via actor-critic reinforcement learning, to iteratively improve label fidelity, thereby facilitating more reliable downstream model training. Extensive empirical benchmarks demonstrate substantial improvements over established noisy-label baselines, particularly in high-noise regimes (Heidari et al., 25 Nov 2025).

1. Markov Decision Process Formulation

RLNLC models the noisy label correction process as an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$:

  • State space ($\mathcal{S}$): At RL step $t$, the state is the entire dataset $s^t = (X, \widehat{Y}^t)$, with each $\widehat{\mathbf{y}}^t_i$ being a current one-hot or soft label. The initial state $s^0$ is formed by further randomly corrupting a subset of the originally noisy labels to promote exploratory behavior among policies.
  • Action space ($\mathcal{A}$): Each action $a^t$ is a binary vector $[a^t_1, \dots, a^t_N]$, determining for each datapoint whether its label should be corrected (1: replace with the k-NN–predicted soft label; 0: retain the current label).
  • Transition model ($P$): Transitions are deterministic; for each $i$,

$$\widehat{\mathbf{y}}^{t+1}_i = \begin{cases} \widehat{\mathbf{y}}^{t}_i & \text{if } a^t_i = 0, \\ \bar{\mathbf{y}}_i & \text{if } a^t_i = 1, \end{cases}$$

where $\bar{\mathbf{y}}_i$ is computed via an attention-weighted $k$-NN aggregation over the feature embedding space.

  • Reward ($\mathcal{R}(s^t, a^t)$): Composite of two terms (a minimal sketch of this computation follows the list below):
    • Label Consistency Reward (LCR): Quantifies overall label smoothness as the expected negative KL divergence between each updated label and its k-NN attention-aggregated neighbors in a frozen feature space $f_\omega$:

    $$\mathcal{R}_{\mathrm{LCR}}(s^t, a^t) = -\frac{1}{N}\sum_{i=1}^N \mathrm{KL}\Big(\widehat{\mathbf{y}}^{t+1}_i \,\Big\Vert\, \sum_{j\in \mathcal{N}_\omega(\mathbf{x}_i)} \alpha_{ij}\, \widehat{\mathbf{y}}^{t+1}_j\Big)$$

    • Noisy Label Alignment Reward (NLA): For corrected ("noisy") labels, the negative KL divergence to a k-NN attention aggregate over "clean" points:

    $$\mathcal{R}_{\mathrm{NLA}}(s^t, a^t) = -\frac{1}{|\mathcal{D}^{t+1}_{\mathrm{noi}}|} \sum_{i\in \mathcal{D}^{t+1}_{\mathrm{noi}}} \mathrm{KL}\Big(\widehat{\mathbf{y}}^{t+1}_i \,\Big\Vert\, \sum_{j\in\mathcal{N}_{\mathrm{cle}}(\mathbf{x}_i)} \alpha_{ij}\, \widehat{\mathbf{y}}^{t+1}_j\Big)$$

    • Final reward: the exponential combination

    $$\mathcal{R}(s^t, a^t) = \exp\big(\mathcal{R}_{\mathrm{LCR}}(s^t, a^t) + \lambda\, \mathcal{R}_{\mathrm{NLA}}(s^t, a^t)\big) \in (0, 1],$$

    with $\lambda$ modulating the contribution of NLA.

  • Discount factor ($\gamma$): Default $\gamma = 0.9$.
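
The following is a minimal numpy sketch of the deterministic transition and the composite reward described above. It assumes array shapes and precomputed neighbor indices/attention weights; the helper names are illustrative conveniences, not interfaces from the paper.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Row-wise KL(p || q) between batches of categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def apply_action(Y_cur, Y_bar, action):
    """Deterministic transition: replace label i with its k-NN soft label when a_i = 1."""
    return np.where(action[:, None] == 1, Y_bar, Y_cur)

def composite_reward(Y_next, nbr_idx, nbr_w, noisy_idx, clean_nbr_idx, clean_nbr_w, lam=0.5):
    """Composite reward R(s^t, a^t) = exp(R_LCR + lam * R_NLA), which lies in (0, 1].

    Y_next                     : (N, C) soft labels after the transition (state s^{t+1}).
    nbr_idx, nbr_w             : (N, k) k-NN indices and attention weights in the frozen f_omega space.
    noisy_idx                  : indices of samples treated as "noisy" (corrected) at this step.
    clean_nbr_idx, clean_nbr_w : (N, k) neighbors restricted to "clean" samples.
    """
    # Label Consistency Reward: negative KL between each label and its
    # attention-aggregated neighborhood label, averaged over all samples.
    nbr_agg = np.einsum('nk,nkc->nc', nbr_w, Y_next[nbr_idx])
    r_lcr = -kl_div(Y_next, nbr_agg).mean()

    # Noisy Label Alignment Reward: the same quantity, restricted to corrected
    # labels and to neighborhoods drawn from clean samples only.
    clean_agg = np.einsum('mk,mkc->mc', clean_nbr_w[noisy_idx], Y_next[clean_nbr_idx[noisy_idx]])
    r_nla = -kl_div(Y_next[noisy_idx], clean_agg).mean()

    return float(np.exp(r_lcr + lam * r_nla))
```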

2. Actor–Critic Architecture and Learning Objectives

  • Policy network (actor): The actor's sole learnable parameters are $\theta$, the weights of the feature extractor $f_\theta$ (a ResNet variant). Initial pre-training of $f_\theta$ and the classifier $h_\psi$ is performed with standard cross-entropy on the raw noisy data; only $f_\theta$ is updated during RL.

  • Action probability for label $i$ is defined by the discrepancy between the k-NN–softmaxed label $\bar{\mathbf{y}}_i$ and the current label (see the sketch following this list):

$$p_i = \frac{\sum_{c:\,\bar y_{i,c} > \bar y_{i,\widehat y_i}} \bar y_{i,c}}{\sum_{c:\,\bar y_{i,c} \ge \bar y_{i,\widehat y_i}} \bar y_{i,c}}, \quad p_i \in [0, 1]$$

  • $a^t_i \sim \mathrm{Bernoulli}(p_i)$.

  • Critic network ($Q_\phi$): Estimates the expected return $Q(s, a)$ using a multilayer perceptron over a "binned" histogram vector of per-sample label smoothness.

  • Actor objective:

$$J(\theta) = \mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\big[\log \pi_\theta(a \mid s)\, Q_\phi(s, a)\big]$$

Parameters are updated by policy-gradient ascent with step size $\beta_\theta$.

  • Critic objective (SARSA TD):

$$\delta^t = \mathcal{R}(s^t, a^t) + \gamma\, Q_\phi(s^{t+1}, a^{t+1}) - Q_\phi(s^t, a^t)$$

$$\phi \leftarrow \phi + \beta\, \delta^t\, \nabla_\phi Q_\phi(s^t, a^t)$$
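
A compact sketch of the action-probability rule, the Bernoulli action sampling, and the SARSA TD error that drives the critic update. Array shapes, helper names, and the toy usage values are assumptions made for illustration.

```python
import numpy as np

def action_probabilities(Y_bar, cur_cls):
    """p_i: mass of k-NN soft-label classes strictly above the currently assigned
    class, normalized by the mass of classes that at least tie it.

    Y_bar   : (N, C) attention-weighted k-NN soft labels.
    cur_cls : (N,)   index of the currently assigned class for each sample.
    """
    ref = Y_bar[np.arange(len(Y_bar)), cur_cls][:, None]    # \bar{y}_{i, \hat{y}_i}
    num = np.where(Y_bar > ref, Y_bar, 0.0).sum(axis=1)
    den = np.where(Y_bar >= ref, Y_bar, 0.0).sum(axis=1)
    return num / den                                         # p_i in [0, 1]

def sample_actions(p, rng):
    """a^t_i ~ Bernoulli(p_i): 1 = replace label with its k-NN soft label, 0 = keep."""
    return (rng.random(p.shape) < p).astype(np.int64)

def sarsa_td_error(reward, q_sa, q_next_sa, gamma=0.9):
    """One-step SARSA TD error delta^t; the critic update is
    phi <- phi + beta * delta^t * grad_phi Q_phi(s^t, a^t)."""
    return reward + gamma * q_next_sa - q_sa

# Illustrative usage with toy soft labels.
rng = np.random.default_rng(0)
Y_bar = rng.dirichlet(np.ones(10), size=128)
p = action_probabilities(Y_bar, cur_cls=rng.integers(0, 10, size=128))
a = sample_actions(p, rng)
```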

3. Algorithmic Procedure

The RLNLC training procedure operates as follows:

  1. Initialization:

    • Pre-train the feature extractor and classifier on the raw noisy data.
    • Form $s^0$ by additional random label corruption to encourage exploration.
  2. Reinforcement learning loop (for $T$ RL steps per epoch, repeated for multiple epochs):
    • Compute k-NN neighborhoods and attention weights for all samples.
    • Compute action probabilities and sample $a^t$.
    • Apply the deterministic transitions to update labels.
    • Evaluate the reward components (LCR, NLA, and the final reward).
    • Update the actor and critic parameters using their respective gradients.
  3. Deployment: after RL convergence, run the learned policy for $T'$ cleaning steps starting from the original noisy labels.
  4. Fine-tuning: train the classifier $h_\psi \circ f_\theta$ on the cleaned dataset $(X, \widehat{Y}^{T'})$.

Default hyperparameters:

$k=10$, $\tau=0.5$, $\lambda=0.5$, $\gamma=0.9$, $N_b=100$, $T=10$, $T'=25$, SGD momentum 0.9, batch size 128, weight decay $5 \times 10^{-4}$, $\beta_\theta = 0.01$, $\beta = 0.01$ (Heidari et al., 25 Nov 2025).
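
These defaults can be collected into a single configuration object. A minimal sketch follows; the values are as reported, while the class and field names are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class RLNLCConfig:
    """Default RLNLC hyperparameters as reported; names are illustrative."""
    k: int = 10               # nearest neighbors used for attention
    tau: float = 0.5          # attention temperature
    lam: float = 0.5          # lambda: weight of the NLA reward term
    gamma: float = 0.9        # discount factor
    n_bins: int = 100         # N_b (assumed: bins of the critic's smoothness histogram)
    rl_steps: int = 10        # T: RL steps per epoch
    clean_steps: int = 25     # T': cleaning steps at deployment
    momentum: float = 0.9     # SGD momentum
    batch_size: int = 128
    weight_decay: float = 5e-4
    beta_theta: float = 0.01  # actor step size
    beta: float = 0.01        # critic step size
```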

4. Empirical Evaluation and Results

RLNLC has been benchmarked across both synthetic and real noisy label scenarios:

| Dataset | Noise type / rate | RLNLC accuracy | Best prior baseline | Baseline accuracy |
|---|---|---|---|---|
| CIFAR10-IDN | 50% IDN | 95.8% | SSR | 94.1% |
| CIFAR100-IDN | 50% IDN | 74.7% | SSR | 72.8% |
| Animal-10N | ≈8% real | 90.2% | SURE | 89.0% |
| Food-101N | ≈20% web | 89.2% | LongReMix | 87.3% |
| CIFAR100 | 90% symmetric | 44.2% | DivideMix | 31.0% |
| CIFAR10 | 90% symmetric | 82.1% | DivideMix | 75.4% |
  • Datasets include CIFAR10-IDN, CIFAR100-IDN (instance-dependent noise), Animal-10N (real noise), Food-101N (web noise), and class-conditional symmetric noise scenarios.
  • RLNLC outperforms standard baselines such as CE, DivideMix, SSR, LongReMix, SURE, Decoupling, Co-teaching(+), MentorNet, CausalNL, CleanNet, PLC, and others.
  • Ablation studies show that both the LCR and NLA rewards are critical: removing either costs 2–3 percentage points of accuracy. Removing initial-state randomization or the decoupling of $f_\omega$ from $f_\theta$ each yields a ≈1–2pp drop in accuracy.

5. Reward Design and k-NN Attention Mechanism

RLNLC's reward structure utilizes k-NN–based attention mechanisms in both dynamic label update and evaluation:

  • k-NN attention: For each sample, a soft label is formed by aggregating its k-NN label vectors via attention weighting:

$$\bar{\mathbf{y}}_i = \sum_{j \in \mathcal{N}(\mathbf{x}_i)} \alpha_{ij}\, \widehat{\mathbf{y}}^t_j$$

$$\alpha_{ij} = \frac{\exp\big(\mathrm{sim}(f_\theta(\mathbf{x}_i), f_\theta(\mathbf{x}_j))/\tau\big)}{\sum_{j'} \exp\big(\mathrm{sim}(f_\theta(\mathbf{x}_i), f_\theta(\mathbf{x}_{j'}))/\tau\big)}$$

where $\mathrm{sim}$ denotes cosine similarity and $\tau$ is the attention temperature (a sketch of this computation follows at the end of this section).

  • Label Consistency: Negative KL-divergence between updated and attention-averaged labels measures label smoothness after correction.
  • Noisy Label Alignment: For corrected labels, the KL-divergence to “clean” label aggregations encourages newly assigned soft labels to conform to high-confidence exemplars.
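
A minimal numpy sketch of the attention-weighted k-NN soft-label computation above. Array shapes, the cosine-similarity implementation, and the exclusion of each point as its own neighbor are assumptions made for illustration.

```python
import numpy as np

def knn_attention_soft_labels(feats, Y, k=10, tau=0.5):
    """Attention-weighted k-NN soft labels \bar{y}_i.

    feats : (N, d) feature embeddings f_theta(x_i).
    Y     : (N, C) current soft or one-hot labels \hat{y}_i^t.
    """
    # Cosine similarity between all pairs of embeddings.
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)                  # exclude each point as its own neighbor

    # Top-k neighbors and temperature-scaled softmax attention weights.
    nbr = np.argpartition(-sim, k, axis=1)[:, :k]   # (N, k) neighbor indices
    logits = np.take_along_axis(sim, nbr, axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)       # (N, k) attention weights

    return np.einsum('nk,nkc->nc', alpha, Y[nbr])   # (N, C) soft labels \bar{y}_i
```

Per the description above, the same attention construction feeds both the replacement labels used in the transition and the reward terms, where the reward-side features come from the frozen extractor copy $f_\omega$.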

6. Comparative Analysis and Ablations

Comprehensive ablation experiments indicate:

  • Both LCR and NLA reward terms are crucial; removing either results in 2–3 percentage point test accuracy drops on CIFAR100-IDN.
  • Initial-state randomization and keeping a fixed $f_\omega$ for reward computation are each incrementally beneficial (~1–2pp).
  • The method’s design—large state space, k-NN attention, and hybrid exploration via label re-corruption—provides systematic improvements under diverse, high-noise conditions.

7. Implementation Summary and Reproducibility

The RLNLC framework is fully specified via MDP components, policy/critic architectures, training loops, and hyperparameter choices, supporting complete reproducibility (Heidari et al., 25 Nov 2025). The core implementation consists of:

  • Pre-training, RL actor–critic optimization, deployment for label cleaning, and final supervised fine-tuning.
  • All essential mathematical expressions, pseudocode, and procedural steps are delineated for faithful re-implementation.
  • Evaluation protocols, baseline comparisons, and ablation structures are aligned with prevailing benchmarks for noisy-label learning research.

The formalization of noisy label correction as an actor-critic reinforcement learning problem, with explicit MDP construction, composite reward mechanisms, and the integration of k-NN attention over learned representations, distinguishes RLNLC within the landscape of label-noise robust learning. Performance across challenging real and synthetic settings, coupled with thorough ablations, underscores the method’s empirical and methodological contributions (Heidari et al., 25 Nov 2025).
