
RLNLC: Reinforcement Learning for Noisy Label Correction

Updated 2 December 2025
  • The paper introduces a novel framework that formulates noisy label correction as an MDP and demonstrates significant accuracy improvements (e.g., 95.8% on CIFAR10-IDN) over prior baselines.
  • RLNLC is a reinforcement learning method that integrates state and action space definitions with deterministic transitions, leveraging deep feature extraction to iteratively cleanse labels.
  • The method employs a composite reward design combining label consistency and noisy label alignment with a k-NN attention mechanism to robustly guide the actor-critic optimization process.

Reinforcement Learning for Noisy Label Correction (RLNLC) is a methodology that formulates the correction of noisy labels in supervised learning as an explicit Markov decision process (MDP). It architecturally aligns the label cleansing procedure with a reinforcement learning (RL) paradigm, leveraging an actor-critic framework. The approach is instantiated by defining state and action spaces over the full dataset and possible label correction operations, a deterministic transition model, and a composite reward function that robustly quantifies label quality. RLNLC deploys a deep feature representation-based policy, optimized via actor-critic reinforcement learning, to iteratively improve label fidelity, thereby facilitating more reliable downstream model training. Extensive empirical benchmarks demonstrate substantial improvements over established noisy-label baselines, particularly in high-noise regimes (Heidari et al., 25 Nov 2025).

1. Markov Decision Process Formulation

RLNLC models the noisy label correction process as an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$:

  • State space ($\mathcal{S}$): At RL step $t$, the state is the entire dataset $s^t = (X, \widehat{Y}^t)$, with each $\widehat{\mathbf{y}}^t_i$ being a current one-hot or soft label. The initial state $s^0$ is formed by further randomly corrupting a subset of the originally noisy labels to promote exploratory behavior among policies.
  • Action space ($\mathcal{A}$): Each action $a^t$ is a binary vector $[a^t_1, \dots, a^t_N]$, determining for each datapoint whether its label should be corrected (1: replace with the k-NN–predicted soft label; 0: retain the current label).
  • Transition model ($P$): Transitions are deterministic; for each $i$,

$$\widehat{\mathbf{y}}^{t+1}_i = \begin{cases} \widehat{\mathbf{y}}^{t}_i & \text{if } a^t_i = 0, \\ \bar{\mathbf{y}}_i & \text{if } a^t_i = 1, \end{cases}$$

where $\bar{\mathbf{y}}_i$ is computed via an attention-weighted $k$-NN aggregation over the feature embedding space.

  • Reward ($\mathcal{R}(s^t, a^t)$): Composite of two terms (a minimal sketch of this computation follows the list below):
    • Label Consistency Reward (LCR): Quantifies overall label smoothness as the expected negative KL divergence between each updated label and its k-NN attention-aggregated neighbors in a frozen feature space $f_\omega$:

    $$\mathcal{R}_{\mathrm{LCR}}(s^t, a^t) = -\frac{1}{N}\sum_{i=1}^N \mathrm{KL}\Big(\widehat{\mathbf{y}}^{t+1}_i \,\Big\Vert\, \sum_{j\in \mathcal{N}_\omega(\mathbf{x}_i)} \alpha_{ij}\, \widehat{\mathbf{y}}^{t+1}_j\Big)$$

    • Noisy Label Alignment Reward (NLA): For corrected ("noisy") labels, the negative KL divergence to a k-NN attention aggregate over "clean" points:

    $$\mathcal{R}_{\mathrm{NLA}}(s^t, a^t) = -\frac{1}{|\mathcal{D}^{t+1}_{\mathrm{noi}}|} \sum_{i\in \mathcal{D}^{t+1}_{\mathrm{noi}}} \mathrm{KL}\Big(\widehat{\mathbf{y}}^{t+1}_i \,\Big\Vert\, \sum_{j\in\mathcal{N}_{\mathrm{cle}}(\mathbf{x}_i)} \alpha_{ij}\, \widehat{\mathbf{y}}^{t+1}_j\Big)$$

    • Final reward: the exponential combination

    $$\mathcal{R}(s^t, a^t) = \exp\big(\mathcal{R}_{\mathrm{LCR}}(s^t, a^t) + \lambda\, \mathcal{R}_{\mathrm{NLA}}(s^t, a^t)\big) \in (0, 1],$$

    with $\lambda$ modulating the contribution of NLA.

  • Discount factor ($\gamma$): Default $\gamma = 0.9$.
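
The following is a minimal numpy sketch of the deterministic transition and the composite reward described above. It assumes array shapes and precomputed neighbor indices/attention weights; the helper names are illustrative conveniences, not interfaces from the paper.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Row-wise KL(p || q) between batches of categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def apply_action(Y_cur, Y_bar, action):
    """Deterministic transition: replace label i with its k-NN soft label when a_i = 1."""
    return np.where(action[:, None] == 1, Y_bar, Y_cur)

def composite_reward(Y_next, nbr_idx, nbr_w, noisy_idx, clean_nbr_idx, clean_nbr_w, lam=0.5):
    """Composite reward R(s^t, a^t) = exp(R_LCR + lam * R_NLA), which lies in (0, 1].

    Y_next                     : (N, C) soft labels after the transition (state s^{t+1}).
    nbr_idx, nbr_w             : (N, k) k-NN indices and attention weights in the frozen f_omega space.
    noisy_idx                  : indices of samples treated as "noisy" (corrected) at this step.
    clean_nbr_idx, clean_nbr_w : (N, k) neighbors restricted to "clean" samples.
    """
    # Label Consistency Reward: negative KL between each label and its
    # attention-aggregated neighborhood label, averaged over all samples.
    nbr_agg = np.einsum('nk,nkc->nc', nbr_w, Y_next[nbr_idx])
    r_lcr = -kl_div(Y_next, nbr_agg).mean()

    # Noisy Label Alignment Reward: the same quantity, restricted to corrected
    # labels and to neighborhoods drawn from clean samples only.
    clean_agg = np.einsum('mk,mkc->mc', clean_nbr_w[noisy_idx], Y_next[clean_nbr_idx[noisy_idx]])
    r_nla = -kl_div(Y_next[noisy_idx], clean_agg).mean()

    return float(np.exp(r_lcr + lam * r_nla))
```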

2. Actor–Critic Architecture and Learning Objectives

  • Policy network (actor): The actor's sole learnable parameters are $\theta$, the weights of the feature extractor $f_\theta$ (a ResNet variant). Initial pre-training of $f_\theta$ and the classifier $h_\psi$ is performed with standard cross-entropy on the raw noisy data; only $f_\theta$ is updated during RL.

  • Action probability for label $i$ is defined by the discrepancy between the k-NN–softmaxed label $\bar{\mathbf{y}}_i$ and the current label (see the sketch following this list):

$$p_i = \frac{\sum_{c:\,\bar y_{i,c} > \bar y_{i,\widehat y_i}} \bar y_{i,c}}{\sum_{c:\,\bar y_{i,c} \ge \bar y_{i,\widehat y_i}} \bar y_{i,c}}, \quad p_i \in [0, 1]$$

  • $a^t_i \sim \mathrm{Bernoulli}(p_i)$.

  • Critic network ($Q_\phi$): Estimates the expected return $Q(s, a)$ using a multilayer perceptron over a "binned" histogram vector of per-sample label smoothness.

  • Actor objective:

$$J(\theta) = \mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\big[\log \pi_\theta(a \mid s)\, Q_\phi(s, a)\big]$$

Parameters are updated by policy-gradient ascent with step size $\beta_\theta$.

  • Critic objective (SARSA TD):

$$\delta^t = \mathcal{R}(s^t, a^t) + \gamma\, Q_\phi(s^{t+1}, a^{t+1}) - Q_\phi(s^t, a^t)$$

$$\phi \leftarrow \phi + \beta\, \delta^t\, \nabla_\phi Q_\phi(s^t, a^t)$$
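
A compact sketch of the action-probability rule, the Bernoulli action sampling, and the SARSA TD error that drives the critic update. Array shapes, helper names, and the toy usage values are assumptions made for illustration.

```python
import numpy as np

def action_probabilities(Y_bar, cur_cls):
    """p_i: mass of k-NN soft-label classes strictly above the currently assigned
    class, normalized by the mass of classes that at least tie it.

    Y_bar   : (N, C) attention-weighted k-NN soft labels.
    cur_cls : (N,)   index of the currently assigned class for each sample.
    """
    ref = Y_bar[np.arange(len(Y_bar)), cur_cls][:, None]    # \bar{y}_{i, \hat{y}_i}
    num = np.where(Y_bar > ref, Y_bar, 0.0).sum(axis=1)
    den = np.where(Y_bar >= ref, Y_bar, 0.0).sum(axis=1)
    return num / den                                         # p_i in [0, 1]

def sample_actions(p, rng):
    """a^t_i ~ Bernoulli(p_i): 1 = replace label with its k-NN soft label, 0 = keep."""
    return (rng.random(p.shape) < p).astype(np.int64)

def sarsa_td_error(reward, q_sa, q_next_sa, gamma=0.9):
    """One-step SARSA TD error delta^t; the critic update is
    phi <- phi + beta * delta^t * grad_phi Q_phi(s^t, a^t)."""
    return reward + gamma * q_next_sa - q_sa

# Illustrative usage with toy soft labels.
rng = np.random.default_rng(0)
Y_bar = rng.dirichlet(np.ones(10), size=128)
p = action_probabilities(Y_bar, cur_cls=rng.integers(0, 10, size=128))
a = sample_actions(p, rng)
```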

3. Algorithmic Procedure

The RLNLC training procedure operates as follows:

  1. Initialization:

    • Pre-train the feature extractor and classifier on the raw noisy data.
    • Form $s^0$ by additional random label corruption to encourage exploration.
  2. Reinforcement learning loop (for $T$ RL steps per epoch, repeated for multiple epochs):
    • Compute k-NN neighborhoods and attention weights for all samples.
    • Compute action probabilities and sample $a^t$.
    • Apply the deterministic transitions to update labels.
    • Evaluate the reward components (LCR, NLA, and the final reward).
    • Update the actor and critic parameters using their respective gradients.
  3. Deployment: after RL convergence, run the learned policy for $T'$ cleaning steps starting from the original noisy labels.
  4. Fine-tuning: train the classifier $h_\psi \circ f_\theta$ on the cleaned dataset $(X, \widehat{Y}^{T'})$.

Default hyperparameters:

$k=10$, $\tau=0.5$, $\lambda=0.5$, $\gamma=0.9$, $N_b=100$, $T=10$, $T'=25$, SGD momentum 0.9, batch size 128, weight decay $5 \times 10^{-4}$, $\beta_\theta = 0.01$, $\beta = 0.01$ (Heidari et al., 25 Nov 2025).
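
These defaults can be collected into a single configuration object. A minimal sketch follows; the values are as reported, while the class and field names are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class RLNLCConfig:
    """Default RLNLC hyperparameters as reported; names are illustrative."""
    k: int = 10               # nearest neighbors used for attention
    tau: float = 0.5          # attention temperature
    lam: float = 0.5          # lambda: weight of the NLA reward term
    gamma: float = 0.9        # discount factor
    n_bins: int = 100         # N_b (assumed: bins of the critic's smoothness histogram)
    rl_steps: int = 10        # T: RL steps per epoch
    clean_steps: int = 25     # T': cleaning steps at deployment
    momentum: float = 0.9     # SGD momentum
    batch_size: int = 128
    weight_decay: float = 5e-4
    beta_theta: float = 0.01  # actor step size
    beta: float = 0.01        # critic step size
```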

4. Empirical Evaluation and Results

RLNLC has been benchmarked across both synthetic and real noisy label scenarios:

| Dataset | Noise type / rate | RLNLC accuracy | Best prior baseline | Baseline accuracy |
|---|---|---|---|---|
| CIFAR10-IDN | 50% IDN | 95.8% | SSR | 94.1% |
| CIFAR100-IDN | 50% IDN | 74.7% | SSR | 72.8% |
| Animal-10N | ≈8% real | 90.2% | SURE | 89.0% |
| Food-101N | ≈20% web | 89.2% | LongReMix | 87.3% |
| CIFAR100 | 90% symmetric | 44.2% | DivideMix | 31.0% |
| CIFAR10 | 90% symmetric | 82.1% | DivideMix | 75.4% |
  • Datasets include CIFAR10-IDN, CIFAR100-IDN (instance-dependent noise), Animal-10N (real noise), Food-101N (web noise), and class-conditional symmetric noise scenarios.
  • RLNLC outperforms standard baselines such as CE, DivideMix, SSR, LongReMix, SURE, Decoupling, Co-teaching(+), MentorNet, CausalNL, CleanNet, PLC, and others.
  • Ablation studies show that both the LCR and NLA rewards are critical: removing either costs 2–3 percentage points of accuracy. Removing initial-state randomization or the decoupling of $f_\omega$ from $f_\theta$ each yields a ≈1–2pp drop in accuracy.

5. Reward Design and k-NN Attention Mechanism

RLNLC's reward structure utilizes k-NN–based attention mechanisms in both dynamic label update and evaluation:

  • k-NN attention: For each sample, a soft label is formed by aggregating its k-NN label vectors via attention weighting:

$$\bar{\mathbf{y}}_i = \sum_{j \in \mathcal{N}(\mathbf{x}_i)} \alpha_{ij}\, \widehat{\mathbf{y}}^t_j$$

$$\alpha_{ij} = \frac{\exp\big(\mathrm{sim}(f_\theta(\mathbf{x}_i), f_\theta(\mathbf{x}_j))/\tau\big)}{\sum_{j'} \exp\big(\mathrm{sim}(f_\theta(\mathbf{x}_i), f_\theta(\mathbf{x}_{j'}))/\tau\big)}$$

where $\mathrm{sim}$ denotes cosine similarity and $\tau$ is the attention temperature (a sketch of this computation follows at the end of this section).

  • Label Consistency: Negative KL-divergence between updated and attention-averaged labels measures label smoothness after correction.
  • Noisy Label Alignment: For corrected labels, the KL-divergence to “clean” label aggregations encourages newly assigned soft labels to conform to high-confidence exemplars.
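
A minimal numpy sketch of the attention-weighted k-NN soft-label computation above. Array shapes, the cosine-similarity implementation, and the exclusion of each point as its own neighbor are assumptions made for illustration.

```python
import numpy as np

def knn_attention_soft_labels(feats, Y, k=10, tau=0.5):
    """Attention-weighted k-NN soft labels \bar{y}_i.

    feats : (N, d) feature embeddings f_theta(x_i).
    Y     : (N, C) current soft or one-hot labels \hat{y}_i^t.
    """
    # Cosine similarity between all pairs of embeddings.
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)                  # exclude each point as its own neighbor

    # Top-k neighbors and temperature-scaled softmax attention weights.
    nbr = np.argpartition(-sim, k, axis=1)[:, :k]   # (N, k) neighbor indices
    logits = np.take_along_axis(sim, nbr, axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)       # (N, k) attention weights

    return np.einsum('nk,nkc->nc', alpha, Y[nbr])   # (N, C) soft labels \bar{y}_i
```

Per the description above, the same attention construction feeds both the replacement labels used in the transition and the reward terms, where the reward-side features come from the frozen extractor copy $f_\omega$.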

6. Comparative Analysis and Ablations

Comprehensive ablation experiments indicate:

  • Both LCR and NLA reward terms are crucial; removing either results in 2–3 percentage point test accuracy drops on CIFAR100-IDN.
  • Initial-state randomization and keeping a fixed $f_\omega$ for reward computation are each incrementally beneficial (~1–2pp).
  • The method’s design—large state space, k-NN attention, and hybrid exploration via label re-corruption—provides systematic improvements under diverse, high-noise conditions.

7. Implementation Summary and Reproducibility

The RLNLC framework is fully specified via MDP components, policy/critic architectures, training loops, and hyperparameter choices, supporting complete reproducibility (Heidari et al., 25 Nov 2025). The core implementation consists of:

  • Pre-training, RL actor–critic optimization, deployment for label cleaning, and final supervised fine-tuning.
  • All essential mathematical expressions, pseudocode, and procedural steps are delineated for faithful re-implementation.
  • Evaluation protocols, baseline comparisons, and ablation structures are aligned with prevailing benchmarks for noisy-label learning research.

The formalization of noisy label correction as an actor-critic reinforcement learning problem, with explicit MDP construction, composite reward mechanisms, and the integration of k-NN attention over learned representations, distinguishes RLNLC within the landscape of label-noise robust learning. Performance across challenging real and synthetic settings, coupled with thorough ablations, underscores the method’s empirical and methodological contributions (Heidari et al., 25 Nov 2025).
