RLNLC: Reinforcement Learning for Noisy Label Correction
- The paper introduces a novel framework that formulates noisy label correction as an MDP and demonstrates significant accuracy improvements (e.g., 95.8% on CIFAR10-IDN) over prior baselines.
- RLNLC is a reinforcement learning method that integrates state and action space definitions with deterministic transitions, leveraging deep feature extraction to iteratively cleanse labels.
- The method employs a composite reward design combining label consistency and noisy label alignment with a k-NN attention mechanism to robustly guide the actor-critic optimization process.
Reinforcement Learning for Noisy Label Correction (RLNLC) is a methodology that formulates the correction of noisy labels in supervised learning as an explicit Markov decision process (MDP). It architecturally aligns the label cleansing procedure with a reinforcement learning (RL) paradigm, leveraging an actor-critic framework. The approach is instantiated by defining state and action spaces over the full dataset and possible label correction operations, a deterministic transition model, and a composite reward function that robustly quantifies label quality. RLNLC deploys a deep feature representation-based policy, optimized via actor-critic reinforcement learning, to iteratively improve label fidelity, thereby facilitating more reliable downstream model training. Extensive empirical benchmarks demonstrate substantial improvements over established noisy-label baselines, particularly in high-noise regimes (Heidari et al., 25 Nov 2025).
1. Markov Decision Process Formulation
RLNLC models the noisy label correction process as an MDP $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$:
- State space ($\mathcal{S}$): At RL step $t$, the state $s_t$ is the entire dataset $\mathcal{D}_t=\{(x_i,\tilde{y}_i^{\,t})\}_{i=1}^{N}$, with each $\tilde{y}_i^{\,t}$ being a current one-hot or soft label. The initial state $s_0$ is formed by further randomly corrupting a subset of the originally noisy labels to promote exploratory behavior of the policy.
- Action space ($\mathcal{A}$): Each action $a_t \in \{0,1\}^{N}$ is a binary vector, determining for each datapoint $x_i$ whether its label should be corrected ($a_{t,i}=1$: replace with the k-NN–predicted soft label; $a_{t,i}=0$: retain the current label).
- Transition model ($\mathcal{T}$): Transitions are deterministic; for each $i$, $\tilde{y}_i^{\,t+1} = a_{t,i}\,\hat{y}_i^{\,t} + (1-a_{t,i})\,\tilde{y}_i^{\,t}$, where the soft label $\hat{y}_i^{\,t}$ is computed via an attention-weighted $k$-NN aggregation over the feature embedding space.
- Reward ($R$): Composite of two terms:
- Label Consistency Reward (LCR): Quantifies overall label smoothness as the expected negative KL divergence between each updated label and its k-NN attention-aggregated neighbor label, computed in a frozen copy of the feature space.
- Noisy Label Alignment Reward (NLA): For corrected (“noisy”) labels, measures the KL divergence between each new soft label and the k-NN mean label over surrounding “clean” points, rewarding alignment with high-confidence exemplars.
- Final reward: An exponential combination of the LCR and NLA terms, with a weight $\lambda$ modulating the contribution of NLA.
- Discount factor ($\gamma$): A fixed default value is used throughout training.
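To make these components concrete, the following NumPy sketch implements the attention-weighted $k$-NN soft-label aggregation and the deterministic transition described above. The function names, default `k`, and temperature `tau` are illustrative assumptions, not the released implementation.

```python
import numpy as np

def knn_attention_soft_labels(features, labels, k=10, tau=0.1):
    """Attention-weighted k-NN soft labels (sketch; k and tau are illustrative)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                              # cosine similarities
    np.fill_diagonal(sim, -np.inf)             # exclude self from the neighborhood
    nn_idx = np.argsort(-sim, axis=1)[:, :k]   # indices of the k nearest neighbors

    soft = np.zeros_like(labels, dtype=float)
    for i in range(len(labels)):
        s = sim[i, nn_idx[i]] / tau
        w = np.exp(s - s.max())
        w /= w.sum()                           # softmax attention weights over neighbors
        soft[i] = w @ labels[nn_idx[i]]        # attention-aggregated soft label
    return soft

def transition(labels, actions, soft_labels):
    """Deterministic transition: replace label i with its k-NN soft label when
    a_i = 1, keep the current label otherwise."""
    a = actions[:, None].astype(float)
    return a * soft_labels + (1.0 - a) * labels
```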
2. Actor–Critic Architecture and Learning Objectives
- Policy network (actor): The actor’s sole learnable parameters are $\theta$, the weights of the feature extractor $f_\theta$ (a ResNet variant). Initial pre-training of $f_\theta$ and the classifier is performed using standard cross-entropy on the raw noisy data; only $\theta$ is updated during RL.
- Action probability: The probability $\pi_\theta(a_{t,i}=1 \mid s_t)$ of correcting label $i$ is determined by the discrepancy between the k-NN attention soft label $\hat{y}_i^{\,t}$ and the current label $\tilde{y}_i^{\,t}$; the larger the discrepancy, the more likely the correction.
- Critic network: Estimates the expected return using a multilayer perceptron applied to a “binned” histogram vector of per-sample label smoothness.
- Actor objective: Maximize the expected discounted return of the label-correction policy, $J(\theta)=\mathbb{E}_{\pi_\theta}\!\left[\sum_{t}\gamma^{t} r_t\right]$; parameters are updated by policy-gradient ascent with a fixed step size.
- Critic objective (SARSA TD): Minimize the squared temporal-difference error between the critic’s current return estimate and the SARSA target, i.e., the observed reward plus the discounted return estimate for the next state–action pair.
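A schematic PyTorch sketch of one actor–critic update consistent with the description above. The discrepancy-to-probability mapping, the histogram width `n_bins`, the placeholder `gamma`, and the use of a state-value critic as a stand-in for the SARSA target are assumptions for illustration; only the overall structure follows the text.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """MLP critic over a binned histogram of per-sample label smoothness (sketch)."""
    def __init__(self, n_bins=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, hist):
        return self.net(hist).squeeze(-1)      # scalar return estimate per state

def action_probs(knn_soft, labels):
    """Map the discrepancy between k-NN attention labels and current labels to a
    correction probability (illustrative mapping: total-variation distance)."""
    return (0.5 * (knn_soft - labels).abs().sum(dim=1)).clamp(0.0, 1.0)

def rl_update(probs, reward, hist_t, hist_tp1, critic, actor_opt, critic_opt, gamma=0.99):
    """One actor / critic update (sketch; gamma is a placeholder value)."""
    actions = torch.bernoulli(probs.detach())               # sample a_t ~ pi_theta(.|s_t)

    # Critic: squared SARSA-style TD error between consecutive return estimates.
    v_t, v_tp1 = critic(hist_t), critic(hist_tp1)
    td_error = reward + gamma * v_tp1.detach() - v_t
    critic_opt.zero_grad()
    (td_error ** 2).mean().backward()
    critic_opt.step()

    # Actor: policy-gradient ascent on the feature-extractor parameters only,
    # using the TD error as an advantage estimate.
    log_pi = actions * torch.log(probs + 1e-8) + (1 - actions) * torch.log(1 - probs + 1e-8)
    actor_loss = -(log_pi.sum() * td_error.detach().mean())
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actions
```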
3. Algorithmic Procedure
The RLNLC training procedure operates as follows:
Initialization:
- Pre-train feature extractor and classifier.
- Form the initial state $s_0$ by additional random corruption of a subset of the noisy labels to encourage exploration.
- Reinforcement learning loop (for $T$ RL steps per epoch, repeated for multiple epochs):
- Compute k-NN neighborhoods and attention for all samples.
- Compute action probabilities and sample actions $a_t$.
- Apply deterministic transitions to update labels.
- Evaluate reward components (LCR, NLA, final reward).
- Update actor and critic parameters using respective gradients.
- After RL convergence, deploy the learned policy for a fixed number of cleaning steps, starting from the original noisy labels.
- Fine-tune the classifier on the cleaned dataset.
Default hyperparameters: SGD with momentum 0.9 and batch size 128; the remaining defaults (number of neighbors $k$, attention temperature, reward weight $\lambda$, learning rates, weight decay, number of RL steps $T$, and discount factor $\gamma$) are given in (Heidari et al., 25 Nov 2025).
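For orientation, a minimal NumPy sketch of one cleaning epoch that composes the Section 1 helpers (`knn_attention_soft_labels`, `transition`) with an arbitrary composite reward function (see the Section 5 sketch). The corruption fraction, sampling scheme, and elided actor–critic updates are illustrative assumptions.

```python
import numpy as np

def randomly_corrupt(labels, frac=0.1, seed=0):
    """Form s_0: re-corrupt a random fraction of labels to random classes (sketch)."""
    rng = np.random.default_rng(seed)
    labels = labels.copy().astype(float)
    n, c = labels.shape
    idx = rng.choice(n, size=int(frac * n), replace=False)
    labels[idx] = np.eye(c)[rng.integers(0, c, size=len(idx))]
    return labels

def rl_epoch(features, labels, reward_fn, n_steps=5, seed=0):
    """One epoch of the label-cleaning loop (sketch). Assumes the Section 1
    helpers are in scope; reward_fn is any composite reward (Section 5 sketch)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        soft = knn_attention_soft_labels(features, labels)               # k-NN attention labels
        probs = np.clip(0.5 * np.abs(soft - labels).sum(axis=1), 0, 1)   # correction probabilities
        actions = (rng.random(len(labels)) < probs).astype(int)          # sample a_t
        new_labels = transition(labels, actions, soft)                   # deterministic s_{t+1}
        _ = reward_fn(new_labels, soft, actions)                         # LCR + NLA reward
        # actor / critic parameter updates (Section 2 sketch) would be applied here
        labels = new_labels
    return labels
```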
4. Empirical Evaluation and Results
RLNLC has been benchmarked across both synthetic and real noisy label scenarios:
| Dataset | Noise Type / Rate | RLNLC Accuracy | Best Prior Baseline | Baseline Accuracy |
|---|---|---|---|---|
| CIFAR10-IDN | 50% instance-dependent | 95.8% | SSR | 94.1% |
| CIFAR100-IDN | 50% instance-dependent | 74.7% | SSR | 72.8% |
| Animal-10N | ≈8% real | 90.2% | SURE | 89.0% |
| Food-101N | ≈20% web | 89.2% | LongReMix | 87.3% |
| CIFAR100 | 90% symmetric | 44.2% | DivideMix | 31.0% |
| CIFAR10 | 90% symmetric | 82.1% | DivideMix | 75.4% |
- Datasets include CIFAR10-IDN, CIFAR100-IDN (instance-dependent noise), Animal-10N (real noise), Food-101N (web noise), and class-conditional symmetric noise scenarios.
- RLNLC outperforms standard baselines such as CE, DivideMix, SSR, LongReMix, SURE, Decoupling, Co-teaching(+), MentorNet, CausalNL, CleanNet, PLC, and others.
- Ablation studies show both the LCR and NLA rewards are critical, each accounting for 2–3pp of accuracy when removed; removing either the initial-state randomization or the frozen feature space used for reward computation costs a further ≈1–2pp.
5. Reward Design and k-NN Attention Mechanism
RLNLC's reward structure uses k-NN–based attention both for the dynamic label updates and for reward evaluation:
- k-NN attention: For each sample $x_i$, a soft label is formed by aggregating the label vectors of its $k$ nearest neighbors $\mathcal{N}_k(i)$ via attention weighting, $\hat{y}_i = \sum_{j \in \mathcal{N}_k(i)} \alpha_{ij}\, \tilde{y}_j$ with $\alpha_{ij} = \frac{\exp\!\big(\mathrm{sim}(f(x_i), f(x_j))/\tau\big)}{\sum_{j' \in \mathcal{N}_k(i)} \exp\!\big(\mathrm{sim}(f(x_i), f(x_{j'}))/\tau\big)}$, where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is the attention temperature.
- Label Consistency: Negative KL-divergence between updated and attention-averaged labels measures label smoothness after correction.
- Noisy Label Alignment: For corrected labels, the KL-divergence to “clean” label aggregations encourages newly assigned soft labels to conform to high-confidence exemplars.
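The following NumPy sketch spells out one plausible reading of the two reward terms and their combination; the KL directions, the fallback for the “clean” aggregate, and the exact exponential form are assumptions made for illustration, not the paper's formulas.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """Row-wise KL(p || q) for matrices of probability vectors."""
    p, q = p + eps, q + eps
    return (p * np.log(p / q)).sum(axis=1)

def compute_reward(new_labels, knn_soft, actions, clean_soft=None, lam=1.0):
    """Composite reward sketch.

    new_labels: (N, C) labels after the transition
    knn_soft:   (N, C) attention-aggregated k-NN soft labels (frozen feature space)
    actions:    (N,)   1 where the label was corrected, 0 otherwise
    clean_soft: (N, C) k-NN aggregation restricted to 'clean' (uncorrected) points;
                falls back to knn_soft if not supplied.
    """
    clean_soft = knn_soft if clean_soft is None else clean_soft

    # Label Consistency Reward: expected negative KL between each updated label
    # and its neighborhood aggregate -> rewards locally smooth label fields.
    lcr = -kl(new_labels, knn_soft).mean()

    # Noisy Label Alignment: for corrected labels only, negative KL to the
    # aggregate over clean neighbors -> pulls new soft labels toward confident points.
    corrected = actions.astype(bool)
    nla = -kl(new_labels[corrected], clean_soft[corrected]).mean() if corrected.any() else 0.0

    # Final reward: exponential combination with lambda weighting the NLA term
    # (one plausible reading of the description; the paper may combine differently).
    return float(np.exp(lcr + lam * nla))
```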
6. Comparative Analysis and Ablations
Comprehensive ablation experiments indicate:
- Both LCR and NLA reward terms are crucial; removing either results in 2–3 percentage point test accuracy drops on CIFAR100-IDN.
- Initial-state randomization and keeping a fixed (frozen) feature extractor for reward computation are each incrementally beneficial (~1–2pp).
- The method’s design—large state space, k-NN attention, and hybrid exploration via label re-corruption—provides systematic improvements under diverse, high-noise conditions.
7. Implementation Summary and Reproducibility
The RLNLC framework is fully specified via MDP components, policy/critic architectures, training loops, and hyperparameter choices, supporting complete reproducibility (Heidari et al., 25 Nov 2025). The core implementation consists of:
- Pre-training, RL actor–critic optimization, deployment for label cleaning, and final supervised fine-tuning.
- All essential mathematical expressions, pseudocode, and procedural steps are delineated for faithful re-implementation.
- Evaluation protocols, baseline comparisons, and ablation structures are aligned with prevailing benchmarks for noisy-label learning research.
The formalization of noisy label correction as an actor-critic reinforcement learning problem, with explicit MDP construction, composite reward mechanisms, and the integration of k-NN attention over learned representations, distinguishes RLNLC within the landscape of label-noise robust learning. Performance across challenging real and synthetic settings, coupled with thorough ablations, underscores the method’s empirical and methodological contributions (Heidari et al., 25 Nov 2025).