Cleaning Automatically via Self-Training (CAST)

Updated 17 June 2026
  • CAST applies self-training with adaptive filtering to progressively clean noisy large-scale data and improve model quality.
  • CAST's methods, including class-adaptive re-sampling, cluster-based filtering, and dual-branch noise modeling, yield reported gains in F1 scores and verification accuracy.
  • The framework mitigates confirmation bias and controls label noise across diverse scenarios, benefiting relation extraction, face recognition, sequence labeling, and reinforcement learning.

Cleaning Automatically utilizing Self-Training (CAST) is a general paradigm for denoising and refining large-scale training data through model-in-the-loop iterative self-labeling and adaptive filtering. CAST has emerged as a practical solution for noisy supervision in domains characterized by incomplete annotation, label noise, or contamination arising from web-scale data collection. Its core principle is to combine self-training with class- or cluster-adaptive mechanisms that selectively retain, correct, or discard pseudo-labeled instances, producing progressively cleaner, higher-utility training sets for subsequent learning rounds.

1. Conceptual Motivation and Problem Settings

CAST addresses both closed-world and open-world learning scenarios where the available supervision is systematically deficient or corrupted.

  • Incomplete annotation: In relation extraction (RE), datasets such as DocRED and ChemDisGene contain many false negatives, that is, valid relations that were missed during annotation and are therefore labeled "no_relation". Training under a closed-world assumption yields high precision for observed classes but extremely poor recall, especially for minority classes (Tan et al., 2023).
  • Label corruption or web noise: In unconstrained visual recognition settings (e.g., face identity datasets crawled from the internet), folder-level errors, duplicates, and identity collisions introduce severe noise. For example, the queried folder labels in WebFace260M are often incorrect, so a corpus of this size requires scalable automatic filtering (Zhu et al., 2021).
  • Reinforcement learning and agentic LLMs: Exploration policies generate error-contaminated trajectories. Retaining failed interaction segments naively can hinder credit assignment and overfit agents to error-recovery behaviors (Xu et al., 21 Jan 2026).
  • Low-resource self-training: In sequence labeling, a seed model's prediction on large unlabeled pools can propagate unreliable annotations, destabilizing training if noisy pseudo-labels are merged indiscriminately (Paul et al., 2019).

The core challenge motivating CAST is to leverage model-side predictions for label or instance curation while avoiding confirmation bias, majority-class domination, or the compounding of systematic labeling errors.

2. Core CAST Methodologies

CAST instantiates an iterative teacher–student self-training loop augmented by instance-level or class-adaptive denoising.

2.1. Relation Extraction—Class-Adaptive Re-Sampling

After pseudo-label generation, the following quantities are computed for each class $c$:

  • Precision $p_c = \frac{|\{\text{pred}=c\} \cap \{\text{gold}=c\}|}{|\{\text{pred}=c\}|}$
  • Recall $r_c = \frac{|\{\text{pred}=c\} \cap \{\text{gold}=c\}|}{|\{\text{gold}=c\}|}$
  • Sampling weight $w_c = (p_c \cdot (1 - r_c))^{\beta}$

Pseudo-labels of class $c$ are retained with probability $w_c$, which favors classes with high precision but low recall. This mechanism avoids reinforcing low-quality pseudo-labels and controls recall growth (Tan et al., 2023).
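The weighting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes single-label predictions on a labeled dev set, and the function names `class_weights` and `resample` as well as the toy labels are hypothetical.

```python
import random
from collections import Counter

def class_weights(pred, gold, beta=1.0):
    """Per-class weight w_c = (p_c * (1 - r_c)) ** beta from dev predictions."""
    tp = Counter(p for p, g in zip(pred, gold) if p == g)  # true positives per class
    n_pred = Counter(pred)
    n_gold = Counter(gold)
    weights = {}
    for c in set(n_pred) | set(n_gold):
        p_c = tp[c] / n_pred[c] if n_pred[c] else 0.0
        r_c = tp[c] / n_gold[c] if n_gold[c] else 0.0
        weights[c] = (p_c * (1.0 - r_c)) ** beta
    return weights

def resample(pseudo_labels, weights, rng):
    """Keep each (instance, label) pair with probability w_label."""
    return [(x, c) for x, c in pseudo_labels if rng.random() < weights.get(c, 0.0)]

# Toy dev set (assumed data): class A is precise but under-recalled.
dev_gold = ["A", "A", "A", "A", "B", "B"]
dev_pred = ["A", "NA", "NA", "NA", "B", "B"]
w = class_weights(dev_pred, dev_gold, beta=1.0)

# Pseudo-labels of class A survive with probability 0.75; class B is already
# fully recalled on dev, so its weight is 0 and its pseudo-labels are dropped.
kept = resample([(i, "A") for i in range(100)] + [(i, "B") for i in range(100)],
                w, random.Random(0))
```

Because both $p_c$ and $1 - r_c$ lie in $[0, 1]$, the weight is always a valid probability, and a class that is already fully recalled on the dev set contributes no further pseudo-labels.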

2.2. Large-Scale Face Data Cleaning—Cluster- and Center-Based Filtering

The CAST pipeline for face recognition is cluster- and centroid-adaptive:

  • Intra-class: DBSCAN is run on the face embeddings of each folder, and only the largest cluster in each folder is retained.
  • Inter-class: Folders with highly similar centers are merged; ambiguous or impure folders are discarded based on cosine center similarity thresholds.
  • Iterative refinement: The cleaned data train a new model, which becomes the next iteration's teacher, and the procedure is repeated for several rounds (Zhu et al., 2021).
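The intra-class and inter-class steps above can be sketched as follows. This is a toy, dependency-free illustration rather than the published pipeline: it assumes cosine distance for DBSCAN, uses 2-D vectors in place of face embeddings, and the helper names (`largest_cluster`, `folder_action`) are hypothetical. The default thresholds mirror the values quoted later in this article.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over cosine distance; returns a label per point (-1 = noise)."""
    n = len(points)
    neighbors = [[j for j in range(n) if 1.0 - cos_sim(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels

def largest_cluster(embeddings, eps=0.5, min_pts=2):
    """Intra-class step: keep only members of the biggest cluster in a folder."""
    labels = dbscan(embeddings, eps, min_pts)
    clustered = [l for l in labels if l != -1]
    if not clustered:
        return []
    best = max(set(clustered), key=clustered.count)
    return [e for e, l in zip(embeddings, labels) if l == best]

def folder_action(center_a, center_b, merge_thr=0.70, ambiguous_low=0.5):
    """Inter-class step: decide what to do with two folders given their centers."""
    s = cos_sim(center_a, center_b)
    if s >= merge_thr:
        return "merge"
    if s >= ambiguous_low:
        return "discard"
    return "keep"

folder = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]  # last point is an outlier
cleaned = largest_cluster(folder)
```

At million-folder scale, the all-pairs neighbor search used here would be replaced by an approximate nearest-neighbor index; the quadratic loop is kept only for readability.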

2.3. Sequence Labeling—Split Noise Channel Modeling

Clean and noisy labels are handled via dual branches, with a learned label transition matrix $T$ that models per-class corruption, trained by EM. The approach isolates human-annotated data from self-labeled predictions, avoiding contamination of reference-quality supervision (Paul et al., 2019).
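A generic noise-channel formulation helps make the role of $T$ concrete. The sketch below assumes the standard setup in which $T[i][j]$ is the probability that true class $i$ is observed as noisy label $j$; it shows the noisy-branch output and one EM update of $T$, and is not claimed to match the exact procedure of Paul et al. The function names and the toy numbers are illustrative.

```python
def noisy_distribution(clean_probs, T):
    """Noisy-branch output: p(noisy=j | x) = sum_i p(clean=i | x) * T[i][j]."""
    k = len(T)
    return [sum(clean_probs[i] * T[i][j] for i in range(k)) for j in range(k)]

def e_step(clean_probs, noisy_label, T):
    """Posterior over the true class given the observed noisy label."""
    joint = [clean_probs[i] * T[i][noisy_label] for i in range(len(T))]
    z = sum(joint)
    return [v / z for v in joint]

def m_step(posteriors, noisy_labels, k):
    """Re-estimate T from soft counts of (true class, noisy label) pairs."""
    counts = [[0.0] * k for _ in range(k)]
    for q, j in zip(posteriors, noisy_labels):
        for i in range(k):
            counts[i][j] += q[i]
    # Rows with no mass fall back to a uniform distribution.
    return [[c / sum(row) if sum(row) else 1.0 / k for c in row] for row in counts]

T = [[0.9, 0.1],
     [0.2, 0.8]]
model_probs = [[0.7, 0.3], [0.1, 0.9], [0.6, 0.4]]  # classifier outputs (toy values)
noisy_labels = [1, 1, 0]                             # pseudo-labels for the same inputs

noisy_out = noisy_distribution(model_probs[0], T)
posteriors = [e_step(p, y, T) for p, y in zip(model_probs, noisy_labels)]
T_new = m_step(posteriors, noisy_labels, k=2)
```

Only the noisy branch passes through $T$; clean examples are scored directly against the classifier output, which is how the two sources of supervision stay separated.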

2.4. RL Trajectory Cleaning—Similarity-Aware Adaptive Rollback (SAAR)

For agentic RL, CAST is instantiated as trajectory purification by:

  • Detecting failed turns and auto-invoking a self-correction loop (lookahead).
  • Computing code similarity via normalized longest common subsequence (LCS) length, $S(c_t, c_t')$.
  • For $S \geq \gamma$, only the action is updated (shallow repair); for $S < \gamma$, the entire turn is replaced (deep repair).
  • Only purified, success-containing segments are retained for policy optimization (Xu et al., 21 Jan 2026).
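The similarity test that selects between shallow and deep repair can be sketched as below. Two details are assumptions made for illustration: the LCS is computed over characters, and its length is normalized by the longer of the two strings. The source only states that the similarity is a normalized LCS length, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(code_a, code_b):
    """LCS length divided by the longer input length (assumed normalization)."""
    longest = max(len(code_a), len(code_b))
    return lcs_length(code_a, code_b) / longest if longest else 1.0

def repair_mode(failed_code, fixed_code, gamma=0.5):
    """Shallow repair when the fix stays close to the failed code, else deep repair."""
    return "shallow" if similarity(failed_code, fixed_code) >= gamma else "deep"

small_fix = repair_mode("print(x+1)", "print(x + 1)")
rewrite = repair_mode("print(x)", "import numpy as np\nnp.linalg.solve(A, b)")
```

A whitespace-only correction keeps most of the original characters and is classed as shallow, while a rewrite that shares little with the failed attempt falls below $\gamma$ and triggers replacement of the whole turn.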

3. Formal Algorithms and Pseudocode

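The representative algorithmic structure of each variant follows the steps described in Section 2:

  • Relation extraction (Tan et al., 2023): In each round, teacher models are trained across the data folds and used to generate pseudo-labels; $p_c$ and $r_c$ are estimated on the dev set; pseudo-labels of class $c$ are re-sampled with probability $w_c$; and the retained labels are merged into the training set for the next round.
  • WebFace260M cleaning (Zhu et al., 2021): The current teacher embeds all faces; the largest DBSCAN cluster in each folder is kept; folders are merged or discarded according to the cosine similarity of their centers; and a new model trained on the cleaned data becomes the next teacher.
  • RL self-purification (Xu et al., 21 Jan 2026): On a failed turn, a self-correction loop is invoked; $S(c_t, c_t')$ is computed; shallow repair is applied if $S \geq \gamma$ and deep repair otherwise; and only purified, success-containing segments are kept for policy optimization.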
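As a rough guide to the shared control flow, the teacher–student loop common to these variants can be written as a generic skeleton. Everything here is a hypothetical illustration: the callables `train`, `predict`, and `keep` stand in for a real model, inference routine, and adaptive filter, fold-based training is omitted, and the toy task is a one-dimensional threshold classifier.

```python
def self_training_rounds(labeled, unlabeled, train, predict, keep, rounds=5):
    """Generic loop: pseudo-label, filter, retrain; the student becomes the next teacher."""
    data = list(labeled)
    model = train(data)
    for _ in range(rounds):
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        # Trusted labels are always kept; pseudo-labels must pass the filter.
        data = list(labeled) + [pair for pair in pseudo if keep(model, pair)]
        model = train(data)
    return model, data

# Toy instantiation: the "model" is a decision threshold on a 1-D feature.
def train(data):
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0

def predict(threshold, x):
    return int(x > threshold)

def keep(threshold, pair):
    x, _ = pair
    return abs(x - threshold) >= 1.0   # drop pseudo-labels near the boundary

labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [2.0, 4.9, 5.2, 8.0]
model, data = self_training_rounds(labeled, unlabeled, train, predict, keep)
```

In this toy run the two points close to the decision boundary are never admitted, which is the behavior the adaptive filters in each CAST variant are meant to provide in a task-specific way.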

4. Empirical Results and Benchmarks

4.1. Relation Extraction

CAST yielded:

  • On Re-DocRED (trained on original DocRED), ATLOP+RoBERTa baseline F1: 49.32; CAST (β=1.0) F1: 65.32 (+15.39).
  • On ChemDisGene, PubMedBERT baseline F1: 42.73; CAST F1: 54.03 (+11.30) (Tan et al., 2023).

4.2. Large-Scale Face Recognition

WebFace260M → WebFace42M cleaning:

  • Raw: 260 M faces, 4.0 M IDs.
  • After CAST: 42.47 M faces, 2.06 M IDs, noise rate <10%.
  • IJB-C 1:1 verification: MS1MV2 model 96.03%, WebFace42M model 97.70% (∼40% relative error reduction) (Zhu et al., 2021).
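The relative error reduction quoted above can be checked from the two verification accuracies, taking 100% minus accuracy as the error rate:

```latex
\[
\frac{(100 - 96.03) - (100 - 97.70)}{100 - 96.03}
  = \frac{3.97 - 2.30}{3.97}
  = \frac{1.67}{3.97}
  \approx 0.42
\]
```

This is consistent with the stated figure of roughly 40%.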

4.3. Low-Resource Sequence Labeling

On Chunking and NER (micro-F₁):

| Model | Chunking F₁ (190K noisy) | NER F₁ (190K noisy) |
|---|---|---|
| NN | 91.5 | 61.7 |
| NLNN | 91.4 | 61.5 |
| CNLNN (CAST) | 92.0 | 62.1 |
| MTL+CNLNN | 92.7 | 64.2 |

(Paul et al., 2019)

4.4. RL Policy Optimization

CLEANER's SAAR yields:

  • AIME24 pass@1: 66.7% → 72.7% (+6.0)
  • GPQA: 56.9% → 60.2% (+3.3)
  • Training steps to reach state-of-the-art accuracy: 250 for CLEANER versus 750 for the baseline (3× fewer steps) (Xu et al., 21 Jan 2026).

5. Analytical Insights and Trade-Offs

  • Confirmation bias mitigation: Class-adaptive weights or explicit noise modeling prevent runaway error propagation.
  • Long-tail/majority impact: CAST's selective sampling (as in RE) recovers recall on minority classes without severe precision trade-off (Tan et al., 2023).
  • Computational burden: CAST multiplies training cost by the number of folds × rounds, or requires nontrivial cluster analysis (face cleaning), or incurs extra inference (RL self-correction). All variants are amenable to large-scale parallelization.
  • Instance isolation: Dual-branch modeling (sequence labeling) protects clean data from noisy pseudo-label drift (Paul et al., 2019).
  • Partial coverage limitation: Classes with zero precision (no positives in dev) may remain unrecoverable under current CAST instantiations (Tan et al., 2023).
  • Robustness transfer: In RL, purified trajectories foster learning of correct reasoning chains rather than error-handling loops (Xu et al., 21 Jan 2026).

6. Implementation Strategies and Hyperparameter Choices

  • Relation extraction: β≈1.0 balances recall against precision, with N=5 folds and M=5 rounds. Dev-set F1 is used for early stopping (Tan et al., 2023).
  • Face cleaning: DBSCAN with eps=0.5 and minPts=2; an inter-class merge threshold of 0.70 and an ambiguous-folder filter range of [0.5, 0.7] on center cosine similarity; ResNet-100 ArcFace backbone (Zhu et al., 2021).
  • Sequence labeling: EM-estimated label transition matrix, with or without POS auxiliary multitask regularization. Gradient updates alternate between branches (Paul et al., 2019).
  • RL cleaning: SAAR similarity γ=0.5; retry limit K=3; curriculum of 70% cleaned, 30% raw trajectories (Xu et al., 21 Jan 2026).

Best practice combines class- or cluster-specific filtering, iterative teacher–student retraining, and performance statistics anchored to a held-out dev set.

7. Limitations and Prospective Extensions

  • Annotation dependency: All CAST variants depend on a small, high-quality reference set for class-wise validation; its quality constrains filter reliability.
  • Extreme rarity handling: Classes with no examples in reference/dev cannot be recovered unless special positive-weight floors or targeted semi-supervision are introduced.
  • Extension targets: CAST can be further generalized to other information extraction tasks (NER, event extraction), low-resource domains, and multimodal noisy data.
  • Parameter adaptivity: Automatic or meta-learned adaptation of hyperparameters (β, filtering thresholds) is a promising future direction.
  • Curriculum scheduling: Dynamic adjustment of filtering strictness, e.g., with high β decaying toward zero as self-training confidence increases, to foster more aggressive label growth as models mature (Tan et al., 2023).
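The effect of a decaying β on the relation-extraction weight $w_c = (p_c \cdot (1 - r_c))^{\beta}$ can be illustrated numerically. The linear schedule, the starting value of 2.0, and the example precision and recall below are assumptions chosen for illustration; the source proposes the idea without fixing a schedule.

```python
def retention_weight(p_c, r_c, beta):
    """Sampling weight w_c = (p_c * (1 - r_c)) ** beta."""
    return (p_c * (1.0 - r_c)) ** beta

def linear_beta(round_idx, total_rounds, beta_start=2.0):
    """Hypothetical schedule: decay linearly from beta_start to 0 at the last round."""
    if total_rounds <= 1:
        return beta_start
    return beta_start * (1.0 - round_idx / (total_rounds - 1))

total_rounds = 5
betas = [linear_beta(i, total_rounds) for i in range(total_rounds)]
# Example class with precision 0.8 and recall 0.5 (assumed values).
weights = [retention_weight(0.8, 0.5, b) for b in betas]

for b, w in zip(betas, weights):
    print(f"beta={b:.1f}  w_c={w:.3f}")
```

As β shrinks, the retention probability for this class rises from 0.16 to 1.0, so later rounds admit more pseudo-labels. One caveat of this particular formula is that at β = 0 every class is fully retained regardless of its precision, so a practical schedule might stop decaying before reaching zero.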

CAST formalizes automated data cleaning by leveraging model predictions for adaptive instance selection or content purification. Its principled application produces demonstrable gains in recall, F₁, dataset purity, and training efficiency across diverse domains, with mechanisms tailored for class imbalance, clustering noise, or structured prediction contexts (Tan et al., 2023, Paul et al., 2019, Zhu et al., 2021, Xu et al., 21 Jan 2026).
