Self-Training with Classifier Disagreement
- SCD is a semi-supervised method that uses ensemble model disagreements to generate reliable pseudo-labels for improved robustness.
- It enforces persistent classifier diversity through selective sample mining and retraining strategies, ensuring convergence under noisy conditions.
- Empirical and theoretical findings show that SCD outperforms conventional self-training in handling spurious correlations, label noise, and domain shifts.
Self-Training with Classifier Disagreement (SCD) refers to a family of semi-supervised and robust learning methods that leverage an ensemble of predictive models whose disagreements on unlabeled data are used to drive learning, support model selection, and improve generalization—especially in the presence of noisy labels, domain shift, or spurious correlations. Originating from the co-training paradigm and consolidated by modern ensemble strategies, SCD achieves principled improvement over naive self-training by enforcing a persistent diversity among models, guiding sample selection and label trust based on inter-classifier disagreement, and providing theoretical guarantees under certain data-generating assumptions.
1. Formal Framework and Algorithmic Primitives
SCD algorithms operate on a set of labeled examples and a (typically much larger) collection of unlabeled data. Let X denote the input space and Y the label set, with n i.i.d. labeled samples and m i.i.d. unlabeled samples. The machinery involves K classifiers or prediction heads (possibly sharing a feature backbone) with distinct initializations or inductive biases, which provide model diversity.
The core cycle involves alternating between:
- Pseudo-labeling using ensembles: Candidate pseudo-labels are generated for unlabeled data points; the degree of (dis)agreement between the classifiers is quantified, typically via the pattern of predictions the heads produce on each unlabeled point.
- Selective sample mining: The algorithm selectively chooses unlabeled examples for further training, often focusing on samples exhibiting maximal disagreement (or, in multi-head settings, controlled via a target mix-rate measuring the expected fraction of disagreement patterns).
- Ensemble retraining: Each classifier is retrained on the union of labeled data plus selected pseudo-labeled examples—sometimes with decoupled data flows to maintain model diversity.
Algorithmic variants (e.g., ACE (Daniels et al., 9 Sep 2025), Co-teaching+ (Yu et al., 2019), DMT (Feng et al., 2020)) differ in their definition of disagreement, pseudo-label weighting/selection, and update rules.
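The core cycle can be sketched in simplified Python. This is a minimal illustration, not any specific published variant: the toy nearest-centroid classifier, majority-vote pseudo-labeling, and bootstrap retraining stand in for the variant-specific disagreement, selection, and diversity rules described above.

```python
import numpy as np

rng = np.random.default_rng(0)

class CentroidClassifier:
    """Toy classifier: predicts the label of the nearest class centroid."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=-1)
        return self.classes_[d.argmin(axis=1)]

def scd_round(clfs, X_lab, y_lab, X_unl, max_added=50):
    """One SCD cycle: ensemble pseudo-labeling, disagreement mining, retraining."""
    preds = np.stack([c.predict(X_unl) for c in clfs])   # (n_heads, n_unl)
    disagree = (preds != preds[0]).any(axis=0)           # True where heads differ
    mined = np.flatnonzero(disagree)[:max_added]         # selective sample mining
    if mined.size == 0:                                  # diversity exhausted
        return clfs, 0.0
    # Majority-vote pseudo-labels on mined points (a simplifying assumption;
    # real variants use confidence- or pattern-based rules instead).
    pseudo = np.array([np.bincount(preds[:, i]).argmax() for i in mined])
    X_aug = np.vstack([X_lab, X_unl[mined]])
    y_aug = np.concatenate([y_lab, pseudo])
    # Retrain each head on a distinct bootstrap to preserve diversity.
    for c in clfs:
        boot = rng.integers(0, len(X_aug), len(X_aug))
        c.fit(X_aug[boot], y_aug[boot])
    return clfs, float(disagree.mean())

# Toy data: two overlapping Gaussian blobs.
X_lab = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y_lab = np.array([0] * 20 + [1] * 20)
X_unl = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])

clfs = [CentroidClassifier().fit(X_lab[b], y_lab[b])
        for b in (rng.integers(0, 40, 40) for _ in range(3))]
for t in range(3):
    clfs, rate = scd_round(clfs, X_lab, y_lab, X_unl)
    print(f"round {t}: disagreement rate = {rate:.3f}")
```

Decoupled data flows (each head trained on examples mined by its peers, as in Co-teaching+) would replace the shared bootstrap step in a faithful implementation.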
This structure generalizes to more heads and more complex selection heuristics (Daniels et al., 9 Sep 2025, Odonnat et al., 2023).
2. Theoretical Guarantees and Proper Scoring Conditions
The theoretical foundation for SCD derives from its ability to formally bound error rates and ensure consistent learning, provided initial model diversity is maintained. Key results include:
- Proper scoring rule: For ACE (Daniels et al., 9 Sep 2025), the self-training top-k disagreement loss is a proper scoring rule provided the enforced lower bound on the mix-rate (the fraction of target points exhibiting model disagreement) does not exceed the true target mix-rate. For other disagreement-based losses (e.g., DivDis, D-BAT), propriety is only achieved under stronger or more restrictive assumptions on the data distribution.
- Improvement over self-training: Theoretical analyses (see (Wang et al., 2017)) establish that, under sufficient initial classifier disagreement and mutually informative disagreement sets, the true error of each classifier decreases iteratively. The extent of gain is upper-bounded by a function of the cumulative “useful” disagreement and initial classifier quality.
- Convergence and diversity: As mutual retraining proceeds, both error rates and disagreement metrics converge—ultimately degenerating to single-model behavior if disagreement vanishes. Diversity-inducing mechanisms are thus pivotal for sustained benefit.
- Robustness to label and selection noise: In label noise scenarios, bounds analogous to those in the classical Angluin–Laird model are achieved by adapting sample complexity requirements proportionally to the noise rate (Wang et al., 2017, Yu et al., 2019).
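The mix-rate and disagreement metrics referenced above are estimable from predictions alone, which is what makes them usable for sample selection and unsupervised model selection. A minimal sketch (function names are illustrative):

```python
import numpy as np

def mix_rate(preds):
    """Fraction of points on which the ensemble's heads disagree.

    preds: integer array of shape (n_heads, n_points).
    """
    return float((preds != preds[0]).any(axis=0).mean())

def pattern_counts(preds):
    """Histogram of joint prediction patterns across heads."""
    patterns, counts = np.unique(preds.T, axis=0, return_counts=True)
    return {tuple(p): int(c) for p, c in zip(patterns, counts)}

preds = np.array([[0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
print(mix_rate(preds))        # 2 of 5 points have disagreeing heads -> 0.4
print(pattern_counts(preds))
```

A vanishing estimated mix-rate signals the degeneration to single-model behavior discussed above.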
3. Instantiations and Variations
Several algorithms instantiate the SCD principle with distinct technical choices:
| Algorithm | Disagreement Strategy | Sample Selection |
|---|---|---|
| ACE (Daniels et al., 9 Sep 2025) | Confident, selective disagreement (top-k among disagreement patterns, controlled by mix-rate) | High-probability pseudo-labels for each disagreement pattern |
| Co-teaching+ (Yu et al., 2019) | Retain only batch samples where two classifiers disagree; each updates on small-loss examples from disagreements picked by its peer | Dynamic small-loss fraction per epoch to adapt to label noise |
| DMT (Feng et al., 2020) | Iterative mutual pseudo-labeling; loss weights decrease with disagreement | Dynamic loss weighting, three-case rule based on confidence/disagreement |
| T-similarity (Odonnat et al., 2023) | Ensemble diversity via an explicit agreement measure among multiple prediction heads; used to determine sample inclusion | Policy-based thresholding or curriculum on the similarity score |
| SCD for Domain Adaptation (Sun et al., 2023) | Teacher-student disagreement partition of target data; self-training on the disagreement set | Pseudo-label loss weighted higher on disagreements, lower on agreements |
Key distinctions lie in whether sample mining emphasizes maximal disagreement, high-confidence disagreement, or a balanced mixture; and in whether loss/statistical weights are assigned differently to agreement/disagreement sets.
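As one concrete illustration of the weighting side, a three-case rule in the spirit of DMT can be written as a small function; the thresholds and weights below are illustrative stand-ins, not the paper's exact values:

```python
def pseudo_label_weight(teacher_label, student_label, teacher_conf, tau=0.9):
    """Three-case weighting in the spirit of DMT (values are illustrative):
    full weight on agreement, reduced weight on confident disagreement,
    zero weight on low-confidence disagreement."""
    if teacher_label == student_label:
        return 1.0   # agreement: trust the pseudo-label fully
    if teacher_conf >= tau:
        return 0.3   # confident disagreement: keep, but down-weight
    return 0.0       # low-confidence disagreement: discard

print(pseudo_label_weight(1, 1, 0.5))   # 1.0
print(pseudo_label_weight(1, 0, 0.95))  # 0.3
print(pseudo_label_weight(1, 0, 0.5))   # 0.0
```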
4. Empirical Results and Application Domains
Empirical findings across a range of tasks consistently demonstrate that SCD-based methods outperform conventional self-training and “agreement-only” ensemble techniques, particularly in settings with:
- Spurious Correlations and Underspecification: On benchmarks with complete spurious correlation (e.g., Waterbirds-CC, CelebA-CC), ACE matches or exceeds DivDis, D-BAT, and standard ERM when the enforced mix-rate lower bound matches the true target rate, and degrades gracefully otherwise (Daniels et al., 9 Sep 2025).
- Noisy Labels: Co-teaching+ (SCD) achieves superior accuracy on highly corrupted MNIST, CIFAR-10/100, and Tiny-ImageNet, especially at high noise rates and under open-set contamination (Yu et al., 2019).
- Semi-supervised Learning under Selection Bias: T-similarity-based pseudo-labeling achieves robustness to sample selection bias and improves calibration over softmax confidence, yielding double-digit absolute accuracy improvements when the labeled/unlabeled distributions diverge (Odonnat et al., 2023).
- Domain Adaptation: In cross-domain opinion target extraction, SCD-driven self-training substantially outperforms domain-adversarial feature alignment and mean-teacher SSL, with larger gains when source and target domains are far apart (Sun et al., 2023).
Ablation studies confirm that the disagreement-based filtering, diversity induction, and selective pseudo-labeling are critical; self-training without ongoing disagreement enforcement yields marked regressions in both accuracy and calibration (Yu et al., 2019, Odonnat et al., 2023, Daniels et al., 9 Sep 2025).
5. Practical Considerations and Implementation
Robust application of SCD methods entails several practical guidelines:
- Maintaining Model Diversity: Initialize classifiers with distinct parameters, potentially use different architectures or initial seeds, and enforce disagreement via explicit diversity-promoting terms during training (e.g., negative inner products, cross-update rules).
- Setting the Mix-Rate: The mix-rate lower bound should reflect either domain knowledge or an adaptive estimate; schedules that gradually raise the bound avoid catastrophic loss when true disagreement is sparse (Daniels et al., 9 Sep 2025).
- Loss Weighting and Selection Policies: Policy selection for pseudo-label thresholds or weighting functions critically affects downstream performance under distribution shift or class imbalance (Odonnat et al., 2023, Feng et al., 2020).
- Ensemble Size and Computational Cost: Most practical variants use only a few classifiers (often just two) for tractability, and ensemble-based methods (e.g., T-similarity) show that small ensembles suffice for robust disagreement estimation (Odonnat et al., 2023).
- Model Selection: For ACE and related approaches, checkpoint selection based on the combined validation loss on source and target (without using target labels) correlates strongly with true generalization, while for alternatives like DivDis or D-BAT, such unsupervised criteria can fail when mix-rate assumptions are violated (Daniels et al., 9 Sep 2025).
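The diversity-promoting terms mentioned above can be as simple as penalizing the inner product of the heads' predictive distributions on unlabeled data; a numpy sketch under that assumption (the exact regularizer varies by method):

```python
import numpy as np

def diversity_penalty(probs):
    """Mean pairwise inner product of head predictive distributions.

    probs: array of shape (n_heads, n_points, n_classes). Adding this term
    to the loss pushes heads to place their probability mass on different
    classes for the same inputs, sustaining disagreement.
    """
    k = probs.shape[0]
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += float((probs[i] * probs[j]).sum(axis=-1).mean())
            pairs += 1
    return total / pairs

# Two heads that agree perfectly incur the maximal penalty ...
agree = np.tile([[1.0, 0.0]], (5, 1))
print(diversity_penalty(np.stack([agree, agree])))     # 1.0
# ... while heads concentrated on different classes incur none.
disagree = np.tile([[0.0, 1.0]], (5, 1))
print(diversity_penalty(np.stack([agree, disagree])))  # 0.0
```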
6. Theoretical and Practical Limitations
SCD methods presuppose sufficient initial classifier diversity—if models are aligned ab initio (e.g., via overparametrization, massive data augmentation), the useful disagreement may quickly vanish, after which SCD degenerates to standard self-training and loses its advantages (Wang et al., 2017, Yu et al., 2019). Conversely, if the enforced mix-rate for disagreement is set much higher than the true rate on the target domain, or if the task admits little classifier diversity (e.g., low Bayes error, highly redundant representation), the method may misallocate training effort and degrade performance (Daniels et al., 9 Sep 2025).
A further limitation is computational: maintaining multiple classifiers (or ensemble heads), explicit diversity terms, and selective mining increases both memory and runtime overhead relative to single-model self-training. Nevertheless, results indicate that small, properly regularized ensembles suffice for practical deployment (Odonnat et al., 2023).
7. Connections, Scope, and Impact
SCD generalizes and connects several lines of research in semi-supervised learning, robust learning under label noise, domain adaptation, and ensemble calibration. It bridges classical co-training (with strong multiview assumptions) to modern learning-theoretic and deep learning contexts where only inductive diversity is feasible, and extends the principle of learning from disagreement into the selection, loss weighting, and sample mining mechanisms at the heart of robust self-training (Wang et al., 2017, Yu et al., 2019, Daniels et al., 9 Sep 2025).
The methodology’s impact is pronounced in applications where access to reliable labels is constrained, data is non-IID, or there is an acute risk of model collapse due to spurious correlations or overconfident predictions. Recent empirical successes in domains as diverse as vision, language, structured prediction, and even scalable oversight tasks (such as measurement tampering detection) substantiate the broad utility of the SCD paradigm (Daniels et al., 9 Sep 2025, Sun et al., 2023).
The ongoing research challenge lies in automating model diversity control, devising principled adaptive strategies for sample selection, and extending theoretical guarantees to settings with rich structure and nonstandard loss landscapes.