Label-Agnostic Curation
- Label-agnostic curation is a strategy that selects data, models, or embeddings without relying on ground-truth labels, enhancing transferability and efficiency.
- It employs surrogate criteria like input-based pruning, NTK proxies, and cross-schema embedding mappings to guide selection and improve generalization.
- The approach is validated both theoretically and empirically, with phase-transition analysis and benchmark results on datasets like ImageNet and CIFAR.
Label-agnostic curation refers to the family of data selection, model selection, and embedding curation strategies that operate without utilizing ground-truth labels, or that intentionally decouple ranking and selection from formal label semantics. In contrast to label-aware methods that directly leverage or require known output targets, label-agnostic curation proceeds either with surrogate criteria or through proxy representations, enabling transferability, interoperability, and often greater efficiency in scenarios ranging from neural architecture search to emotion analysis and data pruning for generalization.
1. Principled Definition and General Framework
Label-agnostic curation encompasses procedures for selecting or organizing training data, model architectures, or embedding spaces where the curation decision function is entirely independent of the labels $y_i$ associated with each example $x_i$. Formally, a curation rule is label-agnostic if the indicator for inclusion can be written as a function of the input, model, or auxiliary signals only, $s_i = q(x_i)$, with no dependence on $y_i$. Such rules are foundationally distinct from label-aware versions, which define $s_i = q(x_i, y_i)$. This design paradigm arises in settings such as high-dimensional supervised learning where only input structure can be leveraged for curation, or where interoperability among disparate annotation schemas is required.
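The distinction is purely one of function signature, as this minimal sketch illustrates (both decision rules are hypothetical examples, not any paper's method):

```python
import numpy as np

def keep_label_agnostic(x: np.ndarray) -> bool:
    # Decision reads only the input x (here: an illustrative norm test).
    return float(np.linalg.norm(x)) < 1.0

def keep_label_aware(x: np.ndarray, y: float) -> bool:
    # Decision may also consult the ground-truth label y.
    return y > 0 and float(np.linalg.norm(x)) < 1.0
```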
2. Label-Agnostic Curation in Data Pruning and Generalization
The theoretical underpinnings of label-agnostic data curation are developed in the context of high-dimensional supervised learning in "Why Less is More (Sometimes): A Theory of Data Curation" (Dohmatob et al., 5 Nov 2025). The method introduces a label-agnostic pruning oracle, defined by a unit vector $w_{\mathrm{pr}}$ and a thresholding function $q$, yielding for each sample $x_i$ the inclusion indicator $s_i = q(w_{\mathrm{pr}}^\top x_i)$. The pruner selects examples based solely on the geometry of $x_i$ under $w_{\mathrm{pr}}$, without consulting $y_i$. This rule can implement "keep hard," "keep easy," or random sampling depending on the symmetry or asymmetry of $q$. The resulting curated set consists of a fraction $p$ of the original data.
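A minimal sketch of such a pruning oracle follows; the score function, thresholds, and the identification of small projection magnitude with "hard" examples are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def prune(X: np.ndarray, w_pr: np.ndarray, mode: str = "keep_easy",
          frac: float = 0.5) -> np.ndarray:
    """Label-agnostic pruning: keep a fraction `frac` of the rows of X
    based only on the projection score |w_pr . x_i|; labels are never read.

    mode="keep_easy" keeps the largest-|score| samples, "keep_hard" keeps
    the smallest, and "random" ignores the score entirely.
    """
    scores = np.abs(X @ w_pr)
    n_keep = int(frac * len(X))
    if mode == "keep_easy":
        idx = np.argsort(-scores)[:n_keep]
    elif mode == "keep_hard":
        idx = np.argsort(scores)[:n_keep]
    else:  # random sampling
        idx = np.random.choice(len(X), n_keep, replace=False)
    return idx
```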
A key result is the derivation of closed-form, high-dimensional scaling laws for the test error under label-agnostic pruning. The main quantities computed are:
- Pairwise alignment cosines among the generator direction $w_{\mathrm{gen}}$, the pruning direction $w_{\mathrm{pr}}$, and the ground-truth direction $w_*$.
- Explicit formulas for test error in terms of expectations over the (possibly pruned) data distribution.
Specifically, the asymptotic test error after solving regularized least-squares on the pruned data decomposes into a bias-type term and a variance-type term, both of which depend on the aforementioned alignment quantities and on Marchenko–Pastur transforms of the pruned data covariance.
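The following numerical sketch illustrates the regime the theory describes; it is not the paper's exact setup (Gaussian inputs, a hypothetical teacher $w_*$, an assumed oracle alignment of 0.8, and plain ridge regression are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 2000, 200, 1e-2

# Hypothetical ground-truth direction w_star and a pruning direction
# w_pr partially aligned with it (cos(w_pr, w_star) = 0.8 by construction).
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
z = rng.normal(size=d)
z -= (z @ w_star) * w_star          # component orthogonal to w_star
z /= np.linalg.norm(z)
w_pr = 0.8 * w_star + 0.6 * z       # unit vector

X = rng.normal(size=(n, d))
y = X @ w_star + 0.5 * rng.normal(size=n)

def ridge_test_error(X_tr, y_tr, n_test=5000):
    """Fit regularized least-squares and report mean squared excess risk."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    X_te = rng.normal(size=(n_test, d))
    return float(np.mean((X_te @ (w - w_star)) ** 2))

keep = np.argsort(np.abs(X @ w_pr))[: n // 2]   # label-agnostic "keep hard"
print("full  :", ridge_test_error(X, y))
print("pruned:", ridge_test_error(X[keep], y[keep]))
```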
A critical insight is the construction of a phase-transition curve defining the implicit boundary in parameter space where curation transitions from being beneficial to harmful. In the unregularized, data-rich limit, the critical condition reduces to a direct comparison of the alignment cosines of the pruning oracle and the generator with the ground truth.
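Continuing the sketch above (reusing `X`, `y`, `z`, `w_star`, and `ridge_test_error`), a sweep over the oracle's alignment with $w_*$ exposes the flip point empirically; the sweep grid is an arbitrary illustrative choice:

```python
# Sweep the oracle's alignment with w_star and flag where pruning helps.
full_err = ridge_test_error(X, y)
for align in (0.0, 0.25, 0.5, 0.75, 0.95):
    w = align * w_star + np.sqrt(1.0 - align**2) * z   # unit pruning direction
    keep = np.argsort(np.abs(X @ w))[: n // 2]
    pruned_err = ridge_test_error(X[keep], y[keep])
    print(f"align={align:.2f}  curation beneficial: {pruned_err < full_err}")
```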
Empirical results on ImageNet validate these theoretical findings. For low-accuracy generators (weak alignment with the ground truth), "keep easy" rules maximize accuracy. For highly accurate generators (alignment close to one), "keep hard" rules (label-agnostic selection of difficult examples) prevent model collapse and outperform random or easy-only selection (Dohmatob et al., 5 Nov 2025).
3. Label-Agnostic Neural Architecture Search
Label-agnostic curation extends beyond data selection to model selection, typified by the NASI (Neural Architecture Search at Initialization) framework (Shu et al., 2021). Here, label-agnosticism is achieved through a closed-form, label-free proxy based on the Neural Tangent Kernel (NTK) at initialization: for each candidate architecture $\mathcal{A}$, the trace norm of its NTK, $\|\Theta_{\mathcal{A}}\|_{\mathrm{tr}}$, serves as a surrogate for expected training speed and ultimate generalization. The NASI algorithm employs a differentiable search over architecture weights $\alpha$, optimizing an objective in which the NTK trace is approximated using gradients of the loss evaluated on random labels $\tilde{y}$.
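A sketch of the label-free proxy idea in PyTorch; the gradient-norm estimator, batch size, and toy candidate architectures below are assumptions, and NASI's exact estimator and normalization may differ:

```python
import torch
import torch.nn as nn

def ntk_trace_proxy(model: nn.Module, x: torch.Tensor) -> float:
    """Label-free surrogate for the NTK trace at initialization: the
    squared norm of the parameter gradient of a batch loss computed
    against *random* labels (no ground truth is ever consulted)."""
    y_rand = torch.randint(0, 10, (x.shape[0],))        # random labels
    loss = nn.functional.cross_entropy(model(x), y_rand)
    grads = torch.autograd.grad(
        loss, [p for p in model.parameters() if p.requires_grad])
    return sum(g.pow(2).sum().item() for g in grads)

# Rank two hypothetical candidate architectures by the proxy.
x = torch.randn(64, 3072)
candidates = {
    "wide":   nn.Sequential(nn.Linear(3072, 512), nn.ReLU(), nn.Linear(512, 10)),
    "narrow": nn.Sequential(nn.Linear(3072, 64),  nn.ReLU(), nn.Linear(64, 10)),
}
for name, net in candidates.items():
    print(name, ntk_trace_proxy(net, x))
```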
A crucial guarantee is that, under Lipschitz-continuous losses, random labels suffice for the proxy to consistently estimate architecture quality. Empirically, the label-free proxy correlates strongly with the "true" NTK trace (computed with actual labels) on NAS-Bench-1Shot1, confirming that the label-agnostic proxy preserves architecture ranking.
Performance metrics reported for NASI demonstrate state-of-the-art error rates with dramatically reduced search cost:

| Method | CIFAR-10 err | CIFAR-100 err | Params | Mult-Adds | Search Cost (GPU h) |
|-------------|--------------|---------------|--------|-----------|---------------------|
| DARTS (2nd) | 2.76% | 17.54% | 3.3M | 574M | 24 |
| NASI-FIX | 2.79% | 16.12% | 3.9M | 585M | 0.24 |
Transfer to ImageNet using architectures curated on CIFAR-10—without label or data alignment—achieves top-1 error 24.3% at just 0.01 GPU days, underscoring the transferability enabled by pure label-agnostic selection.
4. Label-Agnostic Curation in Cross-Schema Embedding Spaces
In the field of emotion analysis, heterogeneous taxonomies and annotation schemas fragment data interoperability. Buechel et al. (2020) formalize label-agnostic curation in embedding space, constructing a shared latent representation that encapsulates labels from multiple formats (e.g., Valence–Arousal–Dominance, basic emotion categories) and languages.
The curation proceeds in two stages:
- Learning multi-way mappings among label spaces via per-format encoders and decoders, trained with a combined objective $\mathcal{L} = \mathcal{L}_{\mathrm{map}} + \mathcal{L}_{\mathrm{auto}} + \mathcal{L}_{\mathrm{sim}}$ of mapping, auto-decoding, and embedding-similarity loss terms (see the sketch after this list).
- Deploying the frozen "projection heads" onto word-level or text-level base models, so that predictions in any schema require only the shared latent code.
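A compact sketch of the first stage; the module shapes, latent size, squared-error losses, and the assumption that batches are paired across formats are all illustrative choices rather than the published model's exact configuration:

```python
import torch
import torch.nn as nn

EMB = 32  # shared latent size (assumption)

class Codec(nn.Module):
    """Encoder/decoder pair for one label format (e.g., VAD or basic emotions)."""
    def __init__(self, label_dim: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(label_dim, 64), nn.ReLU(), nn.Linear(64, EMB))
        self.dec = nn.Sequential(nn.Linear(EMB, 64), nn.ReLU(), nn.Linear(64, label_dim))

def multiway_loss(codecs, batches):
    """Sum of auto-decoding (A -> A), cross-format mapping (A -> B, for
    items paired across formats), and embedding-similarity losses."""
    mse = nn.functional.mse_loss
    loss = 0.0
    for a, ya in batches.items():
        za = codecs[a].enc(ya)
        loss = loss + mse(codecs[a].dec(za), ya)        # auto-decoding
        for b, yb in batches.items():
            if a == b:
                continue
            zb = codecs[b].enc(yb)
            loss = loss + mse(codecs[b].dec(za), yb)    # cross-format mapping
            loss = loss + mse(za, zb)                   # embedding similarity
    return loss

codecs = {"vad": Codec(3), "basic6": Codec(6)}
batches = {"vad": torch.randn(8, 3), "basic6": torch.randn(8, 6)}
print(multiway_loss(codecs, batches).item())
```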
Curating datasets with no forced alignment across languages or genres (ten lexica, five languages, multiple corpora) demonstrates that label-agnostic portable prediction heads (PPH) match or exceed the accuracy of format-specific models, and outperform post-processing approaches for zero-shot transfer across label schemas.
Average Pearson correlation improvements illustrate the efficacy:

| Setting | Baseline r | +PPH r |
|---------------------------|-------------|-------------|
| Word-level supervised | .768 | .783 (+.015) |
| Text-level zero-shot | .485 | .495 (+.010) |
Disk-space reductions and architectural simplicity stem directly from the interoperability provided by label-agnostic curation.
5. Theoretical Guarantees and Phase-Transition Analysis
A primary concern in label-agnostic curation is understanding when such strategies improve generalization versus when they incur degradation. The framework of (Dohmatob et al., 5 Nov 2025) provides exact formulas for the test error under both label-agnostic and full-dataset regimes. The defining phase-transition curve determines the pruning fraction and alignment for which curation is beneficial. In the unregularized, data-rich limit, the condition simplifies to a comparison of alignment parameters: label-agnostic curation is favored when the pruning oracle is well aligned with the ground-truth direction relative to the generator's error.
This phase-boundary has concrete operational consequences. On ImageNet, optimal curation matches the theoretical phase transition, and aggressive label-agnostic pruning (of "hard" or "easy" examples) prevents model collapse and achieves test error minima predicted by the analysis.
6. Limitations, Extensions, and Impact
Label-agnostic curation assumes either a surrogate oracle (a pruning direction $w_{\mathrm{pr}}$), formal properties of the model (the NTK at initialization), or a shared embedding space with learnable cross-schema mappings. Its effectiveness hinges on the representational capacity of these proxies or surrogates.
Limitations include:
- Transferring the theoretical results to real models depends on infinite-width or asymptotic assumptions; proxy quality may degrade for highly nonlinear or narrow models (Shu et al., 2021, Dohmatob et al., 5 Nov 2025).
- Empirical gains rely on the existence of a well-aligned pruning direction or transferable latent code.
- Single-batch statistics in NASI may introduce variance; aggregating the proxy over several random batches can mitigate ranking instability (see the sketch below).
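A minimal illustration of that aggregation, assuming the hypothetical `ntk_trace_proxy` helper from the sketch in Section 3:

```python
def stable_proxy(model, batches):
    """Average the single-batch NTK-trace proxy over several random
    batches to reduce ranking variance across batch draws."""
    vals = [ntk_trace_proxy(model, x) for x in batches]
    return sum(vals) / len(vals)

# e.g., eight fresh random batches of the same shape as before:
batches = [torch.randn(64, 3072) for _ in range(8)]
```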
Nevertheless, label-agnostic curation provides unifying principles for efficient search, scalable model selection, data interoperability, and mitigation of overfitting, especially in settings where task labels are unavailable, ambiguous, or prohibitively expensive. As observed in NAS, emotion embedding, and curated data selection, these techniques enable interoperability and efficiency without compromising predictive accuracy or transfer, and provide phase-accurate guidance for practitioners seeking to balance sample size, label utility, and computational resources (Dohmatob et al., 5 Nov 2025, Shu et al., 2021, Buechel et al., 2020).