
Label-Agnostic Curation

Updated 12 November 2025
  • Label-agnostic curation is a strategy that selects data, models, or embeddings without relying on ground-truth labels, enhancing transferability and efficiency.
  • It employs surrogate criteria like input-based pruning, NTK proxies, and cross-schema embedding mappings to guide selection and improve generalization.
  • The approach is validated both theoretically and empirically, with phase-transition analysis and benchmark results on datasets like ImageNet and CIFAR.

Label-agnostic curation refers to the family of data selection, model selection, and embedding curation strategies that operate without utilizing ground-truth labels, or that intentionally decouple ranking and selection from formal label semantics. In contrast to label-aware methods that directly leverage or require known output targets, label-agnostic curation proceeds either with surrogate criteria or through proxy representations, enabling transferability, interoperability, and often greater efficiency in scenarios ranging from neural architecture search to emotion analysis and data pruning for generalization.

1. Principled Definition and General Framework

Label-agnostic curation encompasses procedures for selecting or organizing training data, model architectures, or embedding spaces where the curation decision function is entirely independent of the labels $y_i$ associated with each example $(x_i, y_i)$. Formally, a curation rule is label-agnostic if the indicator for inclusion $p_i$ can be written as a function of the input, model, or auxiliary signals only: $p_i = q(x_i; \theta, \text{auxiliary})$, with no dependence on $y_i$. Such rules are foundationally distinct from label-aware versions, which define $p_i = q(x_i, y_i; \cdots)$. This design paradigm arises in settings such as high-dimensional supervised learning, where only input structure can be leveraged for curation, or where interoperability among disparate annotation schemas is required.

2. Label-Agnostic Curation in Data Pruning and Generalization

The theoretical underpinnings of label-agnostic data curation are developed in the context of high-dimensional supervised learning in "Why Less is More (Sometimes): A Theory of Data Curation" (Dohmatob et al., 5 Nov 2025). The method introduces a label-agnostic pruning oracle, defined by a unit vector $w_o$ and a thresholding function $q:\mathbb{R}\to\{0,1\}$, yielding for each sample $x_i$:

$$p_i = q(x_i^\top w_o)$$

The pruner selects examples based solely on the geometry of $x_i$ under $w_o$, without consulting $y_i$. This rule can implement "keep hard," "keep easy," or random sampling depending on the symmetry or asymmetry of $q(|t|)$. The resulting curated set consists of a fraction $p = \mathbb{E}_{x \sim \mathcal{N}(0, I)}[q(x^\top w_o)]$ of the original data.
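A minimal NumPy sketch of such an oracle (the direction `w_o`, threshold `tau`, and data are illustrative, not values from the paper): projecting each input onto `w_o` and thresholding $|x_i^\top w_o|$ yields "keep easy" (large-margin) or "keep hard" (small-margin) rules, and at no point are labels consulted.

```python
import numpy as np

def prune(X, w_o, tau, mode="keep_easy"):
    """Label-agnostic pruning: keep x_i based only on the score |x_i^T w_o|."""
    scores = np.abs(X @ w_o)          # geometry of x_i under w_o; no labels used
    if mode == "keep_easy":           # retain large-margin (easy) examples
        keep = scores >= tau
    elif mode == "keep_hard":         # retain small-margin (hard) examples
        keep = scores < tau
    else:                             # a symmetric q reduces to random-like sampling
        keep = np.random.rand(len(X)) < 0.5
    return X[keep], keep

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))   # synthetic Gaussian inputs
w_o = np.zeros(50); w_o[0] = 1.0      # unit pruning direction (illustrative)
X_easy, mask = prune(X, w_o, tau=1.0, mode="keep_easy")
```

The kept fraction is controlled by `tau`; under the Gaussian design above it approximates the population fraction $p = \mathbb{E}[q(x^\top w_o)]$ from the text.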

A key result is the derivation of closed-form, high-dimensional scaling laws for the test error under label-agnostic pruning. The main quantities computed are:

  • Alignment cosines $\rho$, $\rho_g$, $\rho_*$ among the generator $w_g$, pruning direction $w_o$, and ground-truth $w_*$.
  • Explicit formulas for test error in terms of expectations over the (possibly pruned) data distribution.

Specifically, the asymptotic test error after solving regularized least-squares on the pruned data is:

$$E_{\mathrm{test}} = \frac{1}{\pi}\,\arccos\!\left( \frac{|m_0|}{\sqrt{\nu_0}} \right)$$

where $m_0$ and $\nu_0$ depend on the aforementioned alignment quantities and on Marchenko–Pastur transforms of the pruned data covariance.

A critical insight is the construction of a phase-transition curve, given explicitly by

$$\Delta E = E_{\mathrm{curated}}(N) - E_{\mathrm{full}}(N) = 0,$$

which defines the implicit boundary in parameter space where curation transitions from being beneficial to harmful. In the limit $\lambda\to 0,\ \phi\to 0$, the critical condition becomes $(\rho-\rho_g\rho_*)^2 + (\rho_*)^2 \gtreqless \rho^2$.

Empirical results on ImageNet validate these theoretical findings. For low-accuracy generators (small $N$, $\rho < 0.9$), "keep easy" rules maximize accuracy. For highly accurate generators ($\rho \approx 0.95$), "keep hard" rules (label-agnostic selection of difficult examples) prevent model collapse and outperform random or easy-only selection (Dohmatob et al., 5 Nov 2025).

3. Label-Agnostic Model Selection at Initialization

Label-agnostic curation extends beyond data selection to model selection, typified by the NASI (Neural Architecture Search at Initialization) framework (Shu et al., 2021). Here, label-agnosticism is achieved through a closed-form, label-free proxy based on the Neural Tangent Kernel (NTK) at initialization for each candidate architecture $A$:

$$K_0(X, X) = \nabla_\theta f(X; \theta_0)\, \nabla_\theta f(X; \theta_0)^\top$$

The NTK trace norm $\|K_0\|_{\mathrm{tr}}$ serves as a surrogate for expected training speed and ultimate generalization. The NASI algorithm employs a differentiable search over architecture weights $\alpha$, optimizing:

$$R(A) = \|\widetilde{K}_0(A)\|_2^2 - \mu \cdot \max(0, \|\widetilde{K}_0(A)\|_2^2 - \nu)$$

where $\widetilde{K}_0(A)$ approximates the kernel using gradients of the loss with random labels $\hat{y}_i$.

A crucial guarantee is that under Lipschitz-continuous losses, random labels suffice for $\widetilde{K}_0$ to consistently estimate architecture quality. Empirically, the correlation between the "true" NTK trace (with actual labels) and the label-free proxy is $\rho \approx 0.99$ on NAS-Bench-1Shot1, confirming that the label-agnostic proxy preserves architecture ranking.
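The label-independence of the kernel itself can be illustrated on a toy linear model $f(x;\theta)=\theta^\top x$, where the per-example gradient at initialization is simply $x_i$, so $K_0 = XX^\top$; this is a sketch of the definition, not of NASI's full loss-gradient approximation with random labels:

```python
import numpy as np

def ntk_trace(X):
    """Trace of K_0 = J J^T, where J[i] = grad_theta f(x_i; theta_0).
    For the toy linear model f(x; theta) = theta @ x, the Jacobian J is X itself."""
    K0 = X @ X.T                      # NTK at initialization; labels never appear
    return np.trace(K0)

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 16))     # a single batch of inputs
t = ntk_trace(X)
# The trace equals the sum of squared per-example gradient norms.
```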

Performance metrics reported for NASI demonstrate state-of-the-art error rates with dramatically reduced search cost:

| Method | CIFAR-10 err | CIFAR-100 err | Params | Mult-Adds | Search Cost (GPU h) |
|-------------|--------------|---------------|--------|-----------|---------------------|
| DARTS (2nd) | 2.76% | 17.54% | 3.3M | 574M | 24 |
| NASI-FIX | 2.79% | 16.12% | 3.9M | 585M | 0.24 |

Transfer to ImageNet using architectures curated on CIFAR-10—without label or data alignment—achieves top-1 error 24.3% at just 0.01 GPU days, underscoring the transferability enabled by pure label-agnostic selection.

4. Label-Agnostic Curation in Cross-Schema Embedding Spaces

In the field of emotion analysis, heterogeneous taxonomies and annotation schemas fragment data interoperability. Buechel et al. (Buechel et al., 2020) formalize label-agnostic curation in embedding space, constructing a shared latent representation $\mathbb{E} \cong \mathbb{R}^d$ that encapsulates labels from multiple formats (e.g., Valence–Arousal–Dominance, basic emotion categories) and languages.

The curation proceeds in two stages:

  • Learning multi-way mappings among label spaces via encoders $g_i:\mathcal{L}_i\to\mathbb{R}^d$ and decoders $h_j:\mathbb{R}^d\to\mathcal{L}_j$ with mapping, auto-decoding, and embedding-similarity loss terms: $L_{\mathrm{total}} = L_{\mathrm{map}} + L_{\mathrm{auto}} + L_{\mathrm{sim}}$
  • Deploying the frozen "projection heads" $h_j$ onto word-level or text-level base models, so that predictions in any schema require only the shared latent code.
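The combined objective of the first stage can be sketched for two schemas A and B with hypothetical linear encoders/decoders and mean-squared losses standing in for the paper's actual architectures:

```python
import numpy as np

def pph_loss(yA, yB, encA, encB, decA, decB):
    """L_total = L_map + L_auto + L_sim for items labeled in schemas A and B.
    enc*/dec* are weight matrices of hypothetical linear encoders g and decoders h."""
    zA, zB = yA @ encA, yB @ encB                            # shared latent codes
    l_map  = np.mean((zA @ decB - yB) ** 2) + np.mean((zB @ decA - yA) ** 2)
    l_auto = np.mean((zA @ decA - yA) ** 2) + np.mean((zB @ decB - yB) ** 2)
    l_sim  = np.mean((zA - zB) ** 2)          # pull latents of the same item together
    return l_map + l_auto + l_sim
```

After training, only the decoders $h_j$ are kept as frozen projection heads; any base model that emits the shared latent code can then predict in every schema.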

Curating datasets with no forced alignment across languages or genres (ten lexica, five languages, multiple corpora) demonstrates that label-agnostic portable prediction heads (PPH) match or exceed the accuracy of format-specific models, and outperform post-processing approaches for zero-shot transfer across label schemas.

Average Pearson correlation improvements illustrate the efficacy:

| Setting | Baseline r | +PPH r |
|---------------------------|-------------|--------------|
| Word-level supervised | .768 | .783 (+.015) |
| Text-level zero-shot | .485 | .495 |

Disk-space reductions and architectural simplicity stem directly from the interoperability provided by label-agnostic curation.

5. Theoretical Guarantees and Phase-Transition Analysis

A primary concern in label-agnostic curation is understanding when such strategies improve generalization versus when they incur degradation. The framework of (Dohmatob et al., 5 Nov 2025) provides exact formulas for the test error under both label-agnostic and full-dataset regimes. The defining phase-transition curve determines the optimal pruning fraction and alignment for which curation is beneficial:

$$\Delta E < 0 \iff \frac{|m_0^{\mathrm{curated}}|}{\sqrt{\nu_0^{\mathrm{curated}}}} > \frac{|m_0^{\mathrm{full}}|}{\sqrt{\nu_0^{\mathrm{full}}}}$$

In the unregularized, data-rich limit, this simplifies to a comparison of alignment parameters. Label-agnostic curation is favored when the pruning oracle is well-aligned with the ground-truth direction relative to the generator's error.
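Operationally, the criterion reduces to comparing the two effective alignments; a sketch, where the four statistics are illustrative placeholders rather than values computed by the theory:

```python
import math

def curation_helps(m0_cur, nu0_cur, m0_full, nu0_full):
    """Delta E < 0  iff  |m0_cur|/sqrt(nu0_cur) > |m0_full|/sqrt(nu0_full).
    Returns True when the curated regime has lower asymptotic test error."""
    return abs(m0_cur) / math.sqrt(nu0_cur) > abs(m0_full) / math.sqrt(nu0_full)
```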

This phase-boundary has concrete operational consequences. On ImageNet, optimal curation matches the theoretical phase transition, and aggressive label-agnostic pruning (of "hard" or "easy" examples) prevents model collapse and achieves test error minima predicted by the analysis.

6. Limitations, Extensions, and Impact

Label-agnostic curation assumes either a surrogate oracle (direction $w_o$), formal properties (NTK at initialization), or a shared embedding space with learnable cross-schema mappings. Its effectiveness hinges on the representational capacity of these proxies or surrogates.

Limitations include:

  • Transferring the theoretical results to real models depends on infinite-width or asymptotic assumptions; proxy quality may degrade in highly nonlinear or narrow models (Shu et al., 2021; Dohmatob et al., 5 Nov 2025).
  • Empirical gains rely on the existence of a well-aligned pruning direction or transferable latent code.
  • Single-batch statistics in NASI may introduce variance; aggregation can mitigate ranking instability.

Nevertheless, label-agnostic curation provides unifying principles for efficient search, scalable model selection, data interoperability, and mitigation of overfitting, especially in settings where task labels are unavailable, ambiguous, or prohibitively expensive. As observed in NAS, emotion embedding, and curated data selection, these techniques enable interoperability and efficiency without compromising predictive accuracy or transfer, and provide phase-accurate guidance for practitioners seeking to balance sample size, label utility, and computational resources (Dohmatob et al., 5 Nov 2025, Shu et al., 2021, Buechel et al., 2020).
