ELICE: Advanced Crowd-Labeling
- ELICE is a family of techniques that incorporate soft label elicitation, expert-injected estimation, and candidate labeling to aggregate uncertain crowd annotations.
- Soft-label approaches capture annotator probability distributions, enhancing model calibration and robustness, albeit with increased per-annotation time.
- Candidate labeling leverages sets of plausible labels to reduce bias and improve performance in ambiguous, high-class-count scenarios while mitigating noise.
ELICE (Crowd-labeling) denotes a family of techniques and frameworks designed to enhance the quality, efficiency, and robustness of learning from crowdsourced annotation, typically in scenarios where ground-truth labels are difficult or expensive to obtain. The term "ELICE" appears in the literature to reference at least three distinct, influential paradigms: (1) Eliciting and Learning with Soft Labels from Every Annotator (Collins et al., 2022), (2) Expert Label Injected Crowd Estimation (Khattak et al., 2016), and (3) Extended Labeling for Inference from Candidate Ensembles (candidate labeling) (Beñaran-Muñoz et al., 2018). These approaches share a common goal: to effectively leverage crowd input, including uncertainty and annotator heterogeneity, to yield high-fidelity supervision for downstream machine learning.
1. Problem Settings and Motivations
Crowd-labeling originated as a pragmatic response to the prohibitive cost and labor intensity of expert-only annotation for large-scale or complex ML datasets. Standard paradigms typically collect single hard (one-hot) labels per instance from multiple annotators and aggregate them—most commonly by simple majority vote—to approximate the latent ground-truth. This model, however, discards rich information about annotator uncertainty, expertise variation, and instance ambiguity.
Key motivations for ELICE methods include:
- Reducing the number of required annotators by extracting more information per response
- Exploiting annotator uncertainty, hesitancy, or partial knowledge, rather than forcing hard choices
- Increasing robustness to noisy, random, or adversarial labelers
- Improving model calibration and generalization by leveraging soft or probabilistic targets
- Enabling operation in high-class-count or ambiguous settings where traditional labeling is inefficient or error-prone (Collins et al., 2022, Beñaran-Muñoz et al., 2018, Khattak et al., 2016).
2. Eliciting and Learning with Soft Labels from Every Annotator
The paradigm of "eliciting and learning with soft labels from every annotator" (Collins et al., 2022) formalizes a method for directly eliciting annotator belief distributions, rather than binary class choices, and training models to match these probability vectors. The core crowdsourcing protocol consists of the following elements:
- Each annotator provides for each instance:
- Their primary class choice with an associated probability in [0,100]
- Optionally, their secondary class and probability
- A selection of classes designated as “definitely impossible”
Calibration is enhanced by instructing annotators to estimate how “100 crowd workers” would respond, leveraging a third-person perspective inspired by Bayesian Truth Serum.
Soft label construction proceeds as follows: Each annotator's partial probability vector is completed into a full K-vector using either (1) uniform redistribution of remaining mass over unspecified classes or (2) clamp redistribution among only the not-marked-impossible classes. The final per-image soft label is the average over M annotators’ distributions.
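The completion-and-averaging procedure above can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' implementation; the function names and the 4-class setup are invented for the example.

```python
def complete_soft_label(partial, impossible, num_classes, mode="clamp"):
    """Complete an annotator's partial probability vector into a full
    distribution over num_classes.

    partial: {class_index: probability mass} as elicited (primary and,
             optionally, secondary class)
    impossible: set of class indices marked "definitely impossible"
    mode: "uniform" spreads the leftover mass over all unspecified classes;
          "clamp" spreads it only over classes not marked impossible.
    """
    vec = [partial.get(c, 0.0) for c in range(num_classes)]
    leftover = 1.0 - sum(vec)
    if mode == "uniform":
        targets = [c for c in range(num_classes) if c not in partial]
    else:  # clamp
        targets = [c for c in range(num_classes)
                   if c not in partial and c not in impossible]
    for c in targets:
        vec[c] += leftover / len(targets)
    return vec

def aggregate(vectors):
    """Per-instance soft label: the average of the M annotators' vectors."""
    m = len(vectors)
    return [sum(v[c] for v in vectors) / m for c in range(len(vectors[0]))]
```

For instance, an annotator who gives class 0 probability 0.7, class 1 probability 0.2, and marks class 3 impossible yields [0.7, 0.2, 0.1, 0.0] under clamp redistribution, but [0.7, 0.2, 0.05, 0.05] under uniform redistribution, which "leaks" mass onto the impossible class.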
Models are trained using the standard cross-entropy loss against these soft targets.
Performance is evaluated on soft-label cross-entropy, calibration (RMSE of adaptive-bin reliability diagrams), and robustness (cross-entropy under FGSM adversarial attack at a fixed perturbation budget).
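The soft-target training objective generalizes standard cross-entropy by replacing the one-hot target with the averaged annotator distribution. A minimal sketch (the function name is invented; in practice this is the framework's cross-entropy with probability targets):

```python
import math

def soft_cross_entropy(pred, target, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k).

    With a one-hot target this reduces to standard cross-entropy; with a
    soft target its minimum is the entropy of the target itself, reached
    when pred matches target exactly."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pred, target))
```

Because the minimum is attained only when the model reproduces the full annotator distribution, the loss rewards calibrated probabilities rather than mere argmax correctness.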
Empirically, this procedure—particularly the T2 Clamp variant—achieves model performance (accuracy, calibration, adversarial robustness) comparable to large-scale hard-label aggregation, with significant reductions in number of annotators per instance. However, elicitation time per label is roughly an order of magnitude higher (median 32s compared to 1.8s) (Collins et al., 2022). Evidence indicates that even in the few-annotator regime, directly elicited soft labels confer superior performance and robustness compared to hard-label aggregation or subsampling.
3. Expert Label Injected Crowd Estimation
"Expert Label Injected Crowd Estimation" (ELICE) (Khattak et al., 2016) addresses the reliability collapse observed in crowd-labeling at high proportions of low-quality or malicious workers—i.e., the “phase transition” beyond which aggregated crowd labels become largely uninformative. This framework introduces expert labels for a small subset of instances to initialize and calibrate the aggregation process.
The ELICE family comprises several algorithms:
- ELICE 1 (basic): Calibrates unknown worker ability (α) and instance difficulty (β) from the subset of expert-labeled examples. Inferred labels for the remaining instances are obtained via weighted majority vote, with the estimated abilities serving as weights; the final aggregation applies logistic weights computed from the ability and difficulty estimates.
- ELICE 2 (entropy and flip): Refines α and β by incorporating Shannon entropy to distinguish random labelers from malicious ones. Votes from malicious workers (those with negative estimated ability) are automatically flipped.
- ELICE 3 (pairwise comparison): When expert labels may be noisy, this variant uses global pairwise comparisons among workers (Bradley–Terry/Gumbel-max) and among instances to further refine estimates.
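The expert-calibrated weighting idea can be illustrated with a deliberately simplified binary sketch. The ability estimator and logistic weighting below are illustrative assumptions, not the published ELICE estimators, which also model instance difficulty:

```python
import math

def estimate_ability(worker_labels, expert_labels):
    """Ability in [-1, 1]: +1 means perfect agreement with the expert
    labels, -1 perfect disagreement (a consistently malicious worker).
    All labels are +1 or -1."""
    agree = sum(1 for w, e in zip(worker_labels, expert_labels) if w == e)
    return 2.0 * agree / len(expert_labels) - 1.0

def elice_vote(votes, abilities):
    """Weighted majority vote on one instance.  Centered logistic weights
    keep near-random workers (ability ~ 0) uninfluential, and give
    negative-ability workers negative weight, which flips their votes."""
    score = 0.0
    for v, a in zip(votes, abilities):
        w = 1.0 / (1.0 + math.exp(-a)) - 0.5  # in (-0.5, 0.5), sign follows ability
        score += w * v
    return 1 if score >= 0 else -1
```

Note how a worker with ability -1 who votes -1 contributes positively to the +1 class: the negative weight implements the "flip" behavior described for ELICE 2.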
The theoretical analysis produces a PAC-style lower bound on the number of expert-labeled instances needed to estimate worker ability and instance difficulty within given error with high probability.
Empirical studies demonstrate that ELICE algorithms outperform classic aggregation baselines (Majority Voting, GLAD, Dawid–Skene EM, Karger, etc.) especially as the proportion of low-quality workers rises, with ELICE 2 maintaining accuracy up to 70–80% adversarial labeling. The approach is extensible via cluster-based expert sampling for skewed data (Khattak et al., 2016).
4. Extended Labeling for Inference from Candidate Ensembles (Candidate Labeling)
"Extended Labeling for Inference from Candidate Ensembles" (also cited as "candidate labeling") (Beñaran-Muñoz et al., 2018) differs fundamentally from hard and soft probabilistic labeling by allowing each annotator to return a set of plausible labels (candidate set) per instance. Concretely, annotators may select any labels they believe could be correct. The aggregation algorithm termed "candidate voting" estimates the MAP class label for each instance via: This protocol efficiently harnesses annotator uncertainty and is particularly valuable when the number of annotators is small, the label space is large, or instances are ambiguous.
Bias–variance analysis reveals that candidate labeling substantially reduces bias (the chance of missing the true label) at the cost of slightly increased variance due to more candidate ties. In empirical benchmarks (UCI datasets, MNIST, synthetic Dirichlet simulations), candidate voting with small candidate sets (a few chosen candidates per annotator) achieves markedly lower overall error than standard single-label aggregation, especially under high ambiguity or annotator hesitation (Beñaran-Muñoz et al., 2018).
5. Empirical Evaluation, Comparative Results, and Cost Considerations
Comparative evaluations across the three ELICE frameworks are summarized below:
| Framework | Strengths | Cost/Annotation |
|---|---|---|
| Soft-label ELICE | High calibration, robustness, few annotators | High (e.g., 32s per image) |
| Expert-injected ELICE | Robust to noise/malicious crowd, explicit expert calibration | Moderate (subset experts needed) |
| Candidate Labeling | Efficient with small annotator pools, high class count | Low-moderate, fast to annotate |
Key commonalities include the robust aggregation of richer annotator input: uncertainty quantification (Collins et al., 2022, Beñaran-Muñoz et al., 2018), ability/difficulty weighting (Khattak et al., 2016), and flexible aggregation rules designed to minimize bias or preserve calibration.
A critical trade-off underlies all approaches: direct, rich elicitation (soft labels, candidate sets) can drastically reduce the number of annotators required per item but typically increases per-annotation cost. Efficient elicitation strategies or active protocols (eliciting rich information only for ambiguous examples) can rebalance this trade-off.
6. Practical Recommendations and Theoretical Guidelines
Evidence indicates ELICE-style methodologies are particularly advantageous when:
- The annotator pool is small, costly, or contains high-variance expertise
- Task ambiguity or large label cardinality drives up annotator hesitancy or error
- Domain settings require reliable modeling of uncertainty or adversarial robustness
Some operational guidelines include:
- Use rich elicitation (full soft labels, candidate sets) preferentially in expert-constrained regimes or highly ambiguous data subsets
- Clamp or uniform residual redistribution can be adapted to context; clamping is preferable to avoid “leaking” mass to implausible classes (Collins et al., 2022)
- Employ de-aggregation or stochastic single-annotator targets as a regularizer for noisy crowds
- Release per-annotator distributions (not just aggregates) to facilitate downstream personalization and advanced aggregation (Collins et al., 2022)
- In candidate labeling, a moderate candidate set size achieves strong bias reduction without excessive variance (Beñaran-Muñoz et al., 2018)
- Active selection of elicitation depth (full vs. partial label) can optimize time/performance tradeoff
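The active-selection guideline above can be sketched as an entropy-gated policy: collect cheap hard labels first, and escalate to rich elicitation only when annotators disagree. The threshold and function names are illustrative assumptions, not a published protocol:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (nats) of the empirical distribution of the hard
    labels collected so far for one instance."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def elicitation_depth(labels_so_far, threshold=0.5):
    """Request cheap hard labels for clear-cut instances; escalate to rich
    elicitation (a full soft label or candidate set) once annotator
    disagreement, measured as entropy, crosses the threshold."""
    return "rich" if label_entropy(labels_so_far) > threshold else "hard"
```

Under this policy, an instance with unanimous early votes stays on the fast hard-label path, while a contested instance (e.g., a 50/50 split, entropy ln 2 ≈ 0.69 nats) triggers the slower, more informative elicitation.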
7. Limitations, Extensions, and Future Directions
Principal limitations of ELICE-type protocols include elevated per-annotation time, potential cognitive overhead on annotators (especially when the number of classes is large), and, in the case of candidate labeling, possible drawbacks from random tie-breaking when candidate sets are too large or indiscriminately checked (Beñaran-Muñoz et al., 2018). Additionally, some downstream learners may require adaptation to work with soft or candidate label targets.
Potential future extensions include:
- Automated or adaptive elicitation interfaces—soliciting richer label information only for instances with high entropy or known ambiguity
- Annotation interface design to minimize time per rich response (e.g., skip optional inputs if time-constrained)
- Aggregation algorithms leveraging annotator histories and hierarchical reliability estimates (multi-level expertise modeling)
- Large-scale empirical studies contrasting per-label cost, annotator load, and overall error in real-world, non-simulated crowd settings
ELICE frameworks—including soft-label elicitation, expert-injected estimation, and candidate ensemble voting—have demonstrated that principled, uncertainty-aware harnessing of crowd input can deliver high-quality supervision under stringent annotation budgets and in the presence of annotation noise (Collins et al., 2022, Khattak et al., 2016, Beñaran-Muñoz et al., 2018).