HLV-aware Active Learning
- HLV-aware Active Learning is a framework that models legitimate human label differences to capture inherent task ambiguity.
- It refines instance and annotator selection by integrating predictive uncertainty and diverse perspectives to improve model calibration.
- The approach generalizes label representation to probability distributions, leveraging both human and LLM annotations for enriched insights.
Human Label Variation (HLV)-aware Active Learning refers to a set of methodological, algorithmic, and conceptual adaptations to the standard active learning (AL) framework that explicitly recognize, model, and utilize legitimate variation in human-generated labels—capturing the fact that annotation is not always reducible to a single "ground truth." HLV-aware AL addresses the complex reality of real-world annotation, where plausible differences in annotation (HLV) are prevalent and can be informative, and where annotation systems may involve multiple annotators (including LLMs) with differing perspectives.
1. Conceptual Foundation and Motivation
HLV-aware Active Learning is motivated by the observation that most traditional AL pipelines rest on simplifying assumptions: there exists a single ground truth for each instance, the oracle provides perfect, noise-free labels, and annotation cost is uniform across items. The paper "Revisiting Active Learning under (Human) Label Variation" (Gruber et al., 3 Jul 2025) demonstrates that, especially in natural language processing and related fields, observed label variation is frequent and can stem from genuine task ambiguity, annotator perspective, or differing interpretations—not merely annotation noise.
This framework argues for explicitly decomposing observed label variation (LV) as LV = HLV + Error, where HLV represents plausible variation and Error denotes mistakes or unreliability. Instead of treating all label disagreement as noise, HLV-aware AL maintains, models, and in certain contexts seeks out diverse responses, regarding them as signal about the ambiguity or subjectivity inherent in the data.
2. Acquisition and Instance Selection
Traditional AL instance selection—such as uncertainty sampling or representativeness-based querying—assumes the goal is to resolve uncertainty about a single correct label. In the presence of HLV, high entropy in model predictions may reflect true, irreducible ambiguity, not simply a modeling deficiency.
The HLV-aware approach suggests that standard acquisition functions become inadequate when inherent subjectivity exists. For effective instance selection, it becomes important to:
- Predict annotator disagreement, using models or heuristics for "expected human disagreement."
- Select instances where model and expected annotator uncertainty diverge; for instance, where the model is confident but humans typically disagree (or vice versa), an opportunity for improved model calibration or further insight may exist.
- Support repeated annotation for a single instance, collecting a distribution of plausible labels rather than relying solely on single-label queries.
This suggests refinements or alternatives to classic AL acquisition, such as targeting for additional labeling those instances whose label distributions are expected to be most informative for uncertainty quantification or model calibration.
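The referenced paper does not fix a concrete scoring rule for this; the following is a minimal NumPy sketch of one plausible instantiation, where `predicted_disagreement` is assumed to come from a separate disagreement-prediction model (a hypothetical component, not specified in the paper) and unlabeled instances are ranked by the mismatch between model entropy and expected human label entropy.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def hlv_acquisition_scores(model_probs, predicted_disagreement):
    """Score unlabeled instances by the gap between model uncertainty
    and expected human disagreement.

    model_probs: (n, k) array of model class probabilities.
    predicted_disagreement: (n,) array of expected human label entropy,
        e.g. from a disagreement predictor trained on multi-annotator data.

    A large gap flags calibration opportunities: the model is confident
    where humans disagree, or uncertain where humans agree.
    """
    model_entropy = np.array([entropy(p) for p in model_probs])
    return np.abs(model_entropy - predicted_disagreement)

# Query the instances with the largest model/human uncertainty mismatch.
scores = hlv_acquisition_scores(
    model_probs=np.array([[0.95, 0.05], [0.5, 0.5], [0.6, 0.4]]),
    predicted_disagreement=np.array([0.65, 0.05, 0.6]),
)
query_order = np.argsort(-scores)
```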
3. Annotator Selection and Human/Model Diversity
HLV-aware AL introduces an explicit annotator acquisition function, addressing the fact that "who" annotates a given instance is as important as "which" instance is chosen. The framework supports:
- Dynamic selection among multiple human annotators, LLMs, or other labeling systems.
- Optimization for diversity and representativeness in the annotator pool, rather than minimizing cost or seeking only the most accurate annotator.
- Strategies such as multi-head selection models or entropy-based balancing that ensure not only coverage of majority perspectives but also inclusion of minority or rare viewpoints.
This setup is conceptually distinct from traditional AL, which typically assumes a monolithic oracle. In HLV-aware AL, the annotator set (and their distributional contributions) are integrated into the overall acquisition logic, allowing, for instance, strategic labeling that reflects population-level diversity or operational goals.
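As a hedged illustration of entropy-based balancing, the sketch below greedily selects annotators whose estimated label distributions maximize the entropy of the pooled distribution, so near-duplicate annotators do not drown out minority viewpoints. The `annotator_profiles` input (per-annotator expected label distributions, e.g. estimated from annotation history) is an assumption for illustration, not something the paper specifies.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def select_annotators(annotator_profiles, budget):
    """Greedily pick annotators whose estimated label distributions
    maximize the entropy of the pooled (averaged) distribution.

    annotator_profiles: dict mapping annotator id -> (k,) array giving
        that annotator's expected label distribution for the instance.
    budget: number of annotators to query.
    """
    chosen, pool = [], dict(annotator_profiles)
    while pool and len(chosen) < budget:
        def pooled_entropy(a):
            dists = [annotator_profiles[c] for c in chosen] + [pool[a]]
            return entropy(np.mean(dists, axis=0))
        best = max(pool, key=pooled_entropy)
        chosen.append(best)
        del pool[best]
    return chosen
```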
4. Label Representation: From Discrete to Distributional
Classic AL assumes labels are discrete and single-valued. HLV-aware Active Learning generalizes label storage and usage to explicitly reflect observed variation:
- Discrete labels: still supported, but interpreted as samples from a latent distribution over label judgments.
- Probability ("soft") labels: annotators and models may provide probabilistic beliefs over classes, e.g., .
- Distributional (hierarchical) labels: higher-order uncertainty or subjectivity is captured using distributions over probability vectors (e.g., a Beta distribution in the binary case or a Dirichlet distribution for multiclass), reflecting not just a mean label but also the confidence and plausibility range of possible labels.
Figure 1 in (Gruber et al., 3 Jul 2025) provides an explicit formal taxonomy of these label representations. Learning algorithms employed under HLV-aware AL may accordingly replace the customary cross-entropy loss with KL divergence, Jensen-Shannon divergence, or other distributional comparisons, fitting models to label distributions rather than point labels.
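For concreteness, here is a minimal PyTorch sketch of training against soft labels, replacing one-hot cross-entropy with a KL-divergence loss between the empirical human label distribution and the model's predictive distribution. The vote-normalization step is one common convention for building soft labels, not a prescription from the paper.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, human_label_dist):
    """KL divergence from the model's predictive distribution to the
    empirical human label distribution, replacing one-hot cross-entropy.

    logits: (batch, k) raw model outputs.
    human_label_dist: (batch, k) per-instance label distributions,
        e.g. normalized annotator vote counts.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # F.kl_div expects log-probabilities as input, probabilities as target.
    return F.kl_div(log_probs, human_label_dist, reduction="batchmean")

# Example: 3 annotators voted (2, 1, 0) on a 3-class instance.
votes = torch.tensor([[2.0, 1.0, 0.0]])
target = votes / votes.sum(dim=-1, keepdim=True)   # (0.667, 0.333, 0.0)
logits = torch.tensor([[1.2, 0.3, -0.8]])
loss = soft_label_loss(logits, target)
```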
5. Integration of LLMs as Annotators
HLV-aware AL also extends to the deployment of LLMs as sources of annotation. These systems:
- Can directly provide probability distributions or even higher-order uncertainty assessments for labels—inherently matching "soft" or distributional labels.
- May produce different forms of uncertainty compared to human annotators; the framework recommends evaluating LLM annotation quality, calibrating their uncertainty, and assessing their suitability for each task.
- Allow for combined or comparative annotation processes (either in parallel with human annotators or as a preliminary step), although the precise alignment between LLM-provided HLV and actual human diversity remains an open research question.
A plausible implication is that HLV-aware AL may lower annotation costs by leveraging LLM diversity where suitable, but must guard against substituting LLM bias or modeling artifacts for true human variation.
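A sketch of eliciting a soft label from an LLM annotator follows; `query_llm` is a hypothetical callable (prompt in, text out) standing in for any real client, and the JSON-distribution prompt is one possible elicitation format, not a method from the paper. Verbalized probabilities from LLMs may be miscalibrated, which is precisely why the calibration and suitability checks above matter.

```python
import json

def llm_label_distribution(text, classes, query_llm):
    """Elicit a probability distribution over labels from an LLM annotator.

    query_llm: hypothetical callable (prompt -> response string); any
        real client can be wrapped to fit this interface.
    Returns a normalized list of probabilities aligned with `classes`,
    compatible with the soft-label machinery sketched earlier.
    """
    prompt = (
        f"Classify the text below. Reply ONLY with a JSON object mapping "
        f"each of {classes} to a probability (summing to 1), reflecting "
        f"how plausible each label is.\n\nText: {text}"
    )
    raw = json.loads(query_llm(prompt))
    probs = [max(float(raw.get(c, 0.0)), 0.0) for c in classes]
    total = sum(probs) or 1.0
    return [p / total for p in probs]   # renormalize defensively
```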
6. Real-world Applications and Empirical Considerations
The need for HLV-aware AL is evidenced by real-world tasks such as natural language inference, hate speech and sentiment detection, argumentation mining, veridicality assessment, NER, and SRL—areas where label ambiguity and subjectivity are the rule rather than the exception. The referenced paper surveys case studies and quantitative evaluations highlighting the limits of single-ground-truth AL in these settings, and the benefits, in downstream performance and calibration, of modeling label variation directly.
Empirical studies compare budget-allocation strategies, such as querying more annotators per instance (for better label distribution estimation) versus labeling more instances with fewer annotators each. Findings vary by task but consistently highlight the centrality of label variation to understanding and improving annotation outcomes.
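The depth-versus-breadth trade-off can be made concrete with a toy simulation (illustrative only, not an experiment from the paper): under a fixed annotation budget, querying more annotators per instance yields better per-instance distribution estimates but covers fewer instances.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(budget, per_instance, true_dists):
    """Spend `budget` annotations at `per_instance` labels each: covers
    budget // per_instance instances; returns coverage and the mean L1
    error between empirical and true label distributions on covered items."""
    n_covered = budget // per_instance
    errors = []
    for p in true_dists[:n_covered]:
        votes = rng.multinomial(per_instance, p)
        errors.append(np.abs(votes / per_instance - p).sum())
    return n_covered, float(np.mean(errors))

# Synthetic 3-class "true" label distributions for 200 instances.
true_dists = rng.dirichlet([1.0, 1.0, 1.0], size=200)
for m in (1, 3, 10):
    covered, err = simulate(budget=120, per_instance=m, true_dists=true_dists)
    print(f"{m} labels/instance: {covered} instances covered, "
          f"mean L1 error {err:.2f}")
```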
7. Future Directions and Methodological Challenges
Key open avenues in HLV-aware Active Learning include:
- Developing systematic methods to distinguish HLV (plausible disagreement) from annotation noise in observed label variation, a task that remains largely open.
- Designing and empirically validating acquisition functions that select not just the "most uncertain" instance, but the one whose labeling is expected to yield the most informative label distribution.
- Improving annotation frameworks and tools to natively acquire and summarize probabilistic or multi-annotator distributions, including interfaces for non-discrete labeling and tools to aggregate LLM- and human-generated label variability.
- Rigorous evaluation of LLMs as annotators in HLV contexts, with respect to their capacity to emulate human disagreement, detect task subjectivity, and provide well-calibrated uncertainty.
A plausible implication is that HLV-aware AL practices will require novel evaluation metrics and standards, as classic accuracy is ill-defined when task ambiguity is irreducible and the ground truth is distributional.
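As one plausible direction, evaluation can replace accuracy with a divergence between predicted and human label distributions; the sketch below uses Jensen-Shannon distance via SciPy, an illustrative metric choice rather than one mandated by the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def distributional_eval(pred_dists, human_dists):
    """Evaluate against distributional 'ground truth': mean Jensen-Shannon
    distance between predicted and human label distributions (0 means a
    perfect match; accuracy against a single gold label is undefined here)."""
    return float(np.mean([
        jensenshannon(p, h) for p, h in zip(pred_dists, human_dists)
    ]))
```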
HLV-aware Active Learning offers a principled methodology for aligning the AL pipeline with the realities of annotation, namely, that many tasks possess inherent subjectivity, annotator perspectives matter, and the "truth" is often best modeled as a distribution. This approach calls for new acquisition, annotator selection, and learning strategies, and anticipates the growing use of LLMs as both annotation instruments and subjects of study in this domain.