Annotator-Centric Active Learning

Updated 9 September 2025
  • Annotator-Centric Active Learning (ACAL) is a framework that integrates annotator variability, rationale, and cognitive factors to optimize data selection and improve model training.
  • It combines query selection, performance modeling, and explanation-based weighting to adapt to non-ideal annotators while reducing annotation cost.
  • Empirical findings show that ACAL enhances data efficiency, captures diverse human judgments, and improves model fairness across various domains.

Annotator-Centric Active Learning (ACAL) encompasses a family of active learning strategies that explicitly integrate the characteristics, behaviors, cognitive states, and diverse perspectives of human annotators into the data selection, annotation, and model evaluation loop. ACAL recognizes that, in practical annotation scenarios, annotators are non-ideal: they may be inconsistent, bring differing expertise or subjectivity, and experience fatigue, while also contributing valuable rationales. By systematically modeling and leveraging these factors, ACAL aims to improve the efficiency, accuracy, and diversity of machine learning model training, particularly for subjective, complex, or cost-sensitive tasks.

1. Key Principles and Theoretical Foundation

At the foundation of ACAL is the recognition that annotation quality and annotation cost are heterogeneous across annotators, instances, and time. Unlike traditional active learning frameworks that assume a perfect, omniscient oracle, ACAL relies on models and strategies that address annotator variability, subjectivity, and cognitive factors:

  • Annotator Performance Modeling: Annotator performance is formalized via mappings such as ψ: Q × A → ℝ, estimating for each query–annotator pair the expected annotation quality based on historical data, expertise, or internal state (Herde et al., 2021).
  • Query–Annotator Pair Selection: Data selection couples sample informativeness (e.g., uncertainty, diversity, region coverage) with annotator suitability, allowing dynamic allocation that maximizes information gain per annotation cost (Herde et al., 2021, Mortagua, 31 Jul 2025); a sketch of this pairing follows this list.
  • Incorporation of Rationale and Explanation: Annotator rationales, such as feature rankings or natural language explanations, are incorporated into acquisition or disagreement measures, e.g., by weighting committee members by their fidelity to annotator-provided rankings using Kendall’s tau, or by using explanations to guide model learning and selection (Ghai et al., 2020, Yao et al., 2023).
  • Diversity of Human Judgments: Especially in subjective NLP tasks, ACAL explicitly seeks to approximate the soft label distribution by sampling annotators to achieve broad coverage of differing judgments, supporting both minority and majority perspectives equally (Meer et al., 24 Apr 2024).
  • Cognitive State Modeling: Internal annotator states such as mood and fatigue are factored into annotator selection heuristics or recommendation systems, enabling real-time adaptation to maximize annotation quality (Mortagua, 31 Jul 2025).
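
As a concrete illustration of the query–annotator pairing above, the following is a minimal sketch that assumes predictive entropy as the informativeness measure, a precomputed matrix of ψ estimates, and a fixed per-annotator cost vector; these specific choices are illustrative assumptions rather than the exact procedures of the cited papers.

```python
import numpy as np

def select_query_annotator_pair(probs, psi, annotation_cost):
    """Pick the (instance, annotator) pair with the highest utility per cost.

    probs:           (n_instances, n_classes) predicted class probabilities
    psi:             (n_instances, n_annotators) estimated label quality,
                     an empirical stand-in for the mapping psi: Q x A -> R
    annotation_cost: (n_annotators,) expected cost per label for each annotator
    """
    # Instance informativeness: entropy of the current model's prediction.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)        # (n_instances,)

    # Utility of giving instance q to annotator a: informativeness scaled by
    # the expected label quality, normalized by that annotator's cost.
    utility = entropy[:, None] * psi / annotation_cost[None, :]     # (n_instances, n_annotators)

    q_idx, a_idx = np.unravel_index(np.argmax(utility), utility.shape)
    return q_idx, a_idx

# Toy example: 4 unlabeled instances, 3 classes, 2 annotators.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)
psi = rng.uniform(0.5, 1.0, size=(4, 2))   # e.g., estimated from annotation history
cost = np.array([1.0, 2.5])                # the second annotator is more expensive
print(select_query_annotator_pair(probs, psi, cost))
```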

2. Methodological Variants and Architectural Components

ACAL systems manifest a range of architectural augmentations and query strategies:

| Methodological Feature | Implementation Example | Benefit |
|---|---|---|
| Annotator selection after data sampling | Two-stage selection (data, then annotator) (Meer et al., 24 Apr 2024) | Efficiently covers diversity in subjective labeling |
| Annotator-centric head architectures | Multi-head BERT models with one head per annotator (Wang et al., 2023) | Models individual behaviors; captures disagreement |
| Rationale-based weighting | Model committee weighting by feature ranking agreement (Ghai et al., 2020) | Integrates expert rationale into sample selection |
| Mood/fatigue-aware selection | Knowledge-based recommender using accuracy, mood, and fatigue (Mortagua, 31 Jul 2025) | Reduces label errors due to annotator state fluctuations |
| Noise filtering and resampling | Partitioning and resampling of “noisy” regions (Shafir et al., 6 Apr 2025) | Maintains label quality, especially in low-budget regimes |
| Explanation-augmented selection | Using semantic similarity to explanations for sample ranking (Yao et al., 2023) | Guides data selection by rationale diversity |

Local explanation techniques (such as LIME or SHAP) are pivotal where feature-level annotator rationales are involved, and annotator history (such as semantic diversity or representation-based vectors) underpins sampling strategies in subjective tasks (Meer et al., 24 Apr 2024, Yao et al., 2023). Advanced strategies, such as using bandit algorithms for source selection in video annotation (Ziai et al., 9 Feb 2024), further generalize ACAL to complex media.
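
As a rough illustration of the rationale-based weighting described above, the sketch below weights query-by-committee members by Kendall's tau agreement between their feature-importance rankings (e.g., from LIME or SHAP) and an annotator-provided ranking, then scores an instance by soft vote entropy; the clipping and aggregation choices are assumptions, not the exact procedure of Ghai et al. (2020).

```python
import numpy as np
from scipy.stats import kendalltau

def committee_weights(member_feature_ranks, annotator_rank):
    """Weight each committee member by Kendall's tau agreement between its
    feature-importance ranking and the annotator-provided ranking."""
    weights = []
    for ranks in member_feature_ranks:
        tau, _ = kendalltau(ranks, annotator_rank)
        weights.append(max(tau, 0.0))           # ignore anti-correlated members
    weights = np.asarray(weights)
    if weights.sum() == 0:                      # no member agrees with the rationale
        return np.full(len(weights), 1.0 / len(weights))
    return weights / weights.sum()

def soft_vote_entropy(member_probs, weights):
    """Disagreement score for one instance: entropy of the weighted average of
    committee predictions (higher means more informative to query next)."""
    consensus = np.average(member_probs, axis=0, weights=weights)
    return -np.sum(consensus * np.log(consensus + 1e-12))

# Toy example: 3 committee members ranking 5 features, predicting over 3 classes.
member_ranks = [[1, 2, 3, 4, 5], [2, 1, 3, 5, 4], [5, 4, 3, 2, 1]]
annotator_rank = [1, 2, 3, 4, 5]                # annotator's rationale as a feature ranking
w = committee_weights(member_ranks, annotator_rank)
member_probs = np.array([[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.1, 0.2, 0.7]])
print(w, soft_vote_entropy(member_probs, w))
```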

3. Evaluation Metrics and Human-Centric Assessment

Traditional metrics (e.g., macro F₁, accuracy, Jensen–Shannon divergence) are often extended in ACAL with annotator-centric metrics:

  • Per-Annotator F₁ and JS Divergence: Compute and aggregate performance metrics per annotator, quantifying how well minority perspectives are captured (Meer et al., 24 Apr 2024).
  • Worst-Off Annotator Metrics: Evaluate model performance for the least well-represented annotators, applying fairness concepts to ensure that no viewpoint is neglected; both this and the per-annotator metrics above are illustrated in the sketch after this list.
  • Uncertainty–Disagreement Correlation: Assess the match between model uncertainty estimates and empirical annotator disagreement, particularly with multi-head models (Wang et al., 2023).
  • Human Trust and Explanation Usefulness: In settings where explanations are solicited, human evaluation aims to quantify the informativeness and trust engendered by explanation-augmented models (Yao et al., 2023).
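
A minimal sketch of per-annotator and worst-off evaluation follows, assuming each annotator has labeled the same evaluation instances and using macro F₁ as the base metric; the plain mean/minimum aggregation is an illustrative choice and not tied to any single cited protocol.

```python
import numpy as np
from sklearn.metrics import f1_score

def annotator_centric_f1(per_annotator_labels, predictions):
    """Score predictions against each annotator's labels separately.

    per_annotator_labels: dict of annotator id -> labels for the evaluation set
    predictions:          model predictions for the same evaluation instances
    Returns (mean macro F1 over annotators, worst-off annotator macro F1, per-annotator scores).
    """
    scores = {
        annotator: f1_score(labels, predictions, average="macro")
        for annotator, labels in per_annotator_labels.items()
    }
    return float(np.mean(list(scores.values()))), min(scores.values()), scores

# Toy example: 3 annotators labeling the same 6 instances.
labels = {
    "a1": [0, 1, 1, 0, 2, 2],
    "a2": [0, 1, 1, 0, 2, 1],   # mostly agrees with a1
    "a3": [1, 1, 0, 0, 1, 2],   # a minority perspective
}
preds = [0, 1, 1, 0, 2, 2]
mean_f1, worst_f1, per_annotator = annotator_centric_f1(labels, preds)
print(mean_f1, worst_f1)        # the gap exposes how poorly a3's view is captured
```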

The use of consensus label estimators and annotator trustworthiness weights, e.g., via CROWDLAB or majority voting and Cohen's kappa, also plays a central role in multi-annotator and noisy-label scenarios (Goh et al., 2023, Jukić et al., 2022).
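
As a generic sketch of trust-weighted consensus labeling (not the CROWDLAB algorithm itself), the following performs a plurality vote in which each annotator's vote is scaled by a trust weight; in practice such weights would be derived from agreement statistics like Cohen's kappa.

```python
import numpy as np

def trust_weighted_consensus(label_matrix, trust, n_classes):
    """Estimate consensus labels via a trust-weighted plurality vote.

    label_matrix: (n_instances, n_annotators) integer labels, -1 where missing
    trust:        (n_annotators,) trustworthiness weight per annotator
    """
    n_instances, n_annotators = label_matrix.shape
    votes = np.zeros((n_instances, n_classes))
    for a in range(n_annotators):
        observed = label_matrix[:, a] >= 0                # instances this annotator labeled
        votes[observed, label_matrix[observed, a]] += trust[a]
    return votes.argmax(axis=1)

# Toy example: 4 instances, 3 annotators, 2 classes; the third annotator is least trusted.
labels = np.array([[0, 0, 1],
                   [1, 1, 0],
                   [0, -1, 1],
                   [1, 1, 1]])
trust = np.array([1.0, 0.9, 0.4])   # in practice, derived from agreement statistics
print(trust_weighted_consensus(labels, trust, n_classes=2))   # -> [0 1 0 1]
```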

4. Domains of Application and Empirical Findings

ACAL frameworks have demonstrated empirical advantages in diverse real-world settings:

  • Subjective NLP Tasks: Substantial gains in data efficiency (up to 62% annotation budget reduction), broader representation of annotator perspectives, and improved modeling of minority views in hate speech, moral sentiment, and safety judgment datasets (Meer et al., 24 Apr 2024).
  • Text, Image, and Video Domains: Integrations with pre-trained language and vision–language models, leveraging domain expertise, yield both annotation efficiency and improvements in average precision and classification performance across datasets (Ziai et al., 9 Feb 2024, Weeber et al., 2021, Tsvigun et al., 2023).
  • Low-Budget and Noisy Annotation Regimes: Noise-aware sampling and transfer learning approaches (e.g., using SVMs over SSL features with resilience to annotation noise) provide strong advantages over traditional fine-tuning or random selection, particularly when only sparse annotations can be acquired (Aggarwal et al., 2022, Shafir et al., 6 Apr 2025).
  • Multi-Annotator Systems: Unified frameworks (e.g., ALANNO) support balanced assignment, continuous agreement monitoring, adaptive batch selection, and extensibility to a variety of AL strategies (Jukić et al., 2022).

Some of the strongest improvements are observed in relabeling scenarios and in early-phase active learning, where correct early selection and repeated annotation significantly impact final generalization (Goh et al., 2023).

5. Challenges, Limitations, and Open Directions

The success of ACAL methods is subject to a series of practical and theoretical constraints:

  • Annotator Pool Heterogeneity: The effectiveness of diversity-driven annotator selection is limited by the size and heterogeneity of the available annotator pool. With low diversity or few annotators per item, performance gains diminish (Meer et al., 24 Apr 2024).
  • Dependency on Annotation History: Several strategies require sufficient history to compute diversity or trustworthiness—necessitating warm-up phases and careful design in early rounds (Meer et al., 24 Apr 2024).
  • Cognitive State Measurement: Real-time tracking of mood and fatigue requires accurate, possibly intrusive measurement; simulations often replace real-world signals, which may miss subtle influences or be difficult to scale (Mortagua, 31 Jul 2025).
  • Reliance on Explanation Quality: Methods that integrate annotator rationale or explanation-augmented sampling can be misled if explanations (either human or model-generated) are of low fidelity or inconsistent (Ghai et al., 2020, Yao et al., 2023).
  • Label Noise: While noise-aware methods such as NAS improve robustness, all coverage- and typicality-based AL approaches remain sensitive to systematic noise, especially when data is scarce or the noise filtering itself is imperfect (Shafir et al., 6 Apr 2025).

Future work is highlighted across several axes: modeling richer cognitive and behavioral annotator features, dynamically learning the trade-off between misclassification and annotation costs (MC/AC), scaling to extreme annotator or instance counts, integrating better biometrics into annotator models, and developing more holistic annotator selection and evaluation strategies (Herde et al., 2021, Mortagua, 31 Jul 2025). Real-world validation with active, online human annotators remains an important direction for assessing live annotator-centric systems.

6. Broader Impact and Significance

ACAL frameworks align the machine learning model development lifecycle with the realities of human annotation, striving to minimize annotation cost while maximizing utility, validity, and fairness. By accounting for annotator reliability, diversity, rationale, and cognitive conditions, ACAL systems have demonstrated improvements in annotation efficiency, performance on subjective or ambiguous tasks, and robustness to noise and annotator disagreement.

Specific results include statistically significant reductions in the number of queries needed to reach target performance (e.g., halving for some query-by-committee variants (Ghai et al., 2020)), improved model trust through explanation alignment, and more representative model behavior for both majority and minority annotator perspectives. These advances are particularly salient as datasets grow in complexity and as societal demands for reliable and fair annotation, especially in subjective domains, continue to increase.

The field continues to evolve toward frameworks that manage the full spectrum of annotator diversity, cost, and cognition, positioning ACAL as a cornerstone for scalable, human-aligned AI annotation systems across domains.