Human-in-the-Loop Multi-Agent Annotation

Updated 9 March 2026

HITL-MAA is a semi-automated multi-agent annotation framework that combines LLM agents with targeted human intervention to ensure high-quality, scalable data labeling.
It employs ensemble, pipeline, and hybrid workflows to dynamically trigger expert review on ambiguous cases, thereby minimizing unnecessary human labor.
Empirical results demonstrate substantial improvements in efficiency and annotation accuracy across diverse areas like search clarification, mathematical reasoning, software QA, and legal mapping.

The Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA) is a class of semi-automated systems that strategically combine multiple LLM agents with structured human intervention for annotation, dataset curation, and knowledge extraction in complex, multidimensional domains. The framework is designed to maximize annotation quality while minimizing human labor by dynamically invoking expert review only on difficult or undecidable cases as detected via agent confidence, ensemble disagreement, or task-specific error modes. HITL-MAA has achieved state-of-the-art performance and substantial efficiency gains in search clarification labeling, mathematical dataset creation, end-to-end software requirements annotation, and multilingual legal terminology mapping (Tavakoli et al., 1 Jul 2025, Liu et al., 2 Jun 2025, Liu et al., 16 Oct 2025, Meng et al., 15 Dec 2025).

1. Core Architectural Paradigms

Across domains, HITL-MAA operationalizes a layered, modular system in which multiple specialized LLM (or LMM) agents execute discrete subtasks in parallel or pipeline fashion. Agents may function independently (voting or aggregating outputs) (Tavakoli et al., 1 Jul 2025), sequentially in a decomposition pipeline (Liu et al., 2 Jun 2025), or as layered annotators with iterative self-correction and fallback strategies (Liu et al., 16 Oct 2025). Typical agent roles include:

- Labeling agents (e.g., GPT-4o, Claude 3, Cohere Command R, Mistral 7B in search tasks) (Tavakoli et al., 1 Jul 2025)
- Content extraction, data transformation, or code generation agents (Liu et al., 2 Jun 2025, Liu et al., 16 Oct 2025)
- Quality control, answer filtering, or evaluation agents (Liu et al., 16 Oct 2025, Meng et al., 15 Dec 2025)
- Orchestrator/Coordinator processes, facilitating hand-off, logging, and human fallback (Meng et al., 15 Dec 2025)

The architecture tightly integrates human annotators or domain experts at predetermined control points. Humans intervene for (i) calibration and threshold selection, (ii) review of ambiguous/flagged outputs, (iii) curation and correction of high-impact edge cases, and (iv) expansion of few-shot memory or retraining for agents (Tavakoli et al., 1 Jul 2025, Meng et al., 15 Dec 2025).

2. Algorithmic Workflows and Agent Interaction Protocols

The interaction protocol in HITL-MAA can be ensemble-based, a strict pipeline, or a hybrid with automated correction loops:

Ensemble Approach: In search clarification annotation, $K=4$ LLM agents independently label each instance with both a discrete value ( $Y_i$ ) and a calibrated confidence score ( $c_i \in [0,1]$ ). Labels are aggregated by majority vote and subjected to confidence-based auto-acceptance or human flagging according to mathematically defined thresholds:

$\mu_c = \frac{1}{K} \sum_{i=1}^K c_i, \quad \sigma_c = \sqrt{\frac{1}{K} \sum_{i=1}^K (c_i - \mu_c)^2}$

If $\mu_c \geq \tau_c$ and $\sigma_c \leq \tau_d$, the majority label is auto-accepted; otherwise, the instance is deferred to human judgment (Tavakoli et al., 1 Jul 2025).
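The acceptance rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `AgentVote` and `aggregate` are hypothetical, and the tie-breaking behavior of the majority vote is an assumption, since the source does not specify it.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass
class AgentVote:
    label: int         # discrete label Y_i from agent i
    confidence: float  # calibrated confidence c_i in [0, 1]


def aggregate(votes, tau_c, tau_d):
    """Return (majority_label, mu_c, sigma_c, auto_accept) for one instance."""
    k = len(votes)
    mu_c = sum(v.confidence for v in votes) / k
    # Population standard deviation (divide by K), matching the formula above.
    sigma_c = sqrt(sum((v.confidence - mu_c) ** 2 for v in votes) / k)
    # Majority vote; ties resolve to the label seen first (an assumption).
    majority_label = Counter(v.label for v in votes).most_common(1)[0][0]
    # Auto-accept only when agents are confident on average AND consistent.
    auto_accept = mu_c >= tau_c and sigma_c <= tau_d
    return majority_label, mu_c, sigma_c, auto_accept


if __name__ == "__main__":
    votes = [AgentVote(2, 0.90), AgentVote(2, 0.85),
             AgentVote(2, 0.95), AgentVote(1, 0.90)]
    print(aggregate(votes, tau_c=0.8, tau_d=0.1))
```

Note that $\sigma_c$ measures the spread of confidence scores, not label disagreement directly; a deployment might additionally inspect the vote margin, which is outside what the source describes.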

Pipeline Approach: In challenging mathematical derivation curation (STORM-BORN), a six-stage sequence performs extraction, question and answer drafting, context expansion, and filtering, passing forward augmented JSON records at each stage. Humans only review finished samples, tagging for acceptance, rejection, or revision (Liu et al., 2 Jun 2025).
Hybrid Correction Loops: In end-to-end software testing (E2EDev), direct agent output is followed by recursive automated error correction, with failures after $N$ iterations escalated to a human fallback (Liu et al., 16 Oct 2025).
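A hybrid correction loop of this kind might look like the sketch below. The function `annotate_with_fallback` and its callable parameters (`generate`, `check`, `repair`) are hypothetical stand-ins for the agent calls; the source specifies only that failures after $N$ iterations are escalated to a human.

```python
from typing import Callable, Optional, Tuple


def annotate_with_fallback(
    item: str,
    generate: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    repair: Callable[[str, str], str],
    max_iters: int,
) -> Tuple[str, str]:
    """Return (output, status); status is 'accepted' or 'needs_human'.

    `check` returns None when the output is valid, otherwise an error message.
    At most `max_iters` repair attempts are made before escalation.
    """
    output = generate(item)
    for _ in range(max_iters):
        error = check(output)
        if error is None:
            return output, "accepted"
        # Feed the error message back to the agent for self-correction.
        output = repair(output, error)
    # Validate the result of the final repair attempt.
    if check(output) is None:
        return output, "accepted"
    return output, "needs_human"


if __name__ == "__main__":
    # Toy callables: the "agent" fixes a typo on its first repair.
    result = annotate_with_fallback(
        "spec",
        generate=lambda s: "tset",
        check=lambda o: None if o == "test" else "typo",
        repair=lambda o, e: "test",
        max_iters=2,
    )
    print(result)
```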

Agent–human escalation points are determined by formal policies or confidence criteria, often encoded into the orchestrator or coordinator logic (Meng et al., 15 Dec 2025).

3. Calibration, Thresholding, and Quality Control Mechanics

Reliable performance in HITL-MAA is predicated on robust calibration:

Threshold Selection: Calibration is performed on a human-labeled subset ( $n_\mathrm{sub}\approx 10\%$ ), where a grid search over $\tau_c$ and $\tau_d$ seeks a Pareto-efficient trade-off between annotation reliability (e.g., quadratically weighted Cohen’s $\kappa$, written $\kappa_w$) and human effort reduction (HER), the share of manual annotation avoided relative to fully human labeling.

The optimal $(\tau_c, \tau_d)$ pair is chosen subject to a minimum-$\kappa_w$ constraint (Tavakoli et al., 1 Jul 2025).
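The threshold search can be illustrated with the following sketch, which rests on several assumptions not stated in the source: HER is computed as the fraction of calibration instances that are auto-accepted, deferred instances receive the human (gold) label, and the record layout, the grid, and `kappa_min` are hypothetical.

```python
from itertools import product


def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratically weighted Cohen's kappa for integer labels in [0, n_classes)."""
    n = len(a)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic disagreement weight
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    # den == 0 only when both raters use one identical class throughout.
    return 1.0 if den == 0 else 1.0 - num / den


def select_thresholds(records, n_classes, grid, kappa_min):
    """records: list of (majority_label, mu_c, sigma_c, gold_label).

    Returns (tau_c, tau_d, her, kappa) maximizing HER subject to
    kappa >= kappa_min, or None if no grid point satisfies the constraint.
    """
    gold = [r[3] for r in records]
    best = None
    for tau_c, tau_d in product(grid, grid):
        accepted = [mu >= tau_c and sd <= tau_d for _, mu, sd, _ in records]
        # Assumption: deferred instances are labeled correctly by the human.
        final = [m if ok else g for (m, _, _, g), ok in zip(records, accepted)]
        kappa = quadratic_weighted_kappa(final, gold, n_classes)
        her = sum(accepted) / len(records)  # assumed definition of HER
        if kappa >= kappa_min and (best is None or her > best[2]):
            best = (tau_c, tau_d, her, kappa)
    return best


if __name__ == "__main__":
    calib = [(2, 0.90, 0.05, 2), (1, 0.85, 0.04, 1), (0, 0.60, 0.30, 1),
             (2, 0.70, 0.10, 0), (0, 0.95, 0.02, 0)]
    print(select_thresholds(calib, n_classes=3, grid=[0.1, 0.5, 0.8], kappa_min=0.99))
```

In practice a library routine such as scikit-learn's `cohen_kappa_score` with `weights="quadratic"` could replace the hand-written kappa; it is written out here only to keep the sketch dependency-free.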

Filtering: To ensure dataset rigor, explicit metrics and filters are enforced:
- Reasoning-density filters, requiring a minimum number of reasoning markers (e.g., “assume,” “lemma”) or proof steps (Liu et al., 2 Jun 2025)
- Agent-based quality control, with human reviewers triggered when outputs violate validity, fluency, or completeness constraints (Meng et al., 15 Dec 2025)
- Iterative refinement, with agent prompt edits in response to frequent human-requested corrections (Liu et al., 2 Jun 2025)
Human Review Protocols: Expert panels rate samples against domain-specific criteria (e.g., clarity, correctness, reasoning density), or finalize controversial outputs flagged by agent glass-box checks (Liu et al., 2 Jun 2025, Meng et al., 15 Dec 2025).
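A reasoning-density filter of the kind listed above could be approximated as in this sketch. Only the markers "assume" and "lemma" come from the source; the remaining markers, the whole-word matching rule, and the `min_markers` threshold are illustrative assumptions.

```python
import re

# "assume" and "lemma" are from the source; the others are hypothetical additions.
REASONING_MARKERS = ("assume", "lemma", "therefore", "hence")


def reasoning_density(text, markers=REASONING_MARKERS):
    """Count whole-word, case-insensitive occurrences of reasoning markers.

    Inflected forms such as "assumes" are not counted by this simple pattern.
    """
    pattern = r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def passes_filter(text, min_markers=3):
    """Keep a sample only if it contains at least `min_markers` markers."""
    return reasoning_density(text) >= min_markers


if __name__ == "__main__":
    sample = "Assume x > 0. By Lemma 2, therefore y > 0."
    print(reasoning_density(sample), passes_filter(sample))
```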

4. Empirical Results and Efficiency Gains

HITL-MAA frameworks report marked improvements in annotation scalability, cost-efficiency, and output quality:

Search Clarification Annotation: On five multidimensional subtasks, HITL-MAA achieved up to a 45% reduction in manual annotation (HER) while maintaining the required weighted-$\kappa$ level on all tasks, substantially outperforming the best individual LLM and simple ensemble baselines (Tavakoli et al., 1 Jul 2025).
Mathematical Dataset Curation: Fewer than 5% of the problems in the curated STORM-BORN challenge set were solved by SOTA models; fine-tuning on the benchmark improved mathematical reasoning accuracy by 7.8–9.1% on out-of-domain test sets. Human agreement on gold labels is reported alongside these results (Liu et al., 2 Jun 2025).
Software QA Annotation: In E2EDev, automated annotation with HITL-MAA reduced per-project annotation time by 2–3× (from roughly 8h to 3.5h), raised inter-annotator agreement, and produced executable test cases at around \$0.50 in LLM API costs per project (Liu et al., 16 Oct 2025).
Legal Terminology Mapping: Extraction coverage increased by over 40%, with low hallucination rates in top-performing configurations; LLM-agent scores correlated strongly with human expert ratings (Meng et al., 15 Dec 2025).

5. Scalability, Generalizability, and Limitations

HITL-MAA demonstrates linear scalability with respect to input data size due to parallelizable agent design and distributed human review interfaces. The modular nature facilitates adaptation to new domains by swapping agent specializations, tuning language- and domain-specific prompts, and retraining correction policies or few-shot memory (Meng et al., 15 Dec 2025).

Limitations include:

- Dependence on initial domain coverage and prompt engineering for agent specialization
- Computational cost scaling with LLM inference volume, particularly for pipeline approaches
- Bottlenecks at human review for rare or highly ambiguous cases, especially in legal or code annotation
- In legal settings, requirement for high-quality pivot translations and case-by-case adjustment for new law domains (Meng et al., 15 Dec 2025)
- No formal theoretical error bounds, though empirical convergence is observed as curated expert memory increases (Meng et al., 15 Dec 2025)

Adaptations for new domains are facilitated by prompt memory integration and minor pipeline mutations rather than full model retraining.

6. Applications and Cross-Domain Implementations

Search Clarification: Multi-agent ensembles optimize label reliability for subjective or fine-grained information retrieval tasks, operationalizing reliable automation with fallback HITL thresholds (Tavakoli et al., 1 Jul 2025).
Mathematical Reasoning Benchmarks: Pipeline HITL-MAA protocols support creation of ultra-difficult, self-contained benchmarks, advancing LLM mathematical reasoning and providing high-fidelity supervision signals (Liu et al., 2 Jun 2025).
End-to-End Software Development: BDD-driven pipelines coordinate code instrumentation, requirement mining, test generation, and iterative implementation, anchored by agent self-correction and HITL human panels (Liu et al., 16 Oct 2025).
Multilingual Legal Mapping: Article and terminology extraction, alignment, and standardization are decomposed into multi-agent phases, elevating output coverage and linguistic precision for statutory translation resources (Meng et al., 15 Dec 2025).

Successful transferability has been demonstrated across these distinct domains without loss of auditability or cost control.

7. Evaluation Metrics and Theoretical Foundations

Multiple domain-specific and generic metrics underpin HITL-MAA quality assessment:

Metric / Notation	Definition / Measurement	Context
$\kappa_w$ (Weighted $\kappa$)	Quadratically weighted Cohen’s $\kappa$ vs. ground truth	Label agreement (Tavakoli et al., 1 Jul 2025)
HER	Human Effort Reduction: share of manual annotation avoided	Efficiency (Tavakoli et al., 1 Jul 2025)
Precision/Recall/$F_1$	Extracted vs. gold-standard terms	Legal mapping (Meng et al., 15 Dec 2025)
Overall score	Multidimensional overall score across evaluation axes	Legal mapping (Meng et al., 15 Dec 2025)
Cost	Cost model for LLM API usage per project	Software QA (Liu et al., 16 Oct 2025)
Inter-annotator kappa	Interrater agreement, task-dependent	All domains
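As an illustration of the precision/recall row in the table, the sketch below scores a set of extracted terms against a gold-standard set. Exact string matching between terms is an assumption; the source does not say how matches are determined, and the example terms are invented for demonstration.

```python
def precision_recall_f1(extracted, gold):
    """Set-based precision, recall, and F1 for extracted vs. gold terms."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    # F1 is the harmonic mean; define it as 0 when both components are 0.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical term lists for demonstration only.
    extracted_terms = ["tort", "lien", "estoppel", "writ"]
    gold_terms = ["tort", "lien", "easement"]
    print(precision_recall_f1(extracted_terms, gold_terms))
```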

Decision mechanisms rely on agent-calibrated confidence, ensemble disagreement, or both; continuous delivery of human corrections is implemented as prompt memory (“few-shot exemplars”) for agent refinement (Meng et al., 15 Dec 2025).
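One way to picture prompt memory is a bounded store of human corrections that are replayed as few-shot exemplars. The sketch below is hypothetical throughout: the class `ExemplarMemory`, its capacity, the recency-based selection, and the prompt layout are assumptions, not details from the cited work.

```python
from collections import deque


class ExemplarMemory:
    """Bounded store of human-corrected examples, reused as few-shot exemplars."""

    def __init__(self, capacity=50):
        # Oldest corrections are discarded once capacity is reached.
        self._items = deque(maxlen=capacity)

    def add_correction(self, source_text, corrected_label):
        self._items.append((source_text, corrected_label))

    def build_prompt(self, instruction, query, k=3):
        """Compose a prompt from the k most recent corrections plus the query."""
        recent = list(self._items)[-k:] if k > 0 else []
        lines = [instruction, ""]
        for text, label in recent:
            lines += [f"Input: {text}", f"Label: {label}", ""]
        lines += [f"Input: {query}", "Label:"]
        return "\n".join(lines)


if __name__ == "__main__":
    memory = ExemplarMemory(capacity=2)
    memory.add_correction("first example", "A")
    memory.add_correction("second example", "B")
    memory.add_correction("third example", "C")  # evicts the first example
    print(memory.build_prompt("Label the input.", "new case"))
```

Selecting exemplars by recency is the simplest policy; similarity-based retrieval would be a natural alternative, though the source does not indicate which is used.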

HITL-MAA provides a systematic, empirically validated framework for coupling LLM automation with targeted human expertise, yielding scalable and rigorously quality-controlled annotations across diverse and challenging domains (Tavakoli et al., 1 Jul 2025, Liu et al., 2 Jun 2025, Liu et al., 16 Oct 2025, Meng et al., 15 Dec 2025).