Agent-as-Annotators

Updated 4 July 2026

Agent-as-Annotators is a paradigm where agents directly produce, transform, or validate annotations within structured, executable pipelines.
The framework employs diverse methods—from typed annotation protocols to multi-agent orchestration using LLMs and human reviewers—to ensure robust label generation.
Empirical results demonstrate practical gains such as improved F-measure scores and reduced human correction efforts, confirming the efficiency of hybrid annotation workflows.

Agent-as-Annotators is a research paradigm in which agents participate directly in the production, transformation, validation, or curation of annotations rather than merely consuming labeled data. In the literature, the term covers several distinct but related constructions: typed processing components that read annotations and create new annotations in a formal pipeline (Steeg, 2011), human assistant-annotators who behave like tool-using proto-agents during data collection (Kim et al., 2022), LLM- or VLM-based systems that generate, review, or adjudicate labels (Negi et al., 23 Jan 2026), and hybrid workflows in which annotation is embedded inside ongoing agent interaction rather than performed as dense offline labeling (Fu et al., 31 Oct 2025). The concept therefore denotes a family of annotation regimes rather than a single architecture.

1. Conceptual scope and definitions

The cleanest formal definition appears in the typed annotation framework TM2. There, an annotation is a typed object attached to data, with start and end positions, a value, a reference to the underlying data, and the authoring agent class; an agent is a generic transformer from input annotations of type $I$ to output annotations of type $O$, summarized by process(List<Annotation<I>>): List<Annotation<O>> (Steeg, 2011). In that formulation, tokenizers, gazetteers, feature extractors, classifiers, gold-standard providers, and evaluators are all “agents” because they consume existing annotations and produce new ones.
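The typed-agent definition can be illustrated with a minimal Python sketch. It is not TM2's actual API: the field names, the `Agent` base class, and the `WhitespaceTokenizer` example are assumptions chosen to mirror the description above.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, List, TypeVar

I = TypeVar("I")  # input annotation payload type
O = TypeVar("O")  # output annotation payload type
T = TypeVar("T")


@dataclass(frozen=True)
class Annotation(Generic[T]):
    start: int     # start offset in the underlying data
    end: int       # end offset (exclusive)
    value: T       # typed payload
    data_ref: str  # reference to the underlying data
    author: str    # name of the authoring agent class


class Agent(ABC, Generic[I, O]):
    @abstractmethod
    def process(self, annotations: List[Annotation[I]]) -> List[Annotation[O]]:
        """Consume annotations of type I and emit annotations of type O."""


class WhitespaceTokenizer(Agent[str, str]):
    """Hypothetical agent: turns raw-text annotations into token annotations."""

    def process(self, annotations: List[Annotation[str]]) -> List[Annotation[str]]:
        tokens = []
        for doc in annotations:
            for match in re.finditer(r"\S+", doc.value):
                tokens.append(Annotation(
                    start=doc.start + match.start(),
                    end=doc.start + match.end(),
                    value=match.group(),
                    data_ref=doc.data_ref,
                    author=type(self).__name__,
                ))
        return tokens


doc = Annotation(0, 11, "hello world", "doc-1", "Source")
tokens = WhitespaceTokenizer().process([doc])
```

Each output annotation records its authoring agent, so downstream agents such as classifiers or evaluators can consume the tokens without losing provenance.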

Later work broadens this idea. Some systems treat an “agent” as an LLM configuration plus a prompt template and preserve it as a persistent annotation identity inside a managed workflow, as in MEGAnno+ (Kim et al., 2024). Others cast heterogeneous entities—LLMs, SLMs, and human workers alike—as annotation agents coordinated by managerial agents, as in CrowdAgent (Qin et al., 17 Sep 2025). Still others use the term more indirectly: Apollo is explicitly relevant to annotation embedded in interaction, but it is not primarily about agents labeling external datasets; instead, it turns long-horizon trajectories into selective supervision after sparse human guidance (Fu et al., 31 Oct 2025).

This suggests that the literature operates with at least three compatible meanings. In a narrow sense, an agent annotator emits labels for external data items. In a procedural sense, an agent annotator executes an expert labeling protocol. In a broader systems sense, an agent annotator can be any component that creates, filters, adjudicates, or operationalizes supervision inside a larger annotation pipeline.

2. Executable annotation as a formal system

A recurring idea in this literature is that annotation should be executable, typed, and inspectable rather than hidden inside a monolithic predictor. TM2 makes this explicit through analyses and syntheses: analyses transmit annotations of a single type between producers and consumers, while syntheses combine two annotation streams to build a model (Steeg, 2011). The key consequence is compile-time validation of pipeline structure: only components whose annotation types match can be composed.
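The effect of type-checked composition can be approximated in Python by validating a pipeline when it is built, before any data flows through it. This is only a construction-time analogue of the compile-time check described above; the `Stage` record and the string type tags are illustrative assumptions, not TM2 constructs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    name: str
    in_type: str   # annotation type the stage consumes
    out_type: str  # annotation type the stage produces
    fn: Callable[[list], list]


def build_pipeline(stages: List[Stage]) -> Callable[[list], list]:
    """Reject pipelines whose adjacent stages disagree on annotation type."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.out_type != nxt.in_type:
            raise TypeError(
                f"{prev.name} emits {prev.out_type!r} but {nxt.name} expects {nxt.in_type!r}"
            )

    def run(items: list) -> list:
        for stage in stages:
            items = stage.fn(items)
        return items

    return run


# Hypothetical stages: text -> tokens -> token lengths.
tokenizer = Stage("tokenizer", "Text", "Token",
                  lambda texts: [tok for text in texts for tok in text.split()])
lengths = Stage("lengths", "Token", "Length",
                lambda toks: [len(tok) for tok in toks])

run = build_pipeline([tokenizer, lengths])
```

Building the stages in the wrong order, for example `build_pipeline([lengths, tokenizer])`, raises `TypeError` before any annotation is processed.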

TreeAgent extends this executable view into expert-rule annotation. Its Decoupled Declarative Decision framework compiles a natural-language expert rule $\rho$ into an executable tree $\mathcal{T}$ over a Logic Primitive Inventory containing deterministic nodes, VLM nodes, and exit nodes. The orchestrator then traverses the tree by evaluating arithmetic predicates or invoking VLM perception at individual nodes, returning a final label without any modification to orchestrator code when the expert-defined decision structure changes (Chen et al., 30 Jun 2026). In this setting, the annotation policy is externalized as configuration rather than absorbed into model weights.
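A minimal sketch of this pattern follows, assuming a dictionary-based tree format. The node fields, the forestry rule, the threshold, and the `ask_vlm` callable are all hypothetical and are not taken from TreeAgent itself.

```python
from typing import Any, Callable, Dict

# Hypothetical compiled rule: node names, fields, and thresholds are illustrative.
TREE: Dict[str, Dict[str, Any]] = {
    "root": {"type": "deterministic", "feature": "crown_diameter_m",
             "op": ">=", "threshold": 2.0,
             "true": "foliage", "false": "exit_shrub"},
    "foliage": {"type": "vlm", "question": "Is the foliage needle-like?",
                "true": "exit_conifer", "false": "exit_broadleaf"},
    "exit_shrub": {"type": "exit", "label": "shrub"},
    "exit_conifer": {"type": "exit", "label": "conifer"},
    "exit_broadleaf": {"type": "exit", "label": "broadleaf"},
}

OPS = {">=": lambda a, b: a >= b, "<": lambda a, b: a < b}


def annotate(tree: Dict[str, Dict[str, Any]],
             features: Dict[str, float],
             image: Any,
             ask_vlm: Callable[[Any, str], bool],
             start: str = "root") -> str:
    """Walk the tree until an exit node is reached and return its label."""
    node = tree[start]
    while node["type"] != "exit":
        if node["type"] == "deterministic":
            # Arithmetic predicate on a measured feature; no model call needed.
            outcome = OPS[node["op"]](features[node["feature"]], node["threshold"])
        elif node["type"] == "vlm":
            # Perceptual question delegated to a vision-language model.
            outcome = ask_vlm(image, node["question"])
        else:
            raise ValueError(f"unknown node type: {node['type']}")
        node = tree[node["true" if outcome else "false"]]
    return node["label"]
```

Changing the expert rule means editing `TREE` only; `annotate` stays untouched, which is the point of keeping the policy in configuration.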

AURA provides a complementary probabilistic formalization. It treats each AI annotator as a noisy labeler with a class-conditional confusion matrix $\Theta^{(a)}_{k,l} = \Pr(\ell^{(a)} = l \mid y^\star = k)$, then uses Expectation-Maximization to infer latent true labels and annotator reliabilities without gold labels during annotation (Ghosh et al., 30 Jan 2026). Here the annotation workforce is a heterogeneous pool of off-the-shelf models, and the “annotator” abstraction is statistical rather than deliberative.
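For orientation, the standard Dawid-Skene-style EM updates for this noisy-labeler model are written out below. They are a sketch of the general technique, not necessarily AURA's exact procedure, which may add priors or smoothing. The symbols $\pi_k$ (class prior), $\gamma_{i,k}$ (posterior that item $i$ has true class $k$), and $N$ (number of items) are assumptions introduced here; products run over the annotators $a$ that labeled item $i$.

```latex
% E-step: posterior over the latent true label of item i
\gamma_{i,k} =
  \frac{\pi_k \prod_{a} \Theta^{(a)}_{k,\,\ell_i^{(a)}}}
       {\sum_{k'} \pi_{k'} \prod_{a} \Theta^{(a)}_{k',\,\ell_i^{(a)}}}

% M-step: re-estimate class priors and per-annotator confusion matrices
\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{i,k},
\qquad
\Theta^{(a)}_{k,l} =
  \frac{\sum_{i} \gamma_{i,k}\, \mathbb{1}\!\left[\ell_i^{(a)} = l\right]}
       {\sum_{i} \gamma_{i,k}}
```

The two steps alternate until the posteriors stabilize; the inferred label for item $i$ is then $\arg\max_k \gamma_{i,k}$.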

The web-agent distillation framework literally titled “Agent-as-Annotators” makes the executable-role idea even more explicit. It maps the human roles of Task Designer, Annotator, and Supervisor onto a Persona Generator, Task Generator, Agent, and Judge, generating 3,000 tasks, retaining 2,322 successful trajectories, and using those trajectories as supervised data for a student web agent (Lù et al., 9 Apr 2026). The annotation pipeline is thus decomposed into modular roles before any learning occurs.

3. Annotation embedded in interaction and human oversight

A large part of the literature does not replace humans outright; it redistributes annotation labor. In collaborative semantic-object annotation, a computer agent first proposes masks from weak and strong supervision, after which humans refine only the mistakes by flipping or subdividing superpixels. The reported progression is from an initialization $F$-measure of 65.83 to a final $F$-measure of 91.21, with only 25.02 coarse superpixels per image needing correction on average (Zhang et al., 2018). The agent performs repetitive geometric work, while the human resolves semantic ambiguity.

MEGAnno+ applies the same division of labor to NLP labeling. It treats an LLM as a managed annotation agent, runs it over selected subsets, stores labels and metadata, and then routes humans toward exploratory verification of suspicious cases, especially low-confidence outputs derived from token logits (Kim et al., 2024). The human is not asked to relabel everything; verification is selective, queryable, and provenance-preserving.
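Confidence-based routing of this kind can be sketched as follows. This is an illustration rather than the MEGAnno+ API: the record layout, the softmax over per-label logits, and the 0.7 threshold are assumptions.

```python
import math
from typing import Dict, List, Tuple


def label_confidence(label_logits: Dict[str, float]) -> Tuple[str, float]:
    """Softmax over candidate-label logits; returns the top label and its probability."""
    peak = max(label_logits.values())
    exps = {label: math.exp(logit - peak) for label, logit in label_logits.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total


def annotate_and_route(records: List[dict], threshold: float = 0.7) -> List[dict]:
    """Attach label, confidence, and a review flag to each record."""
    out = []
    for rec in records:
        label, conf = label_confidence(rec["logits"])
        out.append({
            "id": rec["id"],
            "label": label,
            "confidence": conf,
            "needs_review": conf < threshold,  # hypothetical cut-off
        })
    return out


results = annotate_and_route([
    {"id": 1, "logits": {"pos": 4.0, "neg": 0.0}},
    {"id": 2, "logits": {"pos": 0.1, "neg": 0.0}},
])
review_queue = [r["id"] for r in results if r["needs_review"]]
```

Only the second record lands in `review_queue`, because its two label logits are nearly tied.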

Apollo pushes this further into long-horizon environments. Humans do not shadow every state-action pair. Instead, they intervene asynchronously “only when the agent deviates from a promising trajectory,” sometimes every 6 hours during long-running training and every 10 minutes during script creation, while the system later applies action-level supervision control with symbolic masking and LLM-based masking (Fu et al., 31 Oct 2025). On InnovatorBench, Apollo-trained GLM-4.5 moved from final average score 11.85 to 21.86 and from best score 13.35 to 24.01, which the paper summarizes as more than a 50% improvement over the untrained baseline and a 28% improvement over the no-human-interaction variant (Fu et al., 31 Oct 2025). The key point is that annotation is no longer dense next-action labeling; it becomes sparse intervention plus post hoc supervision selection.
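As a quick check on the headline figure, and assuming 11.85 and 13.35 are the untrained-baseline scores, the relative gains implied by the reported numbers are:

```latex
\frac{21.86 - 11.85}{11.85} = \frac{10.01}{11.85} \approx 0.84,
\qquad
\frac{24.01 - 13.35}{13.35} = \frac{10.66}{13.35} \approx 0.80
```

Both ratios are consistent with the "more than 50%" summary. The 28% comparison cannot be reproduced here because the score of the no-human-interaction variant is not quoted.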

The nugget-annotation workflow for accountable LLM-as-a-judge evaluation makes the human role even narrower and more normative. Humans first identify what information matters as nuggets, assign categories such as Must Have, Should Have, and Avoid, and only then let the LLM perform high-volume nugget matching with 1–5 grades and supporting quotes (Dietz, 27 Jun 2026). This paper explicitly argues that the LLM should not decide what quality means; it should only apply human-authored criteria at scale.

4. Multi-agent orchestration, adjudication, and managed annotation organizations

Several recent systems treat annotation not as a single labeling act but as a managed organization. CrowdAgent is the clearest example: it models LLMs, SLMs, and human workers as annotation agents, then adds a Quality Assurance Agent, Financing Agent, and Scheduling Agent above them. Samples are considered converged when Bayesian posterior confidence exceeds a threshold set by default to 0.99, while human routing first selects the bottom 10% by confidence and then uses Core-Set to choose 5% of the total dataset for human annotation (Qin et al., 17 Sep 2025). On six multimodal classification tasks, CrowdAgent is reported as best on all six, reaching, for example, 98.21 on COV-CTR and 88.45 on V-SNLI (Qin et al., 17 Sep 2025).
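The convergence test and two-stage human routing can be sketched in a few lines. The sketch assumes that Core-Set selection means greedy k-center selection over sample embeddings and that the least confident sample seeds the selection; how CrowdAgent computes confidences and embeddings is not shown, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def converged_ids(confidence: Dict[str, float], threshold: float = 0.99) -> List[str]:
    """Samples whose posterior confidence exceeds the threshold need no more labels."""
    return [sid for sid, conf in confidence.items() if conf > threshold]


def k_center_greedy(points: Dict[str, Sequence[float]], k: int, seed: str) -> List[str]:
    """Greedily pick k points so that each new pick is farthest from those already chosen."""
    selected = [seed]
    min_dist = {sid: math.dist(vec, points[seed])
                for sid, vec in points.items() if sid != seed}
    while len(selected) < k and min_dist:
        nxt = max(min_dist, key=min_dist.get)
        selected.append(nxt)
        del min_dist[nxt]
        for sid in min_dist:
            min_dist[sid] = min(min_dist[sid], math.dist(points[sid], points[nxt]))
    return selected


def route_to_humans(confidence: Dict[str, float],
                    embeddings: Dict[str, Sequence[float]],
                    pool_frac: float = 0.10,
                    budget_frac: float = 0.05) -> List[str]:
    """Take the least confident pool, then pick a diverse subset sized as a share of all data."""
    n = len(confidence)
    ranked = sorted(confidence, key=confidence.get)  # least confident first
    pool = ranked[: max(1, round(pool_frac * n))]
    budget = max(1, round(budget_frac * n))
    pool_points = {sid: embeddings[sid] for sid in pool}
    return k_center_greedy(pool_points, budget, seed=pool[0])
```

With 40 samples the pool holds the 4 least confident items and 2 of them are sent to humans, chosen to be far apart in embedding space so that human effort is not spent on near-duplicates.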

MAFA translates the same idea into enterprise deployment. It combines a Query Planning Agent, four specialized annotation agents, and a Judge Agent, and reports elimination of a 1 million utterance backlog, 86% agreement with human annotators, and a typical confidence distribution of 85% high, 10% medium, and 5% low, with low-confidence cases routed to human review (Hegazy et al., 16 Oct 2025). It also reports 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better $F_1$ on the internal intent classification dataset relative to its one-agent baseline (Hegazy et al., 16 Oct 2025). Here the annotator is clearly a managed machine workforce rather than an isolated model.

Other work emphasizes adjudication more than scheduling. In fine-grained opinion analysis, multiple LLM annotators independently produce ASTE or ACOS structures, and a separate LLM adjudicator resolves them into final annotations. The strongest gains are on ACOS, where restaurant-domain medium-scale adjudication reaches $F_1 = 40.53$, compared with the best individual medium-scale annotator at 28.41 (Negi et al., 23 Jan 2026). LinguistAgent adopts a smaller dual-agent version of this pattern: an Annotator labels text with XML spans and a Reviewer critiques and revises. On metaphor identification, Reviewer Mode improves Gemini zero-shot from 0.5070 to 0.5753 and Gemini RAG from 0.4746 to 0.5809 (Li, 5 Feb 2026).

A plausible implication is that adjudication is emerging as a first-class annotation operation. The annotator no longer needs to be singular; it can be a panel, and the annotation itself can be the output of aggregation, review, or scheduling.
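The panel-plus-adjudicator shape can be sketched as below. The systems discussed above use an LLM as adjudicator; this sketch swaps in a plain majority vote as a stand-in so that it runs without a model, and the triplet layout and function names are assumptions.

```python
from collections import Counter
from typing import Callable, FrozenSet, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (aspect, opinion, sentiment), ASTE-style
Annotator = Callable[[str], Iterable[Triplet]]
Adjudicator = Callable[[List[FrozenSet[Triplet]]], FrozenSet[Triplet]]


def majority_adjudicator(candidates: List[FrozenSet[Triplet]]) -> FrozenSet[Triplet]:
    """Stand-in adjudicator: keep triplets proposed by more than half of the panel."""
    counts = Counter(triplet for cand in candidates for triplet in cand)
    return frozenset(t for t, n in counts.items() if n > len(candidates) / 2)


def annotate_with_panel(text: str,
                        annotators: List[Annotator],
                        adjudicate: Adjudicator = majority_adjudicator) -> FrozenSet[Triplet]:
    """Collect independent annotations, then let the adjudicator produce the final set."""
    candidates = [frozenset(annotator(text)) for annotator in annotators]
    return adjudicate(candidates)
```

Because the adjudicator is passed in as a callable, an LLM-backed implementation could replace the majority vote without changing the panel logic.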

5. Domain-specific realizations

The application space is wide, but the underlying pattern is stable: agents are given enough structure to act like domain-specific annotators rather than generic text generators.

Domain	System	Annotation role
Conversational image search and editing	CAISE	Assistant-annotators use search/edit tools and emit executable commands
Semantic table annotation	ReAct-based STA agent	Selects tools for CTA and CEA under ambiguity, abbreviations, and ontology constraints
Span annotation	LLM span annotators	Localize and classify spans under custom guidelines
Automated testing of agents	ATA	Generates adversarial tests, scores dialogues, and writes bug reports
Forestry remote sensing	TreeAgent	Executes expert decision trees with VLM node voting
Web-agent distillation	Agent-as-Annotators	Replaces task designer, annotator, and supervisor with modular LLM components

CAISE is an early and concrete instance of human stand-ins for agent behavior. It collects 1,611 dialogues and 6,173 task instances in which assistant-annotators use a customized image search and editing tool, and every tool action is recorded as an executable command. The baseline generator-extractor reaches 46.43 exact command accuracy, while human expert performance is 90.0 (Kim et al., 2022). The dataset therefore captures grounded action annotation, not just dialogue labeling.

Semantic table annotation shows the same pattern in a more knowledge-intensive setting. The ReAct-based STA agent uses five external tools, tailored prompts, DBpedia grounding, contextual selection for cell entity annotation (CEA) and column type annotation (CTA), and Levenshtein-based reuse of prior annotations. On Tough Tables, the Gemini variant reaches CTA $F_1 = 0.596$; the paper also reports the corresponding CEA score on Tough Tables and the CTA and CEA scores on BiodivTab. The same paper reports a 70% reduction in time costs and a 60% reduction in LLM token usage through redundancy reduction (Geng et al., 18 Aug 2025).
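Levenshtein-based reuse can be pictured as a fuzzy cache placed in front of the agent: a cell whose text is within a small edit distance of an already annotated cell inherits that annotation instead of triggering new tool calls. The class name, the case folding, and the distance threshold of 1 are assumptions, not details from the paper.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class AnnotationCache:
    """Hypothetical cache that reuses annotations for near-identical cell strings."""

    def __init__(self, max_distance: int = 1) -> None:
        self.max_distance = max_distance
        self._store: Dict[str, str] = {}

    def store(self, cell: str, annotation: str) -> None:
        self._store[cell.casefold()] = annotation

    def lookup(self, cell: str) -> Optional[str]:
        key = cell.casefold()
        best = min(self._store, key=lambda k: levenshtein(key, k), default=None)
        if best is not None and levenshtein(key, best) <= self.max_distance:
            return self._store[best]
        return None


cache = AnnotationCache(max_distance=1)
cache.store("Barack Obama", "dbr:Barack_Obama")
```

A lookup for the misspelled cell `"Barak Obama"` then returns the stored entity, while an unrelated cell returns `None` and falls through to the full agent.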

The span-annotation study moves the agent role from class labels to localized evidence. Across data-to-text evaluation, machine-translation error annotation, and propaganda detection, LLMs output structured JSON annotations with span text, label type, and justification, and the released resource contains more than 40k model and human annotations. On D2T-Eval, the reported estimated costs per 1k outputs include \$10.5 for Claude 3.7 Sonnet and \$3.6 for o3-mini (Kasner et al., 11 Apr 2025). This establishes span annotation as a practical, fine-grained LLM annotator regime rather than a scalar-judgment proxy.
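A structured span output of this kind might be parsed and checked as follows. The JSON field names (`annotations`, `text`, `type`, `reason`) and the label set are assumptions for illustration; the study's actual schema may differ.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class SpanAnnotation:
    text: str
    label: str
    reason: str
    start: Optional[int]  # character offset in the source, or None if not found


def parse_span_annotations(source: str,
                           raw_output: str,
                           allowed_labels: Set[str]) -> List[SpanAnnotation]:
    """Parse model output, drop unknown labels, and locate each span in the source."""
    spans = []
    for item in json.loads(raw_output).get("annotations", []):
        if item.get("type") not in allowed_labels:
            continue  # label outside the guideline's label set
        offset = source.find(item["text"])
        spans.append(SpanAnnotation(
            text=item["text"],
            label=item["type"],
            reason=item.get("reason", ""),
            start=offset if offset >= 0 else None,
        ))
    return spans
```

Spans whose text cannot be found in the source keep `start=None`, which makes hallucinated or paraphrased spans easy to filter or send to review.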

ATA shows that the annotator role can also become evaluative and adversarial. It performs code analysis, designer interrogation, literature mining, persona-driven dialogue generation, rubric-based judging, adaptive difficulty updates, and report generation, completing runs in 20–30 minutes versus ten-annotator rounds that took days (Komoravolu et al., 24 Aug 2025). TreeAgent, by contrast, illustrates a scientific-domain variant: the system executes expert-defined forestry labeling rules, invokes VLMs only at perceptual nodes, and uses majority voting at those nodes to mitigate stochasticity (Chen et al., 30 Jun 2026). The web distillation system “Agent-as-Annotators” then shows that these ideas also apply to synthetic trajectory creation: a single frontier teacher yields 41.5% on WebArena after supervised fine-tuning on 2,322 filtered trajectories (Lù et al., 9 Apr 2026).
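Majority voting over repeated calls to a stochastic perceptual node can be sketched as below. The function name and the default of three votes are hypothetical parameters chosen for illustration, not values taken from TreeAgent.

```python
from collections import Counter
from typing import Callable


def majority_vote(ask: Callable[[], bool], n_votes: int = 3) -> bool:
    """Query a stochastic yes/no node n_votes times and return the majority answer."""
    if n_votes < 1 or n_votes % 2 == 0:
        raise ValueError("n_votes must be a positive odd number to avoid ties")
    answers = Counter(ask() for _ in range(n_votes))
    return answers[True] > n_votes // 2


# Scripted stand-in for a VLM call that disagrees with itself once.
scripted = iter([True, False, True])
decision = majority_vote(lambda: next(scripted), n_votes=3)
```

A single dissenting answer does not flip the node's outcome, which is the stabilizing effect voting is meant to provide.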

6. Reliability, governance, and open problems

The literature is clear that “agent-as-annotators” is not equivalent to “humans are unnecessary.” Apollo is highly relevant to annotation-through-interaction, but it is not literal external-data labeling; its supervision object is the trajectory, not a third-party dataset (Fu et al., 31 Oct 2025). The nugget-annotation framework makes an even stronger normative claim: humans should define what matters, while LLMs only perform constrained matching (Dietz, 27 Jun 2026). These works narrow the machine role precisely to preserve accountability.

Reliability remains uneven, especially in expert domains. In finance, law, and biomedicine, individual LLM annotators with CoT, self-refine, or self-consistency show only marginal or even negative gains, reasoning models are usually not significantly better than non-reasoning models, and multi-agent discussion helps but remains bounded by initial answer quality; the paper also reports that Claude 3.7 Sonnet with thinking rarely changes its initial annotations even when other agents provide correct annotations or valid reasoning (Tseng et al., 11 Aug 2025). This is strong evidence against treating current LLM annotators as universal substitutes for expert humans.

Assessment and incentives are themselves an unresolved annotation problem. The principal-agent analysis of human preference annotation argues that annotators should be modeled as strategic agents who choose a continuous effort level and that self-consistency monitoring can outperform expert-agreement monitoring under heterogeneous preferences; it also characterizes the first-best/second-best gap separately for binary contracts and for linear contracts (Liu et al., 10 Feb 2025). A plausible implication is that future agent annotators, human or artificial, will need explicit monitoring and contract-like resource allocation rather than simple trust in raw outputs.

Aggregation also has limits. AURA can infer latent labels and annotator reliability without ground truth, but the same paper notes that if all agents are similarly biased, the framework can confidently infer wrong labels (Ghosh et al., 30 Jan 2026). CrowdAgent raises related concerns at the workflow level: LLM annotators may perpetuate social biases, automation may reduce the role of human annotators, and training SLMs on LLM outputs may violate model providers’ terms of use (Qin et al., 17 Sep 2025).

The broad direction is therefore mixed-initiative rather than fully autonomous annotation. Agents can now author tasks, propose labels, match nuggets, review one another, aggregate weak annotators, traverse expert rules, and write bug reports. But the strongest systems still reserve crucial functions for humans: defining criteria, validating weak points, auditing low-confidence cases, or writing the expert rules that the agents execute. The most durable formulation of Agent-as-Annotators is therefore not “agents replace annotation,” but “annotation becomes an orchestrated interaction among agents, explicit protocols, and human governance.”