ESAAI Pre-Annotations Overview
- ESAAI pre-annotations are automatically generated initial labels produced by domain-specific models to streamline human-in-the-loop annotation workflows.
- They integrate algorithmic strategies with human revision, balancing model efficiency with expert validation in fields such as medical imaging, document processing, and semantic segmentation.
- Empirical results show significant gains in annotation speed and quality, reducing task time while improving label consistency and data coverage.
Pre-annotations in Early Structured Annotation and Autoregressive Integration (ESAAI) workflows refer to automatically generated initial labels or markups presented to human annotators for review, correction, or augmentation prior to final data curation or model training. ESAAI pre-annotations leverage domain-specific or generic models to accelerate annotation, enhance label consistency, and enable scaling to large datasets while retaining critical human oversight. The methodology has been successfully applied across applications such as medical image segmentation, video event detection, machine translation evaluation, business document extraction, multimodal LLM dataset creation, and dense semantic segmentation. This article synthesizes state-of-the-art practices, mathematical strategies, empirical findings, and practical considerations for ESAAI pre-annotation pipelines.
1. Algorithmic Strategies for Pre-Annotation Generation
Modalities and domains employ diverse algorithms to generate pre-annotations. In medical imaging, registration-based transfer aligns easily annotated modalities (e.g., MRI) to harder-to-annotate domains (e.g., intra-op ultrasound), enabling label propagation via spatial transforms $T^{*} = \arg\min_{T}\, C(I_{\mathrm{MRI}} \circ T,\, I_{\mathrm{US}}) + \lambda\, R(T)$, where $C$ is a mixed intensity-gradient cost and $R$ regularizes rigid motion (Faanes et al., 2024).
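Once a transform has been estimated, label propagation itself is a resampling step. A minimal sketch, assuming a precomputed rigid/affine transform and using nearest-neighbor interpolation so that discrete labels stay discrete (the function name and transform parameterization are illustrative, not from the cited work):

```python
import numpy as np
from scipy.ndimage import affine_transform

def propagate_labels(mri_labels, matrix, offset, us_shape):
    """Warp an MRI label volume into ultrasound space using a
    precomputed transform (matrix, offset). order=0 selects
    nearest-neighbor interpolation, keeping labels discrete."""
    return affine_transform(
        mri_labels, matrix, offset=offset,
        output_shape=us_shape, order=0, mode="constant", cval=0)

# Toy example: an identity transform resamples the mask in place.
labels = np.zeros((4, 4), dtype=np.int32)
labels[1:3, 1:3] = 1
warped = propagate_labels(labels, np.eye(2), (0, 0), (4, 4))
```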
In NER and document understanding, pre-annotations are produced using sequence labeling models such as CRF or Transformer-based encoders. DALPHI applies active learning-driven CRF pre-annotations, recalibrating after annotator corrections (Greinacher et al., 2018), while DoSA bootstraps initial document-specific labels using LayoutLMv3 generic token tagging, geometric key-value linking, and subsequent human refinement (Shukla et al., 2022).
For MT evaluation, pre-annotations come from automatic quality estimation (QE) models optimized for high recall, segmenting text on span-level error predictions with minor/major severities. Pearmut and ESAAI use threshold-based token scorers and group contiguous "BAD" tokens into spans for annotators to post-edit (Zouhar et al., 6 Jan 2026, Zouhar et al., 2024).
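The span-priming step described above can be sketched as a threshold-based scorer that groups contiguous above-threshold tokens into spans; the function name and threshold value are assumptions for illustration:

```python
def bad_token_spans(scores, threshold=0.5):
    """Group contiguous tokens whose error score exceeds the
    threshold into (start, end) spans, end-exclusive."""
    spans, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans

# Tokens 1-2 and 5 exceed the threshold and form two spans.
print(bad_token_spans([0.1, 0.9, 0.8, 0.2, 0.3, 0.7]))  # → [(1, 3), (5, 6)]
```

Severity (minor/major) can then be attached per span from the QE model's severity head before the spans are handed to annotators for post-editing.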
Semantic segmentation in images transitions from pixel-wise suggestions to entity-superpixel overlays: ESA employs a class-agnostic mask proposal network and SLIC superpixels to construct discrete regions, then entropy-based active learning selects the most informative entities for the annotator under a strict click budget (Ge et al., 2024).
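The entropy-based selection step can be sketched as ranking candidate entities by predictive entropy and keeping only as many as the click budget allows; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def select_entities(class_probs, click_budget):
    """Rank candidate entities by predictive entropy and keep the
    most uncertain ones within the click budget.
    class_probs: (n_entities, n_classes) softmax outputs."""
    p = np.clip(class_probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.argsort(entropy)[::-1][:click_budget]

probs = np.array([[0.98, 0.01, 0.01],   # confident
                  [0.34, 0.33, 0.33],   # near-uniform: most uncertain
                  [0.70, 0.20, 0.10]])
picked = select_entities(probs, click_budget=2)
```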
Multimodal LLM datasets achieve multi-grained coverage via interleaved templates containing both coarse scene captions and fine-grained object/region annotations in a single token sequence (Xu et al., 2024). Pre-annotation for social media propaganda spans leverages few-shot prompted LLMs, outputting structured JSONs with extracted spans, explanations, and hierarchical taxonomy assignments (Sahitaj et al., 24 Jul 2025).
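LLM outputs for span pre-annotation are typically parsed and minimally validated before being surfaced to annotators. A minimal sketch, where the field names (`spans`, `text`, `technique`, `explanation`) are illustrative assumptions, not the exact schema of the cited work:

```python
import json

# Hypothetical shape of one LLM pre-annotation record.
raw = '''{"spans": [{"text": "our glorious leader",
                     "technique": "Loaded_Language",
                     "explanation": "emotionally charged phrasing"}]}'''

def parse_pre_annotations(payload):
    """Parse structured LLM output and keep only records carrying
    the required keys; malformed records are silently dropped."""
    data = json.loads(payload)
    out = []
    for span in data.get("spans", []):
        if {"text", "technique"} <= span.keys():
            out.append(span)
    return out

records = parse_pre_annotations(raw)
```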
2. Integration and Workflow Design
Pre-annotation pipelines integrate model-generated labels into human-in-the-loop annotation systems, typically as highlighted regions, spans, or tags in a GUI that facilitate efficient review, correction, and addition. For example, DALPHI overlays color-coded entity suggestions, allowing annotators to accept, delete, extend, relabel, or add spans; the corrected batch feeds directly into model retraining and uncertainty-based acquisition for the next cycle (Greinacher et al., 2018).
In DoSA, generic form-understanding yields initial labels and bounding boxes; annotators edit in a lightweight UI, followed by document-specific model fine-tuning and iterative active learning (Shukla et al., 2022). Pearmut attaches prefilled_error_spans arrays to campaign JSONs, presenting error highlights and severity for fast editing, seamlessly plugging into ESA and other annotation protocols (Zouhar et al., 6 Jan 2026).
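A campaign item carrying prefilled spans might look like the following sketch. Only the `prefilled_error_spans` key is named in the cited work; the inner field names and the example sentence pair are assumptions:

```python
import json

# Illustrative campaign item for span-primed MT evaluation.
# Inner field names (start_i, end_i, severity) are hypothetical.
item = {
    "src": "Der Hund bellt.",
    "tgt": "The dog barks loud.",
    "prefilled_error_spans": [
        {"start_i": 14, "end_i": 18, "severity": "minor"}
    ],
}
payload = json.dumps(item)
```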
Entity-superpixel active learning groups regions for semantic segmentation and minimizes redundancy, presenting compact overlays that annotators verify with a single click, dramatically reducing annotation workload compared to pixel-wise paint or polygon strategies (Ge et al., 2024).
EASE deploys three complementary backends—multi-task active learning, demographic-feature-based AL, and LLM prompt querying—each serving annotation suggestions via standard REST endpoints, configurable within a modular JSON-driven UI (Deng et al., 2023).
3. Quantitative Impact on Annotation Efficiency and Quality
Empirical studies consistently demonstrate substantial gains in annotation speed, label coverage, and inter-annotator consistency:
- Efficiency: ESAAI-style pre-annotations produced a 35% median reduction in video annotation time for 72% of annotators (146s vs. 190s per task) (Gutiérrez et al., 20 Oct 2025). Pearmut reports 32% time reduction for MT segment marking (32s vs. 47s) (Zouhar et al., 6 Jan 2026). DALPHI achieves ≈2-day savings for a large NER corpus at 50% recall (Greinacher et al., 2018). ESA reduces semantic segmentation click cost by 98% (102 clicks vs. ~5700 per image) (Ge et al., 2024).
- Quality: Inter-annotator agreement improves—Krippendorff’s α for fine-grained propaganda labels jumps from 0.1233 to 0.5941 with LLM assistance, and time per tweet shrinks nearly 4x (Sahitaj et al., 24 Jul 2025). Video annotation AMI/NMI/V-measure and semantic coherence (silhouette score) are higher with pre-annotations (Gutiérrez et al., 20 Oct 2025). ESAAI for MT evaluation increases Spearman ρ for span-derived scores from 0.327 to 0.671 (Zouhar et al., 2024).
- Data Coverage and Consistency: Active learning and pre-annotation strategies focus human effort on uncertain or informative samples. ESA’s entropy criterion ensures broad image structure coverage (Ge et al., 2024); DoSA’s active margin sampling yields rapid performance jumps in F1 score over iterations (Shukla et al., 2022).
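The margin-sampling criterion mentioned above selects samples where the model is least decided between its top two classes; a minimal sketch under assumed array shapes:

```python
import numpy as np

def margin_sampling(probs, k):
    """Pick the k samples with the smallest margin between the top
    two class probabilities (most ambiguous = most informative)."""
    sorted_p = np.sort(probs, axis=1)[:, ::-1]
    margin = sorted_p[:, 0] - sorted_p[:, 1]
    return np.argsort(margin)[:k]

probs = np.array([[0.90, 0.05, 0.05],
                  [0.45, 0.44, 0.11],   # top-2 margin 0.01: most ambiguous
                  [0.60, 0.30, 0.10]])
picked = margin_sampling(probs, k=1)
```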
4. Limitations, Model Bias, and Error Analysis
Current ESAAI approaches are constrained by pre-annotation quality, registration accuracy, and model bias:
- Quality Threshold: DALPHI finds ≈50% recall necessary for net annotation efficiency; below this, correction overhead neutralizes time savings (Greinacher et al., 2018).
- Registration Errors: In cross-modal medical image labeling, small tumors are prone to misregistration, resulting in noisy pseudo-labels; accuracy on slices <200 mm² remains an open problem (Faanes et al., 2024).
- Model Bias: QE priming may bias annotators towards the automatic model’s style or limit diversity of error types detected, especially if the QE and translation models share LLM architectures (Zouhar et al., 2024). LLM-generated labels in propaganda annotation, while boosting agreement, may struggle with highly overlapping definitions or limited fine-grained data (Sahitaj et al., 24 Jul 2025).
- Automation Bias: Studies report low levels of automation bias in span post-editing scenarios, as measured by sustained edit rates and attention-check performance (Zouhar et al., 2024, Zouhar et al., 6 Jan 2026).
5. Applications Across Domains
Pre-annotation methodologies have been validated in numerous research and practical contexts:
| Application Domain | Model/Technique | Main Benefit |
|---|---|---|
| Brain Tumor US Segmentation | MRI-to-US registration, nnU-Net | Scalable pseudo-labeled training (Faanes et al., 2024) |
| Named Entity Recognition | CRF, Transformer AL | Efficiency and correctness (Greinacher et al., 2018, Deng et al., 2023) |
| Business Document Forms | LayoutLMv3 bootstrapping | Fast startup and iterative refinement (Shukla et al., 2022) |
| Video Event Annotation | CLIP-ViT zero-shot, Label Studio | Annotation speed, homogeneity (Gutiérrez et al., 20 Oct 2025) |
| MT Quality Assessment | GPT-4-based QE, ESAAI | Halved time, improved agreement (Zouhar et al., 2024, Zouhar et al., 6 Jan 2026) |
| Multimodal LLM Datasets | ESAAI interleaved templates | Enhanced model task accuracy (Xu et al., 2024) |
| Semantic Segmentation | Entity-superpixel AL | Extreme annotation efficiency (Ge et al., 2024) |
| Social Media Propaganda | LLM structuring + distillation | Agreement, speed, scalability (Sahitaj et al., 24 Jul 2025) |
6. Best Practices and Future Directions
Key recommendations include:
- Employ slice-area exclusion filters (e.g., ≥200mm² for US images) to remove noise in small-object labeling (Faanes et al., 2024).
- Use self-configuring frameworks (nnU-Net, modular UIs) to minimize hyperparameter tuning and manual pipeline design (Faanes et al., 2024, Deng et al., 2023).
- Prioritize high recall in QE models for error span priming, accepting a drop in precision that subsequent human corrections absorb (Zouhar et al., 2024).
- Combine coarse and fine-grained annotation recipes for multimodal models, with curriculum upsampling of multi-grained sets in late pretraining (Xu et al., 2024).
- Monitor for automation and model bias via attention checks, explicit partitioning of QE and model tasks, or distinct LLMs (Zouhar et al., 2024).
- Pipeline integration should expose pre-annotation generation, human review, and iterative model update steps in code modules, preferably open-source for reproducibility and domain transferability (Shukla et al., 2022, Zouhar et al., 6 Jan 2026, Xu et al., 2024).
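The high-recall priming recommendation above can be operationalized by choosing, on held-out data, the highest score threshold that still meets a target recall; the function name and data are illustrative:

```python
def threshold_for_recall(scores, labels, target_recall):
    """Return the highest decision threshold whose recall on held-out
    data meets the target, deliberately trading precision for
    coverage as recommended for QE-based span priming."""
    candidates = sorted(set(scores), reverse=True)
    positives = sum(labels)
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        if positives and tp / positives >= target_recall:
            return t
    return min(scores)

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1,   0,   1,   1,   0]
t = threshold_for_recall(scores, labels, target_recall=0.9)  # → 0.4
```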
Identified limitations—registration for small tumors, model error in small-object segmentation, generalization to new languages or data types—motivate future exploration of transformer-based encoders, fine-grained loss functions, multimodal self-supervision, and more sophisticated active learning criteria.
7. Context and Evolution in the Field
Pre-annotation techniques within ESAAI frameworks represent an evolution from manual, domain-expert-only annotation strategies toward modular, scalable, human-in-the-loop data curation. Crossover from active learning, zero-shot vision-language models, structured LLM prompting, and self-configuring deep segmentation architectures increasingly allows pre-annotation systems to approach expert-level quality—even in domains traditionally considered too variable or high-noise for automation. Empirical validation in peer-reviewed arXiv publications establishes pre-annotation as a cornerstone for dataset expansion, model efficiency, and annotation cost containment across cutting-edge AI applications.