QuMATL Framework for Annotator Tendency Learning
- QuMATL is a framework that models annotator-specific tendencies using learnable query tokens rather than averaging out subjective biases.
- Its architecture leverages shared self-attention and cross-attention to capture individual behaviors and inter-annotator correlations.
- The framework introduces novel metrics like DIC and validates its approach on large-scale datasets such as STREET and AMER for robust, interpretable results.
QuMATL (Query-based Multi-Annotator Tendency Learning) is a framework designed to preserve and model annotator-specific tendencies in multi-annotator learning scenarios. It represents a departure from conventional consensus-oriented aggregation approaches, explicitly capturing individual annotator behavior through learnable query mechanisms. QuMATL formalizes the Multi-Annotator Tendency Learning (MATL) task, establishes a query-based architectural strategy, introduces novel evaluation metrics, and provides large-scale benchmark datasets for empirical validation (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
1. Definition and Motivation
QuMATL is formulated to address the intrinsic limitations of consensus-based multi-annotator learning, wherein disagreement among annotators is typically treated as noise and averaged out in pursuit of a single ground truth. Such averaging can obscure valuable information arising from individual annotator backgrounds, expertise, or labeling bias—collectively termed "tendencies." The core motivation of QuMATL is to preserve and leverage this diversity by explicitly learning each annotator’s tendency or behavior pattern, thereby enabling more faithful modeling and analysis of subjective tasks, especially when annotation is scarce or inherently ambiguous (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
2. Methodological Framework
The QuMATL methodology centers on a query-based modeling approach. For each input sample, the pipeline involves:
- Application of a frozen, pre-trained encoder to extract feature representations from the raw input (image, video, or other modalities).
- Assignment of a learnable query vector $q_i$ to each annotator $i$. These parameterized query tokens are intended to encode annotator-specific tendencies and labeling style.
- Passage of all annotator queries through a Q-Former module, comprising:
  - A shared self-attention block that lets all queries interact, capturing inter-annotator correlations and providing regularization by exposing structure in annotator agreement and disagreement.
  - A cross-attention block in which each annotator's query attends to the encoded features, modulating its output in a manner tailored to that annotator's historic focus and bias.
- Downstream task-specific heads generate individual predictions for each annotator.
The framework is suitable for scenarios with dense or sparse annotation matrices, as queries and their shared interactions allow extrapolation of annotator-specific predictions even for samples lacking exhaustive annotation coverage (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
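A minimal PyTorch sketch of this pipeline is shown below. The choice of encoder, hidden dimension, single-block Q-Former, and linear heads are illustrative assumptions rather than the published configuration; the sketch only traces the flow from frozen features through learnable per-annotator queries, shared self-attention, cross-attention, and per-annotator heads.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """Shared self-attention over annotator queries, followed by
    cross-attention from each query to the frozen input features."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, features):
        # Shared self-attention: queries exchange information, exposing
        # inter-annotator agreement/disagreement structure.
        q, _ = self.self_attn(queries, queries, queries)
        queries = self.norm1(queries + q)
        # Cross-attention: each annotator's query attends to the features.
        q, attn = self.cross_attn(queries, features, features)
        queries = self.norm2(queries + q)
        return queries, attn  # attn can be visualized as per-annotator heatmaps

class QuMATLSketch(nn.Module):
    def __init__(self, encoder, num_annotators, dim=768, num_classes=5):
        super().__init__()
        self.encoder = encoder                      # frozen, pre-trained extractor
        for p in self.encoder.parameters():
            p.requires_grad = False
        # One learnable query token per annotator.
        self.queries = nn.Parameter(torch.randn(num_annotators, dim) * 0.02)
        self.qformer = QFormerBlock(dim)
        # One lightweight prediction head per annotator.
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_annotators)]
        )

    def forward(self, x):
        feats = self.encoder(x)                     # assumed (B, T, dim) token features
        B = feats.size(0)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        queries, attn = self.qformer(queries, feats)
        # Per-annotator logits: (B, num_annotators, num_classes)
        logits = torch.stack(
            [head(queries[:, i]) for i, head in enumerate(self.heads)], dim=1
        )
        return logits, attn
```

In this sketch only the query tokens, the Q-Former, and the per-annotator heads receive gradients; the backbone stays frozen, so the marginal cost per annotator is a single query vector plus a small head.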
3. Loss Functions and Metrics
Training in QuMATL proceeds using a per-annotator loss aggregation. For $N$ annotators,

$$\mathcal{L} = \sum_{i=1}^{N} \ell\big(\hat{y}_i, y_i\big),$$

where $\hat{y}_i$ denotes the prediction vector and $y_i$ the reference label for annotator $i$ on a given sample. This objective enforces that each annotator's tendency is individually optimized, while the self-attention across queries imparts implicit supervisory signals that capture shared structure and prevent overfitting in the face of sparse supervision (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
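As a concrete illustration, a hedged sketch of this per-annotator aggregation follows, assuming a classification setting, logits shaped as in the architecture sketch above, and a sentinel value of -1 marking samples an annotator did not label; the sentinel and masking strategy are assumptions, not part of the published objective.

```python
import torch
import torch.nn.functional as F

def per_annotator_loss(logits, labels, ignore_index=-1):
    """logits: (B, N, C) per-annotator predictions; labels: (B, N) integer
    targets, with ignore_index marking samples an annotator did not label."""
    total, count = 0.0, 0
    num_annotators = logits.size(1)
    for i in range(num_annotators):
        mask = labels[:, i] != ignore_index
        if mask.any():
            # Each annotator's predictions are supervised only by that
            # annotator's own labels.
            total = total + F.cross_entropy(logits[mask, i], labels[mask, i])
            count += 1
    return total / max(count, 1)
```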
A key evaluation metric introduced for MATL is the Difference of Inter-annotator Consistency (DIC):

$$\mathrm{DIC} = \left\lVert C_{\mathrm{gt}} - C_{\mathrm{pred}} \right\rVert_F,$$

where $C_{\mathrm{gt}}$ and $C_{\mathrm{pred}}$ are the inter-annotator consistency matrices (e.g., computed using Cohen's kappa) for ground-truth and predicted labels, respectively, and $\lVert\cdot\rVert_F$ denotes the Frobenius norm. Lower DIC values indicate greater fidelity in preserving annotator-specific patterns of agreement and disagreement (Zhang et al., 19 Mar 2025).
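The following sketch computes DIC as described, using scikit-learn's `cohen_kappa_score` for the pairwise consistency entries; the label-matrix layout (annotators by samples) is an assumed convention.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def consistency_matrix(labels):
    """labels: (num_annotators, num_samples) integer label matrix.
    Returns the pairwise Cohen's kappa matrix."""
    n = labels.shape[0]
    C = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            k = cohen_kappa_score(labels[i], labels[j])
            C[i, j] = C[j, i] = k
    return C

def dic(gt_labels, pred_labels):
    """Difference of Inter-annotator Consistency: Frobenius norm of the
    difference between ground-truth and predicted consistency matrices."""
    C_gt = consistency_matrix(gt_labels)
    C_pred = consistency_matrix(pred_labels)
    return np.linalg.norm(C_gt - C_pred, ord="fro")
```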
4. Model Architecture: Query and Self-Attention Mechanisms
QuMATL replaces the computationally expensive approach of training a wholly separate model for each annotator with a strategy employing lightweight, learnable queries. Each query token can be interpreted as a compact embedding of annotator behavior, co-evolving through training to represent focus regions, subjective biases, or latent discriminative criteria unique to its corresponding annotator.
Central to the architectural innovation is the shared self-attention module within the Q-Former, which enables joint modeling of annotator-specific and inter-annotator regularities. The cross-attention mechanism then projects these query-derived perspectives onto the input feature map, isolating the semantic regions each annotator is likely to weight most heavily when assigning a label. The approach is further validated by visualization of cross-attention heatmaps, which yield interpretable correlates of annotator focus, such as heightened attention to objects, individuals, or specific events in complex data (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
5. Datasets and Experimental Validation
QuMATL is evaluated using two large-scale datasets, both developed specifically to probe subjective tendencies in multi-annotator settings:
- STREET: Focused on urban scene perception, featuring approximately 4,300 dense labels per annotator. It captures assessments across dimensions such as happiness, healthiness, safety, liveliness, and orderliness in cityscapes (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
- AMER: A multimodal video emotion recognition dataset, providing an average of 3,118 labels per annotator and encompassing audio, video, and text streams. This dataset supports analysis of temporally dynamic and multimodal subjective judgments (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
Empirical results indicate that QuMATL outperforms baselines (e.g., D-LEMA, PADL, MaDL) both in per-annotator accuracy and F₁ score. The DIC metric attests to the enhanced capacity of QuMATL to preserve annotator-specific patterns. Robustness is maintained under conditions of reduced annotation density, and the architecture scales efficiently due to the low-parameter query mechanisms (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
| Dataset | Labels per Annotator | Data Type |
|---|---|---|
| STREET | ~4,300 | Urban scene images |
| AMER | ~3,118 | Multimodal (video/text/audio) |
6. Visualization, Explainability, and Implications
Through the analysis of attention maps generated by the Q-Former’s cross-attention layer, QuMATL offers visually grounded explanations of annotator decisions. Distinct annotator queries focus on disparate semantic elements within an image or video—such as people, animals, or environmental features—mirroring the individual's reported tendencies. In settings where interpretability of subjective assessment is paramount (clinical diagnosis, emotion recognition, urban planning), these visualizations provide transparency and diagnostic insight into automated judgment processes (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).
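Assuming a model that returns cross-attention weights per annotator query (as in the architecture sketch above), a per-annotator heatmap can be recovered by mapping the attention over feature tokens back onto the spatial grid; the 14x14 token grid and the absence of special tokens are illustrative assumptions.

```python
import torch

def annotator_heatmap(attn, annotator_idx, grid_hw=(14, 14)):
    """attn: (B, num_annotators, num_tokens) cross-attention weights from the
    Q-Former; returns a (B, H, W) map of where one annotator's query attends."""
    h, w = grid_hw
    weights = attn[:, annotator_idx, :]          # (B, num_tokens)
    heat = weights.reshape(-1, h, w)             # assumes tokens form an HxW grid
    # Min-max normalize per sample for visualization.
    lo = heat.amin(dim=(1, 2), keepdim=True)
    hi = heat.amax(dim=(1, 2), keepdim=True)
    return (heat - lo) / (hi - lo + 1e-8)
```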
The approach also facilitates inference of missing labels in sparsely annotated label matrices, potentially lowering annotation costs while enhancing the statistical reliability of consensus predictions. Theoretical implications include a reframing of annotator disagreement not as noise, but as a statistical signal with longitudinal behavior patterns amenable to regularized multi-view modeling. The use of shared attention as implicit regularization connects to broader machine learning themes in multi-view and multi-task learning (Zhang et al., 23 Jul 2025).
7. Applications and Broader Significance
QuMATL’s query-based modeling paradigm is applicable to any scenario involving subjective, multi-annotator judgment, including but not limited to medical image analysis, perceptual studies, and sentiment/emotion recognition. By transforming how annotator-specific information is incorporated and preserved, QuMATL enables downstream workflows that are more robust, interpretable, and reflective of real-world variability in expert judgment.
A plausible implication is that the broader field of multi-annotator learning will increasingly incorporate annotator behavior modeling, leveraging structured disagreement as an information-rich input to improve training, consensus reliability, and explainability—especially when deployed in environments where annotation coverage is incomplete or subjective variation is pronounced (Zhang et al., 19 Mar 2025, Zhang et al., 23 Jul 2025).