Likert-Based HATAS Protocol Overview
- Likert-based HATAS is a framework that employs discrete, ordinal Likert scales with explicit rationale annotations to capture nuanced features such as offensiveness and factual consistency.
- It integrates targeted labels and quality-control measures, including annotator screening via the Cognitive Reflection Test (CRT), to enhance annotation reliability and mitigate bias.
- The protocol facilitates downstream model training and evaluation by mapping raw ratings into standardized categories for tasks like hate speech detection and factual consistency assessment.
A Likert-based Human- and Target-Annotated Scoring (HATAS) protocol denotes the application of discrete, often ordinal, Likert-type rating scales in the human annotation of nuanced phenomena (toxicity, hate speech, factual consistency, etc.), typically supplemented with explicit labels demarcating both the intended annotation target (such as protected groups or system-generated textual outputs) and rationales for each instance. This framework underpins several influential corpora and evaluation campaigns, facilitating fine-grained, multidimensional labeling suitable for downstream machine learning tasks, benchmarking, and model introspection.
1. Likert Scale Foundations and Category Schemes
Likert-based HATAS protocols are predicated on discrete, anchored rating scales where annotators assign values to instances across dimensions such as offensiveness or factual consistency. In K-HATERS (Park et al., 2023), the Likert scale is three-point, spanning:
- 0 ("1"): Absence of any derogatory, harassing, or pejorative content.
- 1 ("2" or "offensive-likely"): Borderline, subtle cases (e.g., sarcasm, implicit stereotypes).
- 2 ("obviously or seriously offensive"): Clear, unambiguous hate or toxicity (e.g., explicit slurs, threats).
Each data point receives fourteen discrete ratings spanning protected-group offensiveness levels, general offensiveness flags, and fine-grained subcategories. Rationale highlighting is mandatory for all nonzero judgments: annotators mark the phrase-level text spans that motivated their rating and separately highlight spans corresponding to targets (e.g., the referenced group or individual).
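For concreteness, a single annotated instance can be pictured as the record below. The field names are illustrative assumptions, not the released K-HATERS schema:

```python
# Hypothetical record for one annotated comment. Field names are
# illustrative assumptions, not the released K-HATERS schema.
annotation = {
    "text": "example comment text ...",
    # 3-point ratings (0/1/2) of offensiveness toward protected groups
    "group_ratings": {"gender": 2, "age": 0, "religion": 0},
    # general, non-group offensiveness ratings
    "general_ratings": {"insult": 1, "profanity": 0},
    # rationale spans: character offsets that motivated nonzero ratings
    "offense_spans": [(0, 15)],
    # spans marking the referenced target (group or individual)
    "target_spans": [(4, 10)],
}
```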
Factual consistency HATAS tasks, as in "Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries" (Tang et al., 2021), often employ 5-point or 10-point vertical Likert scales with anchors such as:
- 1 = Completely Inconsistent
- 5 = Completely Consistent
Annotation instructions stress careful reading and explicit mapping of each summary back to its source document.
2. Annotation Methodology, Training, and Quality Controls
Crowdsourced annotation is standard. In K-HATERS (Park et al., 2023), annotation is performed via CashMission, with 405 annotators passing a qualification consisting of guideline training plus a pilot annotation round with feedback. Each instance is rated by a single annotator, with no redundancy or adjudication, emphasizing breadth at the expense of classical inter-annotator agreement.
Quality is controlled through several mechanisms:
- Cognitive Reflection Test (CRT): Annotators complete a 6-question CRT (scored 0–6) before labeling. The CRT is used as a proxy for attentiveness and resulting annotation quality.
- Rule-based filtering: Items are excluded if annotation logic is violated (e.g., missing mandatory rationales or ill-formed spans).
- No explicit inter-annotator agreement metrics (e.g., Cohen's κ, Krippendorff's α) are reported for K-HATERS, as only one label per instance exists. Instead, stratified model performance (CRT=0 versus CRT>0) and rationale completeness are used as validation signals.
- In contrast, for factual consistency, each summary is rated by three distinct Amazon Mechanical Turk workers. Reliability is reported via Krippendorff's α and split-half system-level reliability (SHR) (Tang et al., 2021).
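As an illustration of the factual-consistency setup, agreement over a raters-by-items matrix can be computed with the third-party `krippendorff` package; the ratings below are invented for the example:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = raters, columns = summaries; np.nan marks a missing rating.
# Values are 5-point Likert consistency ratings (invented for illustration).
ratings = np.array([
    [1, 3, 5, 2, np.nan],
    [2, 3, 4, 2, 5],
    [1, 4, 5, np.nan, 4],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha (ordinal): {alpha:.3f}")
```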
3. Label Transformation and Model Integration
Raw Likert ratings can be mapped to downstream classification labels or used directly as supervision targets. In K-HATERS, a deterministic mapping converts the 14-dimensional discrete Likert vector into unified abusive-language categories (ALC), enumerated below (see the code sketch after this list):
- "Normal": If all ratings = 0.
- "Offensive": Any non-protected-group rating > 0 and all protected-group ratings = 0.
- "Hate": Any protected-group rating > 0, further bifurcated into:
- "Level-2 Hate": Max group rating = 2 and rationale span present.
- "Level-1 Hate": Max group rating = 1, or non-group offensiveness > 0 with no group rationale.
Multi-label target (TGT) flags denote if any protected group is referenced (rating > 0).
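A minimal sketch of this mapping, assuming the rating vector has been split into group and non-group components; function and argument names are mine, and edge cases follow the description above rather than the released code:

```python
def map_to_alc(group_ratings, general_ratings, has_group_rationale):
    """Deterministic 4-way ALC mapping sketched from the rules above.

    group_ratings: list of 0-2 ratings toward protected groups.
    general_ratings: list of 0-2 non-group offensiveness ratings.
    has_group_rationale: True if a group-target rationale span was marked.
    """
    if max(group_ratings) == 0:
        # No protected group is rated: Normal vs. Offensive.
        return "Offensive" if max(general_ratings) > 0 else "Normal"
    # Some protected-group rating is nonzero: Hate, split by severity.
    if max(group_ratings) == 2 and has_group_rationale:
        return "Level-2 Hate"
    return "Level-1 Hate"


def target_flags(group_ratings):
    """Multi-label TGT flags: True for each group rated above zero."""
    return [r > 0 for r in group_ratings]
```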
Model training objectives span three variants:
- Label-transformed: predict the unified ALC (+TGT) labels.
- Label-transformed with rationale: jointly predict ALC (+TGT) labels and rationale spans.
- Raw ratings: predict each of the discrete Likert ratings directly.
All losses use categorical cross-entropy; no ordinal losses are employed (Park et al., 2023).
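A hedged PyTorch sketch of the raw-ratings objective, with one 3-way softmax head per rating dimension over a shared encoding; the architecture details are assumptions, not the K-HATERS release:

```python
import torch
import torch.nn as nn

class RatingHeads(nn.Module):
    """One 3-class classifier per rating dimension on a shared encoding."""
    def __init__(self, hidden_dim: int, n_dims: int = 14, n_levels: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, n_levels) for _ in range(n_dims))
        self.ce = nn.CrossEntropyLoss()

    def forward(self, encoding, targets):
        # encoding: [batch, hidden_dim]; targets: [batch, n_dims] int64 in {0,1,2}
        logits = [head(encoding) for head in self.heads]
        # Plain categorical cross-entropy per head; no ordinal structure used.
        return sum(self.ce(l, targets[:, k]) for k, l in enumerate(logits))
```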
For factual consistency, per-item ratings are usually aggregated via mean, yielding a quasi-continuous system-level score for model comparison (Tang et al., 2021).
4. Reliability, Bias, and Annotation Protocol Comparisons
Instance-level reliability for Likert-based HATAS is contingent on annotation granularity, scale choice, and annotation context.
- In factual consistency, 5-point Likert exhibits low-to-moderate instance-level Krippendorff's α (0.04–0.22) and moderate-to-high system-level SHR (0.46–0.93), with 10-point scales further boosting both metrics (α up to 0.30, SHR up to 0.95). Best-Worst Scaling with value-learning (BWS_value) achieves the highest overall reliability (α ≈ 0.30, SHR ≈ 0.93) (Tang et al., 2021).
- K-HATERS lacks reported agreement coefficients but demonstrates through ablation that CRT=0 annotators yield models with degraded accuracy and Macro-F1, larger fairness violations (higher TPR/FPR gaps across groups), and less faithful rationales.
- Single-annotator-per-item workflows necessitate indirect reliability validation, such as rationale span completeness and CRT-stratified performance.
Table: Comparative reliability metrics (selected from Tang et al., 2021)
| Protocol | Instance α (CNN/DM) | System SHR (XSum) |
|---|---|---|
| LS-5 | 0.044 | 0.928 |
| LS-10 | 0.129 | 0.948 |
| BWS_value | 0.293 | 0.930 |
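One way to approximate the system-level SHR column above: repeatedly split each item's raters in half, score systems by mean rating within each half, and correlate the two resulting system rankings. The exact splitting and correlation choices in Tang et al. may differ, so this is a sketch:

```python
import numpy as np
from scipy.stats import spearmanr

def split_half_reliability(ratings, n_trials=100, seed=0):
    """Approximate SHR. ratings: dict system -> [n_items, n_raters] array."""
    rng = np.random.default_rng(seed)
    systems = sorted(ratings)
    corrs = []
    for _ in range(n_trials):
        scores_a, scores_b = [], []
        for s in systems:
            r = ratings[s]
            perm = rng.permutation(r.shape[1])
            half_a, half_b = perm[: r.shape[1] // 2], perm[r.shape[1] // 2:]
            scores_a.append(r[:, half_a].mean())  # system score, first half
            scores_b.append(r[:, half_b].mean())  # system score, second half
        corrs.append(spearmanr(scores_a, scores_b).correlation)
    return float(np.mean(corrs))
```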
Regional bias and annotator drift can significantly impair Likert-scale effectiveness unless counteracted by instruction design, per-worker annotation quotas, and reannotation tasks.
5. Empirical Performance and Best Practices
Empirical experiments in K-HATERS reveal:
- Model performance (KcBERT backbone) is superior when using the 4-way ALC+TGT mapping versus treating all 14 Likert outputs as independent labels (Micro-F1 0.681 vs. 0.663).
- Models trained with ALC labels yield slightly fairer predictions (lower maximum FPR diff across groups: 0.31 vs. 0.338).
- Adding rationale supervision (H+T+R) improves explainability (plausibility and faithfulness) with only a minor loss in F1 (Park et al., 2023).
- Ordinal-aware loss functions are not utilized.
For factual consistency HATAS, the paper recommends that scale granularity be matched to the expected heterogeneity of system outputs (e.g., 10-point scales when systems are tightly clustered) and that system-level split-half reliability be prioritized over instance-level agreement when benchmarking models (Tang et al., 2021).
6. Bias Mitigation through Annotator Screening and Rationale
K-HATERS demonstrates that annotator CRT scores correlate with annotation quality.
- Models trained on data from annotators with the lowest CRT (score = 0) underperform on accuracy, Macro-F1, and fairness, and generate less faithful rationale explanations compared to models trained on annotators with CRT > 0.
- The CRT is proposed as a filtering or down-weighting mechanism to reduce downstream bias.
- Explicit rationale-anchored annotation further inhibits careless or random labeling, offering both direct supervision for attention-based models and a transparent mechanism for error analysis (Park et al., 2023).
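A plausible realization of CRT-based down-weighting; the weighting scheme itself is an assumption, not taken from the paper:

```python
def crt_sample_weight(crt_score: int, max_score: int = 6) -> float:
    """Drop labels from CRT = 0 annotators; otherwise weight by CRT."""
    return 0.0 if crt_score == 0 else crt_score / max_score

# Example: attach per-example training weights to a labeled dataset.
dataset = [
    {"label": "Hate", "annotator_crt": 0},    # weight 0.0 (filtered out)
    {"label": "Normal", "annotator_crt": 5},  # weight ~0.83
]
for ex in dataset:
    ex["weight"] = crt_sample_weight(ex["annotator_crt"])
```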
7. Limitations and Open Issues
Single-annotator per instance schemes, while cost-effective, preclude direct computation of inter-annotator agreement and increase reliance on indirect measures such as rationale completeness and CRT-based splits.
- No continuous-offensiveness or "soft score" ratings are directly derived from Likert scales in K-HATERS.
- Loss functions do not exploit ordinal Likert structure, treating all categories as nominal.
- Comparative studies (Tang et al., 2021) indicate that relative judgment protocols (e.g., Best-Worst Scaling) may yield superior reliability to pure Likert, especially when outputs are homogeneous.
A plausible implication is that Likert-based HATAS protocols benefit from explicit bias and quality monitoring and are best suited to contexts where interpretable, labeler-transparent rationales are critical, even as fully ordinal or ranking-based frameworks may supersede them in maximizing inter-annotator reliability.
References:
- "K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific Ratings" (Park et al., 2023)
- "Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries" (Tang et al., 2021)