Misogyny Annotation Guidelines
- Misogyny annotation guideline schemes are formal frameworks that standardize the labeling of misogynistic language with clear taxonomies, severity scales, and operational definitions.
- They incorporate binary, multi-class, and ordinal labeling methods, utilizing examples like Let-Mi, BiaSWE, and PejorativITy for nuanced categorization.
- Protocols include comprehensive annotator training, high inter-rater agreement measures, and iterative refinement to enhance NLP model performance.
Misogyny annotation guideline schemes are formal frameworks designed to train annotators for the systematic identification, categorization, and measurement of misogynistic language across diverse linguistic, cultural, and social contexts. These guidelines operationalize complex theoretical constructs and label taxonomies in a reproducible manner, enabling high inter-rater agreement and reliable data annotation for downstream computational models in NLP, computational social science, and online content moderation. This article presents a synthesis of contemporary misogyny annotation guideline schemes, drawing from validated protocols used across Arabic, English, German, Swedish, Italian, and multilingual contexts.
1. Annotation Task Formalization and Scope
Annotation schemes delineate the scope of misogyny detection, explicitly specifying which linguistic units (tweet, comment, subtitle, forum post) are to be annotated. Guidelines prescribe detailed taxonomies:
- Binary Category: Most schemes require an initial binary decision—misogynistic vs. non-misogynistic. Example: NONE/MISOGYNY (Let-Mi, Arabic) (Mulki et al., 2021), GEN/NGEN (ComMA, Multilingual) (Bhattacharya et al., 2020).
- Multi-Class and Subtype Labels: Schemes extend binary with mutually exclusive subcategories (e.g., seven behavior-based types in Levantine Twitter: Discredit, Derailing, Dominance, Stereotyping & Objectification, Sexual Harassment, Threat of Violence, Damning) (Mulki et al., 2021), or unified typologies such as Stereotype, Erasure, Sexualisation, Violence, Anti-feminism (BiaSWE, Swedish) (Kukk et al., 11 Feb 2025).
- Ordinal Severity: Certain protocols adopt scalar severity, such as 0–4 in German forums (no, mild, present, strong, extreme) (Petrak et al., 2022) or 1–10 in Swedish (minimal to maximal) (Kukk et al., 11 Feb 2025).
- Word- and Sentence-Level Disambiguation: Advanced schemes (PejorativITy, Italian) require annotators to label words as PEJ (pejorative) or NEU (neutral), and sentences as MIS (misogynistic) or NON (non-misogynistic) (Muti et al., 2024).
Annotation targets, taxonomies, and severity ratings must be tailored for language and domain specificity, with operational definitions included for reproducibility.
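As an illustration, a binary-plus-subtype-plus-severity scheme of the kind described above can be encoded as a small validated record. The class and field names below are hypothetical; the subtype inventory is modeled on the Let-Mi taxonomy (Mulki et al., 2021) and the 0–4 ordinal scale on the German forum protocol (Petrak et al., 2022):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subtype inventory modeled on the Let-Mi taxonomy (Mulki et al., 2021).
LET_MI_SUBTYPES = {
    "discredit", "derailing", "dominance",
    "stereotyping_objectification", "sexual_harassment",
    "threat_of_violence", "damning",
}

@dataclass
class Annotation:
    """One labeled unit (tweet, comment, etc.) under a binary + subtype + severity scheme."""
    text: str
    misogynistic: bool                 # binary gate, decided first
    subtype: Optional[str] = None      # only meaningful when misogynistic
    severity: Optional[int] = None     # e.g. 0-4 ordinal scale (Petrak et al., 2022)

    def __post_init__(self):
        # Subtype and severity presuppose a positive binary decision.
        if not self.misogynistic and (self.subtype or self.severity):
            raise ValueError("subtype/severity require a positive binary label")
        if self.subtype is not None and self.subtype not in LET_MI_SUBTYPES:
            raise ValueError(f"unknown subtype: {self.subtype}")
```

Encoding the operational definitions as validation logic mirrors how guideline documents constrain annotators: dependent fields are only legal once the binary gate has been passed.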
2. Label Taxonomies: Subtype Systems and Construct Grounding
Schemes draw from psychological, feminist, and sociolinguistic theories:
- Ambivalent Sexism Framework: Applied in both English-language subtitle annotation (Sheppard et al., 2023) and theoretically grounded schemes (Deligianni et al., 24 Jan 2026), this includes explicit hostile sexism (derogation) and benevolent sexism (paternalistic compliments that reinforce gender roles).
- Behavioral and Functional Subcategories: Each subtype is defined by its linguistic behavior and intent—examples include Discredit (direct insults, no sexual/violent content), Derailing (shifting blame or topic) (Mulki et al., 2021), or Erasure (denial or minimization of women’s experience) (Kukk et al., 11 Feb 2025). Subtypes are typically mutually exclusive per instance, though multi-labeling is permitted in schemes such as the eleven-layer taxonomy for online misogyny (Strathern et al., 2022).
- Operationalization Table
| Guideline Source | Taxonomy Root | Subcategories / Layers |
|---|---|---|
| Let-Mi (Mulki et al., 2021) | Binary + 7 subtypes | Discredit, Derailing, Dominance, Stereotyping, Sexual Harassment, Threat, Damning |
| ComMA (Bhattacharya et al., 2020) | Binary + contextual tag | Attack on gender roles, sexual slurs, gender swap test, victim blaming |
| BiaSWE (Kukk et al., 11 Feb 2025) | Binary + 5 categories | Stereotyping, Erasure, Sexualisation, Violence, Anti-feminism |
| German Forums (Petrak et al., 2022) | Ordinal (0–4) | Absence, Mild, Present, Strong, Extreme |
| PejorativITy (Muti et al., 2024) | Word/Sentence | PEJ/NEU (word), MIS/NON (sentence) |
Subtype definitions are always accompanied by example utterances and key decision cues, ensuring annotator consistency.
3. Annotation Protocols: Stepwise Procedures, Heuristics, and Edge Case Resolution
All schemes detail stepwise procedures and decision rules for annotators:
- Taxonomy Navigation: Annotators follow a guided workflow starting from binary existence detection, moving to dominant subtype choice or severity rating.
- Contextual Heuristics: Protocols such as ComMA (Bhattacharya et al., 2020) employ the "intentionality test," "gender swap test" for jokes, clarity rules for satire/sarcasm, adjudication tags for ambiguous cases (UNC), and separate handling for poetry/code-mixed input.
- Disambiguation by Context: PejorativITy (Muti et al., 2024) specifies rules to label words as pejorative only if used insultingly in immediate context, including reported speech and metaphor.
- Multi-label and Priority Rules: For multilayered content, annotators assign all relevant subcategories, with the explicit label taking precedence over implicit in the eleven-layer English scheme (Strathern et al., 2022).
- Target and Agency: Annotators may assign an ACTIVE (addressed) or PASSIVE (generic) target label in the Arabic guidelines (Mulki et al., 2021), or apply agency criteria (Biasly, English) (Sheppard et al., 2023).
- Handling Ambiguity: UNC tags (ComMA (Bhattacharya et al., 2020)), “Uncertain” (Eleven-layer (Strathern et al., 2022)), or leaving unresolved cases in the raw data for future adjudication (BiaSWE (Kukk et al., 11 Feb 2025)) are standard.
Procedures are supported by calibration batches, consensus discussions, and reference rubrics.
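The gate-then-categorize workflow common to these protocols can be sketched as a minimal Python function; the callback parameters are hypothetical stand-ins for the annotator's three decisions and do not come from any published guideline:

```python
# Hypothetical stepwise annotation workflow: binary gate -> subtype -> severity.
# The decision steps are those prescribed by multi-stage schemes; the callback
# interface is an illustrative assumption.

def annotate(text: str, is_misogynistic, pick_subtype, rate_severity) -> dict:
    """Guide an annotator (or a model) through the decision sequence:
    existence first, then dominant subtype, then ordinal severity."""
    label = {"text": text, "misogynistic": False, "subtype": None, "severity": 0}
    if not is_misogynistic(text):            # step 1: binary existence decision
        return label                         # negative items get no further labels
    label["misogynistic"] = True
    label["subtype"] = pick_subtype(text)    # step 2: dominant subtype choice
    label["severity"] = rate_severity(text)  # step 3: ordinal severity (e.g. 0-4)
    return label
```

Structuring the procedure this way makes the guideline's priority ordering explicit: subtype and severity questions are never posed for items judged non-misogynistic at step 1.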
4. Quality Control: Training, Inter-Annotator Agreement, and Protocol Refinement
Annotation effectiveness is systematically assessed:
- Annotator Onboarding: Requires training batches (e.g., ≥500 items, ComMA (Bhattacharya et al., 2020)), kick-off calibration meetings (BiaSWE (Kukk et al., 11 Feb 2025)), and reference rubrics/examples (German Forums (Petrak et al., 2022)).
- Inter-Annotator Agreement (IAA): Metrics include Cohen's κ, Krippendorff's α, and observed versus expected agreement. Reported values: κ=0.36–0.64 (German forums, five-class; binary achieves higher), α=0.68 (theory-based scheme) (Deligianni et al., 24 Jan 2026), and κ=0.50–0.53 (PejorativITy, Italian) (Muti et al., 2024). BiaSWE reports raw agreement, noting a min/max severity difference ≤3 for 91% of posts (Kukk et al., 11 Feb 2025).
- Disagreement Resolution: Majority voting, group adjudication, and iterative refinement are used, with areas of high divergence informing guideline updates.
- Document Structure: Published guidelines generally include definitions, operational cues, stepwise instructions, worked examples, platform screenshots, and conflict-resolution protocols.
Quality control is essential for reliable dataset creation and reproducibility.
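For reference, Cohen's κ for two annotators is a standard chance-corrected agreement measure: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e the agreement expected from each annotator's label marginals. A plain-Python implementation:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
    agreement and p_e the agreement expected by chance from the two
    annotators' label marginals. Undefined when p_e == 1 (both annotators
    use a single identical label throughout)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement from the product of per-category marginals.
    marg_a, marg_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(marg_a[c] * marg_b[c] for c in marg_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

On a binary task, two annotators who agree on every item score κ = 1.0, while agreement at exactly the chance rate scores κ = 0.0; values in the 0.36–0.68 range reported above reflect moderate to substantial agreement.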
5. Application Impact and Validation in Model Training
Annotated corpora provide ground truth for supervised learning and model evaluation:
- Dataset Benchmarking: Schemes such as Let-Mi (Arabic) (Mulki et al., 2021), PejorativITy (Italian) (Muti et al., 2024), BiaSWE (Swedish) (Kukk et al., 11 Feb 2025), and the Biasly English dataset (Sheppard et al., 2023) have been used for binary, multi-label, and severity classification tasks, achieving state-of-the-art results in their language domains.
- Injection of Disambiguated Knowledge: PejorativITy (Muti et al., 2024) uses word-level sense disambiguation to inject [PEJ/NEU] labels (CONCAT/SUBST strategies) into sentence classification. Macro F₁ scores improved from 0.68 (baseline) to 0.75 (CONCAT) and 0.77 (SUBST), with upper bounds of 0.83/0.87 using gold labels.
- Cross-linguistic Adaptation: Schemes adapted to local slang, context, and antifeminist tropes (BiaSWE (Kukk et al., 11 Feb 2025)), and handle code-mixed, poetry, and indirect language (ComMA (Bhattacharya et al., 2020)).
- Platform Integration: Label Studio and spreadsheet drop-downs are common interfaces for annotation, supporting workflow scaling.
Validated annotation guidelines demonstrably increase the accuracy, robustness, and cultural sensitivity of misogyny detection models.
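A minimal sketch of the CONCAT/SUBST idea from PejorativITy (Muti et al., 2024): word-level sense labels are injected into the sentence-classification input either by appending the disambiguated word-label pair or by substituting it in place. The marker templates below are illustrative assumptions; the exact formats used in the paper may differ:

```python
# Illustrative injection of word-level pejorativity labels into sentence input,
# in the spirit of PejorativITy's CONCAT/SUBST strategies (Muti et al., 2024).
# Marker formats ("[SEP]", bracketed labels) are assumptions, not the paper's exact templates.

def concat_strategy(sentence: str, word: str, label: str) -> str:
    """CONCAT: append the disambiguated word-label pair after the sentence."""
    return f"{sentence} [SEP] {word} = {label}"

def subst_strategy(sentence: str, word: str, label: str) -> str:
    """SUBST: replace the ambiguous word in place with its disambiguated marker."""
    return sentence.replace(word, f"[{label}]")
```

Either augmented string is then fed to the sentence classifier in place of the raw text, which is how the word-level disambiguation signal reaches the sentence-level model.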
6. Theoretical Grounding and Ongoing Challenges
Recent theoretical schemes advocate explicit links to psychological and feminist frameworks:
- Construct Alignment: Explicit mapping to Ambivalent Sexism Theory, Gender Essentialism, Toxic Masculinity, Gendered Racism, Post-Feminism (Deligianni et al. (Deligianni et al., 24 Jan 2026)).
- Empirical Superiority: Theory-grounded coding schemes outperform conventional or “expert” annotation rubrics, especially for subtle forms of misogyny not captured by mainstream definitions or prevailing LLMs.
- Limitations: Subjectivity in severity rating, context loss in decontextualized posts, unresolved code-mixing, and annotation bias remain open issues (Petrak et al., 2022, Kukk et al., 11 Feb 2025). Calibration, consensus, and regular guideline updates are advocated.
- Model Generalization Difficulty: LLMs trained on guideline-based annotations struggle to match human performance, reflecting mainstream, non-theoretical conceptions of misogyny (Deligianni et al., 24 Jan 2026).
Continued iteration, multi-disciplinary input, and theoretical rigor remain necessary to capture the actual spectrum of misogynistic discourse in annotation protocols.
7. Summary Table: Annotation Scheme Features
| Scheme (Paper) | Language/Domain | Taxonomy & Severity | IAA Metric(s) | Platform |
|---|---|---|---|---|
| Let-Mi (Mulki et al., 2021) | Levantine Arabic | Binary + 7 subtypes | Not stated | Not stated |
| Biasly (Sheppard et al., 2023) | English subtitles | 4 classes, subtypes | Not stated | Not stated |
| ComMA (Bhattacharya et al., 2020) | Hindi/Bangla/English | Binary + contextual tag | Not stated (recommended) | Not stated |
| German (Petrak et al., 2022) | German forums | Ordinal 0–4 | κ=0.36–0.64, α=0.39–0.63 | Spreadsheet |
| BiaSWE (Kukk et al., 11 Feb 2025) | Swedish forums | Binary + 5, Severity 1–10 | Raw agreement | Label Studio |
| PejorativITy (Muti et al., 2024) | Italian tweets | Word+Sentence (PEJ/MIS) | κ=0.50–0.53 | Not stated |
| Theory (Deligianni et al., 24 Jan 2026) | English (general) | 6 theory categories | α=0.68 | Not stated |
| Eleven-layer (Strathern et al., 2022) | English social media | 11 categories (explicit/implicit/other) | Not stated | Not stated |
These schemes collectively reflect the contemporary state of misogyny annotation, demonstrating methodological rigor, domain adaptation, construct validity, and technical reproducibility suitable for advanced computational research.