MeMo Dataset: Multimodal Misogyny Benchmark
- MeMo Dataset is a benchmark designed for multimodal analysis of misogynistic content, combining expert and crowd annotations with manual text transcriptions.
- It comprises 800 carefully selected memes balanced between misogynistic and non-misogynistic instances, offering detailed binary labels and confidence scores.
- The dataset supports multimodal modeling, e.g., fusion architectures combining CNN image encoders with transformer text encoders, evaluated via standard splits or k-fold cross-validation.
The MeMo Dataset, introduced in "Benchmark dataset of memes with text transcriptions for automatic detection of multi-modal misogynistic content" (Gasparini et al., 2021), is a rigorously constructed benchmark designed for research into the automatic multi-modal detection of misogyny in web memes. It combines expertly selected and crowdsourced binary labels with full manual transcriptions of meme text, facilitating both visual and textual approaches to cybersexism and technology-facilitated violence analysis. Below, the dataset's design, annotation protocols, available modalities, and recommended usage patterns are detailed.
1. Dataset Composition and Annotation Schema
The MeMo dataset consists of a curated collection of 800 memes, precisely balanced between 400 misogynistic and 400 non-misogynistic instances. Selection was performed by three domain experts; only memes labeled unanimously were admitted, so primary class assignments and associated attributes carry full expert agreement. For each misogynistic meme, binary labels of "aggressiveness" and "irony" were assigned independently by experts (suffix "DE") and by crowd annotators (suffix "CS").
The annotation structure per meme includes:
- memeID
- manual text transcription
- binary expert labels: misogynisticDE, aggressiveDE, ironicDE
- binary crowd labels: misogynisticCS, aggressiveCS, ironicCS
- agreement fractions: confidence_M_CS, confidence_A_CS, confidence_I_CS
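The per-meme structure above can be mirrored as a small record type. A minimal sketch, assuming field names matching the CSV columns listed (the example values are hypothetical, not drawn from the dataset):

```python
from dataclasses import dataclass

@dataclass
class MemeRecord:
    # Field names follow the annotation schema described above.
    memeID: str
    text: str                 # manual transcription of overlaid text
    misogynisticDE: bool      # expert labels
    aggressiveDE: bool
    ironicDE: bool
    misogynisticCS: bool      # crowd labels
    aggressiveCS: bool
    ironicCS: bool
    confidence_M_CS: float    # crowd agreement fraction, in {1/3, 2/3, 3/3}
    confidence_A_CS: float
    confidence_I_CS: float

# Hypothetical example record for illustration only.
example = MemeRecord(
    memeID="0001", text="example overlay text",
    misogynisticDE=True, aggressiveDE=False, ironicDE=True,
    misogynisticCS=True, aggressiveCS=False, ironicCS=True,
    confidence_M_CS=3/3, confidence_A_CS=2/3, confidence_I_CS=2/3,
)
print(example.memeID, example.confidence_M_CS)
```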
Crowdsourcing involved 60 diverse workers aged 20-50, with each meme rated independently by three annotators; label confidence is defined as the fraction of crowd annotators agreeing on a given binary decision. No formal Cohen's κ or Krippendorff's α values are reported; however, the standard formula for Cohen's κ is referenced:

κ = (P_o − P_e) / (1 − P_e)

where P_o is the observed agreement and P_e the expected chance agreement.
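Although the article reports no κ values, the referenced formula is straightforward to compute between any two raters' binary labels. A minimal sketch (the input labels below are illustrative, not from the dataset):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over binary labels:
    kappa = (P_o - P_e) / (1 - P_e)."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters coincide.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)

# Illustrative only: agreement between two hypothetical raters.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # → 0.5
```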
2. Source Acquisition and Selection Protocols
Memes were collected over October–November 2018 from Facebook, Twitter, Instagram, Reddit, and meme-dedicated sites. Misogynistic memes were sourced using targeted keyword searches (#girl, #women, #feminist, etc.), selected forum threads (e.g., MGTOW), and domains with documented sexist activity. Non-misogynistic memes were drawn in parallel from identical sources and keywords to avoid trivial negative sampling.
Expert raters reviewed all downloaded memes, assigning judgments for misogyny, irony, and aggressiveness. Only cases of unanimous expert agreement for primary misogyny were retained, enforcing a perfectly balanced dataset with robust class definition.
Crowdsourced annotation used the Figure Eight (Appen) platform with strict per-contributor limits (max 40 memes, 90 minutes) and randomized ordering to avoid bias. Annotators answered a hierarchical set of questions per meme:
- "In your opinion, is this meme misogynistic?"
- (Conditional) "Is it ironic?"
- (Conditional) "Is it aggressive?"

No operational definitions of the terms were provided, so the labels capture each annotator's native perception.
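The hierarchical flow above (follow-up questions only for memes judged misogynistic) can be sketched as a small routine; `ask` here is a hypothetical stand-in for the annotation interface, not part of any released tooling:

```python
def annotate(ask):
    """Hierarchical question flow: irony and aggressiveness are asked
    only when the meme is first judged misogynistic. `ask` is a callable
    returning True/False for each question string."""
    labels = {"misogynistic": ask("In your opinion, is this meme misogynistic?")}
    if labels["misogynistic"]:
        labels["ironic"] = ask("Is it ironic?")
        labels["aggressive"] = ask("Is it aggressive?")
    else:
        labels["ironic"] = labels["aggressive"] = None  # not asked
    return labels
```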
3. Modalities and Data Format
For each meme:
- JPEG image resized to a max dimension of 640 px (zero-padded naming convention, e.g., 0001.jpg–0800.jpg).
- Manual transcription of any overlaid text, with no OCR or automated recognition employed.
The annotation CSV contains memeID, full text, six binary labels, three confidence fractions, and is indexed by unique meme identifier.
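Pairing annotation rows with their image files is then a matter of joining on the zero-padded memeID. A minimal loading sketch; the file name `annotations.csv` and the flat directory layout are assumptions, so adjust to the actual repository contents:

```python
import csv
from pathlib import Path

def load_memo(root):
    """Read the annotation CSV and attach the path of each meme's JPEG,
    assuming zero-padded file names (0001.jpg ... 0800.jpg) alongside a
    CSV named 'annotations.csv' (hypothetical name, not an official path)."""
    root = Path(root)
    rows = []
    with open(root / "annotations.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["image_path"] = root / f"{row['memeID']}.jpg"
            rows.append(row)
    return rows
```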
4. Dataset Availability and Access Procedures
The MeMo dataset is hosted at https://github.com/MIND-Lab/MEME. Access is password-protected: users must agree to a copyright notice to obtain the data package (images and CSV annotations), per the repository instructions. No formal open-source license is declared in the associated Data in Brief article.
5. Evaluation Protocols and Modeling Practices
No experimental baselines are furnished in the Data in Brief article. For modeling, related work (cf. ACIIW 2019) proposes fusion architectures that combine CNN image encoders with transformer-based text embeddings via early or late fusion. Standard train/validation/test splits (e.g., 70%/10%/20%) are recommended, but k-fold cross-validation (typically k = 5 or k = 10) can be used to maximize data utility given the dataset's modest size.
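Because the dataset is exactly balanced, stratified folds preserve the 50/50 class ratio in every split. A minimal pure-Python sketch (k = 5 is a common choice, not mandated by the dataset):

```python
import random

def stratified_kfold(labels, k=5, seed=42):
    """Stratified k-fold: indices of each class are shuffled and dealt
    round-robin into k folds, preserving the class balance per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds

# MeMo's label distribution: 400 misogynistic, 400 non-misogynistic.
labels = [1] * 400 + [0] * 400
folds = stratified_kfold(labels, k=5)
```

With 800 items and k = 5, each fold holds 160 memes (80 per class), so every test fold mirrors the full dataset's balance.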
Recommended metrics include accuracy, precision, recall, and F1-score for misogyny detection:

precision = TP / (TP + FP),  recall = TP / (TP + FN),  F1 = 2 · precision · recall / (precision + recall)
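These metrics for the positive (misogynistic) class can be sketched in a few lines of pure Python; the label vectors below are illustrative only:

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class:
    F1 = 2 * precision * recall / (precision + recall)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative example: 2 true positives, 1 false positive, 1 false negative.
p, r, f1 = prf1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```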
6. Intended Research Uses and Limitations
Applications encompass multimodal hate and misogyny detection on social media, pretraining/fine-tuning transformer architectures (e.g., VisualBERT, CLIP), crowdsourcing validation benchmarking, and perceptual studies on expert-public labeling gaps.
Limitations explicitly noted include:
- Dataset size (800 memes) limits suitability for complex deep models without augmentation.
- Crowd annotation agreement may be as low as 1/3 per item, compared to expert unanimity.
- Manual transcriptions are subject to annotator error (punctuation and casing).
- Absence of fine-grained misogyny subtype distinctions (e.g., body shaming, objectification).
Table: MeMo Dataset Structure
| Field | Description | Type |
|---|---|---|
| memeID | Unique zero-padded image identifier | String |
| text | Manual transcription of meme overlay text | String |
| misogynisticDE | Expert label: misogynistic content | Boolean |
| aggressiveDE | Expert label: aggressiveness (if misogynistic) | Boolean |
| ironicDE | Expert label: irony (if misogynistic) | Boolean |
| misogynisticCS | Crowd label: misogynistic content | Boolean |
| aggressiveCS | Crowd label: aggressiveness (if misogynistic) | Boolean |
| ironicCS | Crowd label: irony (if misogynistic) | Boolean |
| confidence_M_CS | Fraction of crowd agreement: misogyny | {1/3, 2/3, 3/3} |
| confidence_A_CS | Fraction of crowd agreement: aggressiveness | {1/3, 2/3, 3/3} |
| confidence_I_CS | Fraction of crowd agreement: irony | {1/3, 2/3, 3/3} |
All field definitions strictly follow those established in the official annotation CSV.
7. Context and Significance
MeMo is a foundational resource enabling rigorous evaluation and development of multimodal misogyny detection systems. Its stringent expert selection protocol ensures label certainty for core tasks, supporting social and computational studies where both perception and technical cues play critical roles. By offering per-item crowd agreement and manual transcriptions, MeMo facilitates nuanced analyses of labeling reliability and modality fusion. Limitations on granularity and dataset size, as noted, should be considered when designing high-capacity models or exploring fine-grained hate typologies.
MeMo thus provides a platform for systematic exploration of multimodal cues in misogynistic content identification and for benchmarking approaches that incorporate both visual and textual signal fusion (Gasparini et al., 2021).