MultiHateClip: Multilingual Multimodal Hate Detection
- MultiHateClip is a multilingual, multimodal dataset for detecting hate speech, providing fine-grained annotations for hateful, offensive, and normal video clips from English and Chinese platforms.
- The dataset combines automated filtering with ChatGPT and a rigorous two-stage human annotation process to capture modality-specific cues and victim group labels.
- Benchmark evaluations highlight challenges in differentiating subtle hate from offensive content, stressing the need for advanced multimodal fusion and culturally aware models.
The MultiHateClip dataset is a multilingual, multimodal benchmark specifically designed for the fine-grained detection of hateful and offensive video content across English and Chinese, with a cross-platform focus on YouTube and Bilibili. It advances hate speech research by extending annotation granularity beyond binary labels and by systematically incorporating human annotation of contributing modalities and victim groups, providing a culturally comparative framework for analyzing gender-based hate in online video ecosystems (Wang et al., 28 Jul 2024).
1. Dataset Composition and Construction
MultiHateClip consists of 2,000 short video clips (≤60 s each), evenly divided between English (YouTube) and Chinese (Bilibili). Video candidates were retrieved via the official platform APIs using 80 gender-based hate keywords per language, derived from the HurtLex and SWSR lexicons. An initial programmatic filter using ChatGPT (GPT-3.5) was applied to titles and transcripts, after which a two-stage human annotation process labeled the 2,000 retained clips.
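A minimal sketch of the LLM pre-filtering stage, assuming an OpenAI-style client; the exact prompt wording, model configuration, and decision rule used by the authors are not documented here and are illustrative only:

```python
# Sketch of the programmatic pre-filter applied to title + transcript.
# Hypothetical prompt and threshold; final labels come from human annotators.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_prefilter(title: str, transcript: str) -> bool:
    """Coarse yes/no pre-filter: does the clip plausibly contain
    gender-based hateful or offensive content?"""
    prompt = (
        "Does the following video title and transcript plausibly contain "
        "gender-based hateful or offensive content? Answer yes or no.\n"
        f"Title: {title}\nTranscript: {transcript}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```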
The annotation workflow incorporated:
- Two crowd annotators per clip, drawn from a pool of 18 Asian undergraduates balanced by gender and language proficiency
- Conflict resolution by a third annotator, with escalation to domain experts (two PhD students) for persistent disagreement
The annotation schema included three mutually exclusive categories:
- Hateful: inciting discrimination or demeaning protected groups
- Offensive: distressing or vulgar, not targeting protected attributes
- Normal: neither hateful nor offensive
For Hateful and Offensive videos, annotators also marked:
- Start and end times of the segment
- Target group (Man/Woman/LGBTQ+/Other)
- Contributing modality (Text, Audio, Vision)
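Taken together, the per-clip annotation can be represented as a simple record; the following sketch uses illustrative field names, not the released schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Label(Enum):
    HATEFUL = "hateful"
    OFFENSIVE = "offensive"
    NORMAL = "normal"

@dataclass
class ClipAnnotation:
    video_id: str
    language: str                     # "en" (YouTube) or "zh" (Bilibili)
    label: Label                      # one of three mutually exclusive classes
    # Fields below are populated only for Hateful/Offensive clips.
    start_s: Optional[float] = None   # start of the harmful segment (seconds)
    end_s: Optional[float] = None     # end of the harmful segment (seconds)
    targets: list[str] = field(default_factory=list)     # e.g. ["Woman", "LGBTQ+"]
    modalities: list[str] = field(default_factory=list)  # subset of {"text", "audio", "vision"}
```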
Cohen’s κ inter-annotator agreement for multiclass labeling was κ = 0.62 (English) and κ = 0.51 (Chinese); for the binary grouping (Hateful+Offensive vs. Normal), κ = 0.72 (English) and κ = 0.66 (Chinese).
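For reference, both agreement figures follow the standard Cohen's κ computation over paired label vectors; a generic scikit-learn sketch with placeholder labels:

```python
from sklearn.metrics import cohen_kappa_score

# Paired labels from the two crowd annotators (placeholder values).
annotator_a = ["hateful", "offensive", "normal", "normal", "offensive"]
annotator_b = ["offensive", "offensive", "normal", "normal", "hateful"]

# Multiclass agreement over the three labels.
kappa_multi = cohen_kappa_score(annotator_a, annotator_b)

# Binary grouping: Hateful + Offensive collapsed vs. Normal.
to_binary = lambda y: ["harmful" if l != "normal" else "normal" for l in y]
kappa_binary = cohen_kappa_score(to_binary(annotator_a), to_binary(annotator_b))
print(f"multiclass κ = {kappa_multi:.2f}, binary κ = {kappa_binary:.2f}")
```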
2. Label Distribution and Demographic Details
The overall label distribution differs cross-linguistically. For the English (YouTube) split: 8.2% Hateful, 25.6% Offensive, 66.2% Normal. For the Chinese (Bilibili) split: 12.8% Hateful, 19.4% Offensive, 67.8% Normal. Average video length is 34 s (English) and 32 s (Chinese).
Victim group statistics by language (selected categories, clip counts):
| Language | Women (H/O) | Men (H/O) | LGBTQ (H/O) | Others (H/O) |
|---|---|---|---|---|
| English | 40 / 121 | 28 / 60 | 35 / 30 | 17 / 34 |
| Chinese | 59 / 97 | 54 / 50 | 32 / 29 | 48 / 13 |
The "Others" category refers primarily to religion/race (English) or nationality (Chinese). The Chinese split exhibits a higher rate of Hateful content, while both languages most frequently target women. The data indicate that Chinese clips feature a greater incidence of multimodal hate expression.
3. Multimodal Contribution Analysis
MultiHateClip annotations capture, for each Hateful/Offensive video, the modality or combination of modalities responsible for conveying the harmful content. The majority of videos require more than one modality for correct identification.
Distribution of contributing modalities for Hateful/Offensive labels:
| Modality | English (H/O) | Chinese (H/O) |
|---|---|---|
| Text only | 12 / 97 | 19 / 38 |
| Audio only | 1 / 2 | 0 / 0 |
| Vision only | 0 / 12 | 0 / 6 |
| Text ⊙ Audio | 25 / 42 | 16 / 25 |
| Text ⊙ Vision | 7 / 46 | 40 / 63 |
| Text ⊙ Audio ⊙ Vision | 37 / 55 | 53 / 61 |
Unimodal indicators (text, audio, vision) alone are often insufficient; for example, many hateful Chinese videos require joint analysis of text and vision. Qualitative analysis found that distinguishing Hateful vs. Offensive content frequently depends on nuanced multimodal cues. Text-based TF-IDF analysis identifies dominant hate vocabulary, including explicit slurs in both languages.
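The vocabulary analysis can be approximated with a standard TF-IDF pass over class-specific transcripts; a sketch in which the document lists are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hateful_docs: title+transcript strings for Hateful clips (placeholders).
hateful_docs = ["...transcript of one hateful clip...",
                "...transcript of another hateful clip..."]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(hateful_docs)

# Rank terms by mean TF-IDF weight across the Hateful class.
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = np.array(vectorizer.get_feature_names_out())
top_terms = terms[mean_weights.argsort()[::-1][:20]]
print("Dominant Hateful-class vocabulary:", top_terms)
```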
Audio features such as high amplitude and zero-crossing rate are more prevalent in Hateful/Offensive English videos, suggesting increased loudness and noisiness. Visual analysis using YOLOv3 finds that “person” detection occurs in ~70% of English Hateful/Offensive frames (vs. 63% for Normal videos), but detection rates are markedly lower for Chinese videos, underscoring cross-cultural biases in vision models pretrained on Western data.
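The cited audio statistics correspond to standard librosa features; a sketch assuming a local audio file path:

```python
import librosa
import numpy as np

# Load a clip's audio track (hypothetical path), mono, native sampling rate.
y, sr = librosa.load("clip.wav", sr=None, mono=True)

# Mean absolute amplitude as a loudness proxy.
amplitude = float(np.mean(np.abs(y)))

# Frame-level zero-crossing rate as a noisiness proxy.
zcr = float(np.mean(librosa.feature.zero_crossing_rate(y)))

print(f"mean |amplitude| = {amplitude:.4f}, mean ZCR = {zcr:.4f}")
```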
4. Benchmark Evaluation and Model Performance
The benchmark task is three-way classification into Hateful, Offensive, or Normal, using various combinations of textual (T), audio (A), and visual (V) modalities. A binary evaluation (Hateful+Offensive vs. Normal) is also reported.
Core model architectures include:
- Text-only: mBERT; GPT-4 and Qwen-VL (text-only input variants)
- Audio-only: MFCC features, Wav2Vec2-BERT
- Vision-only: ViViT, ViT+LSTM
- Vision-language: VLM, GPT-4V, Qwen-VL
- Multimodal fusion: M1 (late-fusion of mBERT, MFCC, ViViT)
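A minimal late-fusion sketch in the spirit of M1, assuming pooled features from frozen mBERT, MFCC, and ViViT encoders have already been computed; the feature dimensions and fusion rule are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Late fusion: each modality gets its own linear classifier over
    precomputed pooled features, and class logits are averaged.
    Dims are illustrative (mBERT 768-d, MFCC 40-d, ViViT 768-d)."""
    def __init__(self, dims=(768, 40, 768), num_classes=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in dims])

    def forward(self, text_feat, audio_feat, vision_feat):
        feats = (text_feat, audio_feat, vision_feat)
        logits = [head(f) for head, f in zip(self.heads, feats)]
        return torch.stack(logits).mean(dim=0)  # average per-modality logits

# Usage with a dummy batch of 4 clips.
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 40), torch.randn(4, 768))
```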
Macro-F1 is used as the primary evaluation metric. Topline results:
| Model | Macro-F1 (EN multi) | Acc (EN multi) | Macro-F1 (ZH multi) | Acc (ZH multi) |
|---|---|---|---|---|
| GPT-4V (VL2) | 0.63 | 0.77 | 0.47 | 0.66 |
| M1 (mBERT⊙MFCC⊙ViViT) | 0.54 | 0.69 | 0.50 | 0.68 |
| ViViT (best unimodal) | 0.49 | 0.66 | 0.48 | 0.63 |
Binary task results: English—GPT-4V Macro-F1 0.79, Acc 0.81; M1 Macro-F1 0.74, Acc 0.75. For Chinese, M1 achieves Macro-F1 0.78, Acc 0.80 (mBERT-only: 0.65, 0.67).
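Both the multiclass and binary scores follow the standard macro-F1 and accuracy definitions; a generic scikit-learn sketch with placeholder predictions:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["hateful", "offensive", "normal", "normal", "offensive"]
y_pred = ["offensive", "offensive", "normal", "hateful", "normal"]

# Three-way task: macro-F1 averages per-class F1 equally, so the rare
# Hateful class weighs as much as Normal despite fewer examples.
macro_f1 = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

# Binary task mirrors the annotation grouping: Hateful+Offensive vs. Normal.
collapse = lambda y: ["harmful" if l != "normal" else "normal" for l in y]
macro_f1_bin = f1_score(collapse(y_true), collapse(y_pred), average="macro")
```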
Vision-language models (notably GPT-4V) are most effective for English, particularly for implicit hate detection, while late fusion (M1) improves over the best unimodal models in both English and Chinese. However, all models have difficulty separating Hateful from merely Offensive content, with F1(Hateful) as low as 0.0 in several configurations.
5. Dataset Features, Schema, and Access
MultiHateClip provides annotations and metadata for each video:
- Multiclass and binary labels (Hateful, Offensive, Normal)
- Start/end timestamps of the hateful/offensive segment
- Victim group(s) and responsible input modality/modalities
Preprocessing and schema:
- Text: stop-word filtered title+transcript concatenation
- Audio: raw waveform with MFCC or Wav2Vec2 features
- Vision: uniformly sampled at 1 fps, padded/truncated to 60 frames
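A sketch of the visual preprocessing step (1 fps sampling, pad/truncate to 60 frames), assuming OpenCV for decoding; the zero-padding choice is illustrative:

```python
import cv2
import numpy as np

def sample_frames(path: str, fps_out: float = 1.0, max_frames: int = 60) -> np.ndarray:
    """Sample ~fps_out frames/s, then pad (zeros) or truncate to max_frames."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if fps metadata is missing
    step = max(int(round(src_fps / fps_out)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    frames = frames[:max_frames]
    if frames and len(frames) < max_frames:  # zero-pad short clips
        frames.extend([np.zeros_like(frames[0])] * (max_frames - len(frames)))
    return np.stack(frames) if frames else np.zeros((max_frames, 1, 1, 3), np.uint8)
```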
Dataset splits are stratified (70% train, 10% validation, 20% test). Full video files are not redistributed; only metadata and extracted feature sets are accessible, subject to a non-commercial, research-only license and institutional approval.
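A stratified 70/10/20 split can be reproduced with two successive stratified calls; the id and label arrays below are placeholders:

```python
from sklearn.model_selection import train_test_split

video_ids = [f"vid_{i}" for i in range(2000)]  # placeholder ids
labels = ["normal"] * 1340 + ["offensive"] * 450 + ["hateful"] * 210  # placeholder counts

# 70% train vs. 30% temp, stratified on the 3-way label.
train_ids, temp_ids, train_y, temp_y = train_test_split(
    video_ids, labels, test_size=0.30, stratify=labels, random_state=0)

# Split temp into 10% val / 20% test of the full set (1/3 vs. 2/3 of temp).
val_ids, test_ids, _, _ = train_test_split(
    temp_ids, temp_y, test_size=2 / 3, stratify=temp_y, random_state=0)
```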
6. Challenges, Limitations, and Prospects
Three core challenges are identified:
- Low separability between Hateful and Offensive classes due to annotation ambiguity and subtle expression, especially in implicit hate.
- Cultural and domain bias, particularly in vision models (e.g., YOLOv3 yields low detection rates for Chinese content relative to English).
- Data sparsity in Hateful labels, restricting effective supervised model training.
Recommended directions include augmenting corpora with additional languages and non-Western content, developing more sophisticated multimodal fusion architectures (e.g., early fusion, cross-attention), incorporating external world knowledge for implicit hate, enriching target categories (e.g., race, religion), and collecting longer audiovisual samples.
7. Significance and Applications
MultiHateClip is the first benchmark to provide a comprehensive, cross-lingual, multimodal resource for video hate speech detection with fine-grained annotation of protected groups and explicit multimodal source labeling (Wang et al., 28 Jul 2024). The dataset enables:
- Training and assessment of multimodal hate content classifiers
- Comparative analysis of cross-cultural hate expressions in video
- Exploration of modality-specific and joint cues in online hate
- Investigations into model deficiencies and cultural generalization
These characteristics position MultiHateClip as a foundational tool for advancing both algorithmic and sociotechnical understanding of hateful content propagation in global video platforms.