MultiHateClip: Multilingual Multimodal Hate Detection
- MultiHateClip is a multilingual, multimodal dataset for detecting hate speech, providing fine-grained annotations for hateful, offensive, and normal video clips from English and Chinese platforms.
- The dataset combines automated filtering with ChatGPT and a rigorous two-stage human annotation process to capture modality-specific cues and victim group labels.
- Benchmark evaluations highlight challenges in differentiating subtle hate from offensive content, stressing the need for advanced multimodal fusion and culturally aware models.
The MultiHateClip dataset is a multilingual, multimodal benchmark specifically designed for the fine-grained detection of hateful and offensive video content across English and Chinese, with a cross-platform focus on YouTube and Bilibili. It advances hate speech research by extending annotation granularity beyond binary labels and by systematically incorporating human annotation of contributing modalities and victim groups, providing a culturally comparative framework for analyzing gender-based hate in online video ecosystems (Wang et al., 28 Jul 2024).
1. Dataset Composition and Construction
MultiHateClip consists of 2,000 short video clips (≤60 s each), evenly divided between English (YouTube) and Chinese (Bilibili). Video candidates were retrieved via the official platform APIs using 80 gender-based hate keywords per language, derived from the HurtLex and SWSR lexicons. An initial programmatic filter using ChatGPT (GPT-3.5) was applied to titles and transcripts, after which a two-stage human annotation process labeled the 2,000 retained clips.
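A minimal sketch of the LLM pre-filtering stage, assuming an OpenAI-style client; the exact prompt wording, model configuration, and decision rule used by the authors are not documented here and are illustrative only:

```python
# Sketch of the programmatic pre-filter applied to title + transcript.
# Hypothetical prompt and threshold; final labels come from human annotators.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_prefilter(title: str, transcript: str) -> bool:
    """Coarse yes/no pre-filter: does the clip plausibly contain
    gender-based hateful or offensive content?"""
    prompt = (
        "Does the following video title and transcript plausibly contain "
        "gender-based hateful or offensive content? Answer yes or no.\n"
        f"Title: {title}\nTranscript: {transcript}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```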
The annotation workflow incorporated:
- Two crowd annotators per clip, drawn from a pool of 18 Asian undergraduates balanced by gender and language proficiency
- Conflict resolution by a third annotator, with escalation to domain experts (two PhD students) for persistent disagreement
The annotation schema included three mutually exclusive categories:
- Hateful: inciting discrimination or demeaning protected groups
- Offensive: distressing or vulgar, not targeting protected attributes
- Normal: neither hateful nor offensive
For Hateful and Offensive videos, annotators also marked:
- Start and end times of the segment
- Target group (Man/Woman/LGBTQ+/Other)
- Contributing modality (Text, Audio, Vision)
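Taken together, the per-clip annotation can be represented as a simple record; the following sketch uses illustrative field names, not the released schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Label(Enum):
    HATEFUL = "hateful"
    OFFENSIVE = "offensive"
    NORMAL = "normal"

@dataclass
class ClipAnnotation:
    video_id: str
    language: str                     # "en" (YouTube) or "zh" (Bilibili)
    label: Label                      # one of three mutually exclusive classes
    # Fields below are populated only for Hateful/Offensive clips.
    start_s: Optional[float] = None   # start of the harmful segment (seconds)
    end_s: Optional[float] = None     # end of the harmful segment (seconds)
    targets: list[str] = field(default_factory=list)     # e.g. ["Woman", "LGBTQ+"]
    modalities: list[str] = field(default_factory=list)  # subset of {"text", "audio", "vision"}
```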
Cohen’s κ inter-annotator agreement for multiclass labeling was κ = 0.62 (English) and κ = 0.51 (Chinese); for the binary grouping (Hateful+Offensive vs. Normal), κ = 0.72 (English) and κ = 0.66 (Chinese).
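For reference, both agreement figures follow the standard Cohen's κ computation over paired label vectors; a generic scikit-learn sketch with placeholder labels:

```python
from sklearn.metrics import cohen_kappa_score

# Paired labels from the two crowd annotators (placeholder values).
annotator_a = ["hateful", "offensive", "normal", "normal", "offensive"]
annotator_b = ["offensive", "offensive", "normal", "normal", "hateful"]

# Multiclass agreement over the three labels.
kappa_multi = cohen_kappa_score(annotator_a, annotator_b)

# Binary grouping: Hateful + Offensive collapsed vs. Normal.
to_binary = lambda y: ["harmful" if l != "normal" else "normal" for l in y]
kappa_binary = cohen_kappa_score(to_binary(annotator_a), to_binary(annotator_b))
print(f"multiclass κ = {kappa_multi:.2f}, binary κ = {kappa_binary:.2f}")
```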
2. Label Distribution and Demographic Details
The overall label distribution differs cross-linguistically. For the English (YouTube) split: 8.2% Hateful, 25.6% Offensive, 66.2% Normal. For the Chinese (Bilibili) split: 12.8% Hateful, 19.4% Offensive, 67.8% Normal. Average video length is 34 s (English) and 32 s (Chinese).
Victim group statistics by language (selected categories, clip counts):
| Language | Women (H/O) | Men (H/O) | LGBTQ (H/O) | Others (H/O) |
|---|---|---|---|---|
| English | 40 / 121 | 28 / 60 | 35 / 30 | 17 / 34 |
| Chinese | 59 / 97 | 54 / 50 | 32 / 29 | 48 / 13 |
The "Others" category refers primarily to religion/race (English) or nationality (Chinese). The Chinese split exhibits a higher rate of Hateful content, while both languages most frequently target women. The data indicate that Chinese clips feature a greater incidence of multimodal hate expression.
3. Multimodal Contribution Analysis
MultiHateClip annotations capture, for each Hateful/Offensive video, the modality or combination of modalities responsible for conveying the harmful content. The majority of videos require more than one modality for correct identification.
Distribution of contributing modalities for Hateful/Offensive labels:
| Modality | English (H/O) | Chinese (H/O) |
|---|---|---|
| Text only | 12 / 97 | 19 / 38 |
| Audio only | 1 / 2 | 0 / 0 |
| Vision only | 0 / 12 | 0 / 6 |
| Text ⊙ Audio | 25 / 42 | 16 / 25 |
| Text ⊙ Vision | 7 / 46 | 40 / 63 |
| Text ⊙ Audio ⊙ Vision | 37 / 55 | 53 / 61 |
Unimodal indicators (text, audio, vision) alone are often insufficient; for example, many hateful Chinese videos require joint analysis of text and vision. Qualitative analysis found that distinguishing Hateful vs. Offensive content frequently depends on nuanced multimodal cues. Text-based TF-IDF analysis identifies dominant hate vocabulary, including explicit slurs in both languages.
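The vocabulary analysis can be approximated with a standard TF-IDF pass over class-specific transcripts; a sketch in which the document lists are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hateful_docs: title+transcript strings for Hateful clips (placeholders).
hateful_docs = ["...transcript of one hateful clip...",
                "...transcript of another hateful clip..."]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(hateful_docs)

# Rank terms by mean TF-IDF weight across the Hateful class.
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = np.array(vectorizer.get_feature_names_out())
top_terms = terms[mean_weights.argsort()[::-1][:20]]
print("Dominant Hateful-class vocabulary:", top_terms)
```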
Audio features such as high amplitude and zero-crossing rate are more prevalent in Hateful/Offensive English videos, suggesting increased loudness and noisiness. Visual analysis using YOLOv3 finds that “person” detection occurs in ~70% of English Hateful/Offensive frames (vs. 63% for Normal videos), but detection rates are markedly lower for Chinese videos, underscoring cross-cultural biases in vision models pretrained on Western data.
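The cited audio statistics correspond to standard librosa features; a sketch assuming a local audio file path:

```python
import librosa
import numpy as np

# Load a clip's audio track (hypothetical path), mono, native sampling rate.
y, sr = librosa.load("clip.wav", sr=None, mono=True)

# Mean absolute amplitude as a loudness proxy.
amplitude = float(np.mean(np.abs(y)))

# Frame-level zero-crossing rate as a noisiness proxy.
zcr = float(np.mean(librosa.feature.zero_crossing_rate(y)))

print(f"mean |amplitude| = {amplitude:.4f}, mean ZCR = {zcr:.4f}")
```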
4. Benchmark Evaluation and Model Performance
The benchmark task is three-way classification into Hateful, Offensive, or Normal, using various combinations of textual (T), audio (A), and visual (V) modalities. A binary evaluation (Hateful+Offensive vs. Normal) is also reported.
Core model architectures include:
- Text-only: mBERT; GPT-4 and Qwen-VL (text-only input variants)
- Audio-only: MFCC features, Wav2Vec2-BERT
- Vision-only: ViViT, ViT+LSTM
- Vision-language: VLM, GPT-4V, Qwen-VL
- Multimodal fusion: M1 (late-fusion of mBERT, MFCC, ViViT)
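A minimal late-fusion sketch in the spirit of M1, assuming pooled features from frozen mBERT, MFCC, and ViViT encoders have already been computed; the feature dimensions and fusion rule are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Late fusion: each modality gets its own linear classifier over
    precomputed pooled features, and class logits are averaged.
    Dims are illustrative (mBERT 768-d, MFCC 40-d, ViViT 768-d)."""
    def __init__(self, dims=(768, 40, 768), num_classes=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in dims])

    def forward(self, text_feat, audio_feat, vision_feat):
        feats = (text_feat, audio_feat, vision_feat)
        logits = [head(f) for head, f in zip(self.heads, feats)]
        return torch.stack(logits).mean(dim=0)  # average per-modality logits

# Usage with a dummy batch of 4 clips.
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 40), torch.randn(4, 768))
```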
Macro-F1 is used as the primary evaluation metric. Topline results:
| Model | Macro-F1 (EN multi) | Acc (EN multi) | Macro-F1 (ZH multi) | Acc (ZH multi) |
|---|---|---|---|---|
| GPT-4V (VL2) | 0.63 | 0.77 | 0.47 | 0.66 |
| M1 (mBERT⊙MFCC⊙ViViT) | 0.54 | 0.69 | 0.50 | 0.68 |
| ViViT (best unimodal) | 0.49 | 0.66 | 0.48 | 0.63 |
Binary task results: English—GPT-4V Macro-F1 0.79, Acc 0.81; M1 Macro-F1 0.74, Acc 0.75. For Chinese, M1 achieves Macro-F1 0.78, Acc 0.80 (mBERT-only: 0.65, 0.67).
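Both the multiclass and binary scores follow the standard macro-F1 and accuracy definitions; a generic scikit-learn sketch with placeholder predictions:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["hateful", "offensive", "normal", "normal", "offensive"]
y_pred = ["offensive", "offensive", "normal", "hateful", "normal"]

# Three-way task: macro-F1 averages per-class F1 equally, so the rare
# Hateful class weighs as much as Normal despite fewer examples.
macro_f1 = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

# Binary task mirrors the annotation grouping: Hateful+Offensive vs. Normal.
collapse = lambda y: ["harmful" if l != "normal" else "normal" for l in y]
macro_f1_bin = f1_score(collapse(y_true), collapse(y_pred), average="macro")
```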
Vision-language models (notably GPT-4V) are most effective for English, particularly for implicit hate detection, while late fusion (M1) improves over the best unimodal models in both English and Chinese. However, all models have difficulty separating Hateful from merely Offensive content, with F1(Hateful) as low as 0.0 in several configurations.
5. Dataset Features, Schema, and Access
MultiHateClip provides annotations and metadata for each video:
- Multiclass and binary labels (Hateful, Offensive, Normal)
- Start/end timestamps of the hateful/offensive segment
- Victim group(s) and responsible input modality/modalities
Preprocessing and schema:
- Text: stop-word filtered title+transcript concatenation
- Audio: raw waveform with MFCC or Wav2Vec2 features
- Vision: uniformly sampled at 1 fps, padded/truncated to 60 frames
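A sketch of the visual preprocessing step (1 fps sampling, pad/truncate to 60 frames), assuming OpenCV for decoding; the zero-padding choice is illustrative:

```python
import cv2
import numpy as np

def sample_frames(path: str, fps_out: float = 1.0, max_frames: int = 60) -> np.ndarray:
    """Sample ~fps_out frames/s, then pad (zeros) or truncate to max_frames."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if fps metadata is missing
    step = max(int(round(src_fps / fps_out)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    frames = frames[:max_frames]
    if frames and len(frames) < max_frames:  # zero-pad short clips
        frames.extend([np.zeros_like(frames[0])] * (max_frames - len(frames)))
    return np.stack(frames) if frames else np.zeros((max_frames, 1, 1, 3), np.uint8)
```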
Dataset splits are stratified (70% train, 10% validation, 20% test). Full video files are not redistributed; only metadata and extracted feature sets are accessible, subject to a non-commercial, research-only license and institutional approval.
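A stratified 70/10/20 split can be reproduced with two successive stratified calls; the id and label arrays below are placeholders:

```python
from sklearn.model_selection import train_test_split

video_ids = [f"vid_{i}" for i in range(2000)]  # placeholder ids
labels = ["normal"] * 1340 + ["offensive"] * 450 + ["hateful"] * 210  # placeholder counts

# 70% train vs. 30% temp, stratified on the 3-way label.
train_ids, temp_ids, train_y, temp_y = train_test_split(
    video_ids, labels, test_size=0.30, stratify=labels, random_state=0)

# Split temp into 10% val / 20% test of the full set (1/3 vs. 2/3 of temp).
val_ids, test_ids, _, _ = train_test_split(
    temp_ids, temp_y, test_size=2 / 3, stratify=temp_y, random_state=0)
```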
6. Challenges, Limitations, and Prospects
Three core challenges are identified:
- Low separability between Hateful and Offensive classes due to annotation ambiguity and subtle expression, especially in implicit hate.
- Cultural and domain bias, particularly in vision models (e.g., YOLOv3 yields low detection rates for Chinese content relative to English).
- Data sparsity in Hateful labels, restricting effective supervised model training.
Recommended directions include augmenting corpora with additional languages and non-Western content, developing more sophisticated multimodal fusion architectures (e.g., early fusion, cross-attention), incorporating external world knowledge for implicit hate, enriching target categories (e.g., race, religion), and collecting longer audiovisual samples.
7. Significance and Applications
MultiHateClip is the first benchmark to provide a comprehensive, cross-lingual, multimodal resource for video hate speech detection with fine-grained annotation of protected groups and explicit multimodal source labeling (Wang et al., 28 Jul 2024). The dataset enables:
- Training and assessment of multimodal hate content classifiers
- Comparative analysis of cross-cultural hate expressions in video
- Exploration of modality-specific and joint cues in online hate
- Investigations into model deficiencies and cultural generalization
These characteristics position MultiHateClip as a foundational tool for advancing both algorithmic and sociotechnical understanding of hateful content propagation in global video platforms.