RV-HATE: Modular Ensemble for Hate Speech
- RV-HATE is a modular ensemble framework for implicit hate speech detection that integrates reinforcement learning-based weight selection to adapt to diverse dataset characteristics.
- It combines four specialized modules—clustering-based contrastive learning, target-tagging, outlier removal, and hard negative sampling—each fine-tuned on dataset-specific cues.
- The framework leverages PPO to optimize ensemble weights, resulting in improved macro-F1 scores and interpretable attributions of module importance per dataset.
RV-HATE is a modular ensemble framework for implicit hate speech detection that employs reinforcement-learned soft voting to optimize dataset-specific performance. The architecture is explicitly designed to address the heterogeneity of hate speech datasets, which arise from divergent linguistic patterns, social contexts, and annotation schemes. By integrating multiple specialized modules and adapting their ensemble weights to each target dataset via policy optimization, RV-HATE achieves both improved classification accuracy and quantitative interpretability with respect to critical features for a given corpus (Lee et al., 13 Oct 2025).
1. Multi-Module Architecture
RV-HATE comprises four distinct modules $M_1$–$M_4$, each producing a two-class logit vector for an input text $x$. Modules are independently fine-tuned and are as follows:
- $M_1$: Clustering-based Contrastive Learning
- Input: raw sentence $x$
- Encoder: BERT-base generates embedding $h$
- Clustering: training embeddings are clustered into $k$ clusters; the center $c_j$ is the cluster mean
- Anchor selection: each sample's anchor is the center $c_{a(i)}$ of its assigned cluster
- Contrastive loss (SharedCon, cosine):
$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(h_i, h_i^{+})/\tau\big)}{\sum_{j \neq i} \exp\!\big(\mathrm{sim}(h_i, h_j)/\tau\big)}$$
where $h_i^{+}$ is a positive pair for anchor $h_i$, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, and $\tau$ is the temperature.
- $M_2$: Target-Tagging with [TARGET] Tokens
- Input: $x$ with NER-derived “[TARGET]” spans marking ORG/NORP/GPE entities (spaCy + GPT-4o)
- Encoder: BERT-base, same contrastive objective as $M_1$, but with tagged input.
- $M_3$: Outlier Removal within Clusters
- Procedure as in $M_1$, but with outliers removed: each sample's distance $d_i$ to its cluster center is computed, and samples with $d_i$ above a threshold are excluded before computing the contrastive loss.
- $M_4$: Hard Negative Sampling
- Maintains a queue $Q$ of hard negatives (samples of the opposing class with high similarity, or false positives with high confidence)
- The contrastive loss draws negatives from both the in-batch samples and $Q$.
Each module $M_m$ outputs two-class logits $z_m(x)$ via a classification head.
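The anchor-based contrastive objective of $M_1$ can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of cluster centers as both positives and (cross-cluster) negatives, and the temperature value are assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_contrastive_loss(embeddings, assignments, tau=0.1):
    """InfoNCE-style loss with cluster centers as anchors (sketch of M1).

    embeddings: (n, d) array-like of sentence embeddings
    assignments: length-n list of cluster indices
    """
    embeddings = np.asarray(embeddings, dtype=float)
    clusters = sorted(set(assignments))
    # Cluster center = mean of member embeddings.
    centers = {c: embeddings[[i for i, a in enumerate(assignments) if a == c]].mean(axis=0)
               for c in clusters}
    loss = 0.0
    for i, a in enumerate(assignments):
        # Positive: the sample's own cluster center; negatives: the other centers.
        logits = np.array([cosine_sim(embeddings[i], centers[c]) / tau for c in clusters])
        log_prob = logits[clusters.index(a)] - np.log(np.exp(logits).sum())
        loss -= log_prob
    return loss / len(assignments)
```

Embeddings that sit close to their own cluster center and far from the others drive the loss toward zero, which is the geometry the clustering-based module encourages.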
2. Reinforcement Learning-Based Weight Selection
Weights $w_1, w_2, w_3, w_4$ modulate module contributions for a specific dataset. Weight selection is formulated as a one-step Markov decision process:
- State: compact vector of dataset statistics, e.g., ratio of “[TARGET]” tags, outlier rate, implicit hate ratio.
- Action: weight vector $w = (w_1, w_2, w_3, w_4)$ in the simplex over the four modules (nonnegative, summing to one).
- Policy: $\pi_\phi$, parameterized via a two-layer MLP, outputs Dirichlet or softmax pre-weights.
- Reward: macro-F1 score on the validation set for predictions made with weights $w$:
$$r(w) = \mathrm{macroF1}\big(\hat{y}_w, y_{\mathrm{val}}\big)$$
- Optimization: Proximal Policy Optimization (PPO) is used, with surrogate loss
$$L^{\mathrm{CLIP}}(\phi) = \mathbb{E}_t\Big[\min\big(r_t(\phi)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\phi),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big]$$
where $r_t(\phi)$ is the ratio $\pi_\phi(a_t \mid s_t)/\pi_{\phi_{\mathrm{old}}}(a_t \mid s_t)$ and $\hat{A}_t$ is the advantage estimate.
A single PPO policy is trained for 10,000 steps to select the optimal $w$ per dataset.
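The clipped surrogate can be computed with a small helper over sampled probability ratios and advantage estimates; the epsilon value below is illustrative, not the paper's setting.

```python
import numpy as np

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratios: pi_new(a|s) / pi_old(a|s) per sampled action
    advantages: advantage estimates (e.g. from GAE) per action
    """
    r = np.asarray(ratios, dtype=float)
    a = np.asarray(advantages, dtype=float)
    unclipped = r * a
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps) * a
    # Taking the elementwise min gives a pessimistic bound that removes the
    # incentive to push the policy ratio outside [1-eps, 1+eps].
    return float(np.minimum(unclipped, clipped).mean())
```

Note the asymmetry: a large ratio with positive advantage is clipped at $1+\epsilon$, while a large ratio with negative advantage is not, since the min keeps the worse value.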
3. Ensemble Voting and Prediction
For a sample $x$, each module $M_m$ outputs logits $z_{m,c}(x)$ for classes $c \in \{0, 1\}$. The ensemble logit for class $c$ is
$$z_c(x) = \sum_{m=1}^{4} w_m\, z_{m,c}(x).$$
The final prediction is
$$\hat{y}(x) = \arg\max_{c \in \{0,1\}} z_c(x).$$
Since weights are nonnegative and sum to one, further normalization is unnecessary.
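The voting step reduces to a weighted sum of per-module logits followed by an argmax. A minimal sketch (the logit values in the usage are hypothetical):

```python
import numpy as np

def ensemble_predict(module_logits, weights):
    """Weighted soft voting over module logits (sketch).

    module_logits: (4, 2) array-like, one two-class logit row per module
    weights: length-4 nonnegative weights summing to one
    Returns the predicted class index.
    """
    z = np.asarray(module_logits, dtype=float)
    w = np.asarray(weights, dtype=float)
    ensemble = w @ z  # (2,) weighted sum of logits per class
    return int(np.argmax(ensemble))
```

Because the argmax is invariant to positive rescaling of the summed logits, the constraint that the weights already sum to one is what makes further normalization unnecessary.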
4. Training Procedure and Dataset Adaptation
Each module is independently trained for six epochs on the target dataset via the objective
$$\mathcal{L}_m = \mathcal{L}_{\mathrm{con}}^{(m)} + \lambda\, \mathcal{L}_{\mathrm{CE}},$$
where $\mathcal{L}_{\mathrm{con}}^{(m)}$ is the module-specific contrastive loss, $\mathcal{L}_{\mathrm{CE}}$ is cross-entropy with the ground-truth label $y$, and $\lambda$ balances the two terms.
After training and freezing module parameters, PPO optimizes the voting weights $w$ based on the dataset-specific state. The learned policy generates test-time weights $w^{\ast}$, with macro-F1 evaluated on the test partition. This two-stage approach (independent module adaptation, then ensemble weight optimization) ensures both flexibility and dataset sensitivity.
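Since macro-F1 serves as both the PPO reward and the evaluation metric, it is worth pinning down. A minimal binary macro-F1, written from the standard definition rather than the paper's code:

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally regardless of its frequency, macro-F1 rewards the ensemble for handling the (often rarer) hateful class, which is why it is the standard metric for these imbalanced benchmarks.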
5. Interpretability and Attribution
RV-HATE’s learned weights $w_m$ provide quantitative attribution of module importance for each dataset. On the IHC dataset, for example, the mean learned weights rank the relative contributions of $M_1$ through $M_4$. Ablation, zeroing out each $w_m$ in turn and renormalizing the rest, directly quantifies the macro-F1 impact of every module. Together, these two metrics (module weights and ablation performance) yield interpretable insights into which linguistic or contextual properties are most predictive per corpus. This suggests that RV-HATE not only adapts to, but also exposes, data-specific cues and vulnerabilities.
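The zero-and-renormalize ablation is simple enough to sketch directly; the weight values used in the example are placeholders, not the reported IHC weights.

```python
import numpy as np

def ablate_module(weights, m):
    """Zero out module m's weight and renormalize the rest to sum to one."""
    w = np.asarray(weights, dtype=float).copy()
    w[m] = 0.0
    total = w.sum()
    if total == 0:
        raise ValueError("cannot renormalize: all remaining weights are zero")
    return w / total
```

Running the frozen ensemble with `ablate_module(w, m)` for each $m$ and comparing macro-F1 against the full ensemble isolates each module's contribution.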
6. Empirical Results and Dataset Coverage
The framework was evaluated on five English hate speech benchmarks:
| Dataset | Instances | Characteristics |
|---|---|---|
| IHC | 22 K | Implicit hate with human-written implied statements (tweets) |
| SBIC | 150 K | Offensiveness and target-entity labels |
| DYNA | ~41 K | Adversarially constructed hate speech |
| HatEval | 13 K | Targets: immigrants/women, Twitter-based |
| Toxigen | 6 K | Machine-generated toxic/benign examples |
Performance comparison (macro-F1; average of 3 seeds):
| Model | IHC | SBIC | DYNA | HatEval | Toxigen | Avg |
|---|---|---|---|---|---|---|
| CE | 77.70 | 83.80 | 78.80 | 81.11 | 90.06 | 82.29 |
| SCL | 77.81 | 82.92 | 80.39 | 81.28 | 90.75 | 82.63 |
| SharedCon (SOTA) | 78.50 | 84.30 | 79.10 | 80.24 | 91.21 | 82.67 |
| LAHN | 78.40 | 83.98 | 79.64 | 80.42 | 90.42 | 82.57 |
| RV-HATE | 79.07 | 84.62 | 81.82 | 83.44 | 93.41 | 84.47 |
RV-HATE yields a mean improvement of +1.8 percentage points in macro-F1 over SharedCon, indicating the efficacy of dataset-aware modular weighting.
7. Technical Configuration and Resources
Key implementation specifications include:
- Backbone: BERT-base-uncased (110M parameters)
- Embeddings for contrastive sampling: SimCSE (unsupervised)
- Optimizer: AdamW with one of two learning rates, batch size 32
- Hyperparameters: contrastive temperature $\tau$, loss balance $\lambda$, cluster count $k$
- PPO: 10,000 steps, clipping parameter $\epsilon$, advantage via GAE, policy MLP with 2 layers (64 units)
- Hardware: NVIDIA RTX 4090, 3 random seeds
A plausible implication is that the modular and dataset-conditioned design of RV-HATE is well-suited to fields characterized by substantial domain and distribution drift, as both its architecture and adaptation procedure are systematized to expose and leverage dataset-specific linguistic phenomena (Lee et al., 13 Oct 2025).