RAL-E: Reddit Abusive Language English Dataset
- RAL-E is a large-scale dataset of over 1.4M Reddit comments sourced from banned English subreddits, primarily serving as a pre-training corpus for HateBERT.
- Its preprocessing pipeline standardizes user mentions, URLs, and emojis, ensuring linguistic uniformity for effective neural model training.
- Despite its scale, the dataset lacks explicit instance-level annotations and shows significant skew toward certain communities, impacting generalizability.
The Reddit Abusive Language English (RAL-E) dataset comprises a large-scale collection of Reddit comments sourced exclusively from English-language subreddits that were banned for promoting offensive, abusive, or hateful content. Developed to support neural models of abusive language detection, RAL-E features over 1.4 million comments from communities censured by Reddit administrators and, notably, serves as the pre-training corpus for HateBERT, a domain-adapted variant of BERT optimized for hostile language phenomena (Caselli et al., 2020).
1. Dataset Construction and Provenance
RAL-E was constructed by aggregating comments from a curated list of banned Reddit communities. The candidate subreddits were enumerated from official Reddit announcements and the “Controversial Reddit communities” Wikipedia page. Comments were extracted from the Reddit public comment archive, covering December 2005 to March 2017; however, filtering for relevance yielded a final temporal span of January 2012 through June 2015.
Inclusion criteria were strict: only comments from the dedicated banned-community list were retained. All non-textual metadata was discarded except for the raw comment and the subreddit identifier. The corpus is exclusively English, relying on subreddit affiliation as a proxy for abusive content due to community-level bans enacted for harassment, incitement of violence, or hate.
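The curation step described above amounts to a simple filter over archived comment records, keeping only the raw text and subreddit identifier for comments from the banned-community list. The record field names and example subreddit set below are illustrative assumptions, not the authors' actual pipeline:

```python
# Sketch of the RAL-E curation step: retain only comments from banned
# communities, discarding all metadata except body text and subreddit.
# Field names ("subreddit", "body") are assumed for illustration.
BANNED_SUBREDDITS = {"fatpeoplehate", "sjwhate", "milliondollarextreme"}

def curate(archive_records):
    """Yield (subreddit, body) pairs for comments from banned communities."""
    for record in archive_records:
        if record.get("subreddit") in BANNED_SUBREDDITS:
            yield record["subreddit"], record["body"]

# Synthetic archive slice: only the first record survives filtering.
records = [
    {"subreddit": "fatpeoplehate", "body": "example comment", "author": "u1"},
    {"subreddit": "askscience", "body": "off-topic comment", "author": "u2"},
]
kept = list(curate(records))
```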
2. Preprocessing Pipeline
The preprocessing pipeline applied multiple standardization steps to maximize linguistic uniformity and facilitate model consumption (Caselli et al., 2020):
- User mentions were mapped to the token @USER.
- URLs were replaced with the token URL.
- Emojis were normalized to descriptive strings via the Python emoji package (e.g., 😊 → :smiling_face_with_smiling_eyes:).
- All hashtag symbols (#) were removed from tokens.
- Multiple consecutive spaces were collapsed to single spaces, and blank lines were removed.
No further filtering regarding length or language was performed beyond subreddit-level selection.
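The standardization steps can be sketched as a small text-normalization function. This is a minimal approximation, not the authors' code: the mention pattern is an assumption about how Reddit mentions appear, and the emoji step uses a tiny stand-in mapping in place of the emoji package's `demojize`:

```python
import re

# Stand-in for emoji.demojize; the original pipeline used the Python
# `emoji` package to map each emoji to its descriptive alias.
EMOJI_ALIASES = {"😊": ":smiling_face_with_smiling_eyes:"}

def preprocess(text: str) -> str:
    # Map user mentions (assumed u/name or /u/name form) to @USER.
    text = re.sub(r"/?u/[A-Za-z0-9_-]+", "@USER", text)
    # Replace URLs with the URL token.
    text = re.sub(r"https?://\S+", "URL", text)
    # Normalize emojis to descriptive alias strings.
    for emo, alias in EMOJI_ALIASES.items():
        text = text.replace(emo, alias)
    # Strip hashtag symbols.
    text = text.replace("#", "")
    # Collapse runs of spaces/tabs and drop blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line for line in text.splitlines() if line.strip())
    return text.strip()
```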
3. Corpus Statistics and Composition
RAL-E’s final size is 1,492,740 comments, comprising 43,820,621 tokens. The corpus does not include manual annotation—no explicit abusive, hateful, or offensive labels exist at the comment level. Instead, the dataset's abusive character is implicit, arising from the curation strategy of sourcing only banned communities. This method leads to substantial class skew by subreddit:
| Subreddit | # of Comments | Proportion |
|---|---|---|
| fatpeoplehate | 1,465,531 | ≈0.982 |
| sjwhate | 10,080 | ≈0.0068 |
| milliondollarextreme | 9,543 | ≈0.0064 |
A plausible implication is that models trained on RAL-E may predominantly learn abusive language patterns characteristic of fatpeoplehate, with minority influence from other communities.
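The proportions in the table follow directly from the per-subreddit counts and the reported corpus total of 1,492,740 comments:

```python
TOTAL = 1_492_740  # total RAL-E comments reported by Caselli et al. (2020)

counts = {
    "fatpeoplehate": 1_465_531,
    "sjwhate": 10_080,
    "milliondollarextreme": 9_543,
}

# Proportion of the corpus contributed by each listed subreddit.
proportions = {sub: n / TOTAL for sub, n in counts.items()}

# Note: the three listed counts sum to less than TOTAL; the table
# presumably shows only the largest communities in the corpus.
remainder = TOTAL - sum(counts.values())
```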
4. Utilization in Downstream Modeling
RAL-E is an unlabeled corpus, designed exclusively for further pre-training of transformer-based language models. HateBERT, introduced by Caselli et al. (2020), was produced by re-training BERT on RAL-E and subsequently demonstrated superior performance, relative to standard BERT, on multiple English datasets for offensive language, abusive language, and hate speech detection.
Experiments reveal improved portability of representations when pre-training on abuse-centric corpora, but performance and adaptability are contingent on the compatibility of annotated phenomena between source and target datasets.
5. Comparison to Fine-Grained Offensiveness Norms
RAL-E’s principal utility is in domain-adapted representation learning. In contrast, resources such as Ruddit provide crowd-annotated, fine-grained offensiveness scores for diverse Reddit content, employing Best–Worst Scaling (BWS) over a continuous interval (Hada et al., 2021). Ruddit’s annotation methodology yields reliable real-valued scores (as measured by split-half reliability), allowing direct regression benchmarking; RAL-E, by contrast, supplies no instance-level scores at all, and its coverage is restricted to abusive or banned subreddits. HateBERT’s strong results on Ruddit attest to the generalizability of RAL-E-based pre-training for offensive language detection, but also highlight RAL-E’s limitations in coverage and score granularity for downstream regression tasks.
6. Limitations and Potential Applications
RAL-E's principal limitation is its lack of explicit, instance-level annotation. It should be understood as a domain corpus rather than a supervised dataset. The extreme skew toward particular banned communities suggests caution in extrapolating results to the broader Reddit landscape or other social media environments. Nevertheless, its value in training robust, abuse-adapted LLMs is empirically supported by cross-dataset evaluation, most notably in the architecture and deployment of HateBERT. RAL-E remains a reference corpus for pre-training models intended to operate in hostile, abusive, or incivility-inflected online contexts (Caselli et al., 2020).