The Enforcement and Feasibility of Hate Speech Moderation on Twitter

Published 14 Apr 2026 in cs.CY and cs.CL | (2604.12289v1)

Abstract: Online hate speech is associated with substantial social harms, yet it remains unclear how consistently platforms enforce hate speech policies or whether enforcement is feasible at scale. We address these questions through a global audit of hate speech moderation on Twitter (now X). Using a complete 24-hour snapshot of public tweets, we construct representative samples comprising 540,000 tweets annotated for hate speech by trained annotators across eight major languages. Five months after posting, 80% of hateful tweets remain online, including explicitly violent hate speech. Such tweets are no more likely to be removed than non-hateful tweets, with neither severity nor visibility increasing the likelihood of removal. We then examine whether these enforcement gaps reflect technical limits of large-scale moderation systems. While fully automated detection systems cannot reliably identify hate speech without generating large numbers of false positives, they effectively prioritize likely violations for human review. Simulations of a human-AI moderation pipeline indicate that substantially reducing user exposure to hate speech is economically feasible at a cost below existing regulatory penalties. These results suggest that the persistence of online hate cannot be explained by technical constraints alone but also reflects institutional choices in the allocation of moderation resources.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper shows that hate speech is minimally enforced, with only 20% of hateful tweets removed even five months after posting.
It employs a 24-hour, 540,000 tweet multilingual audit combined with a simulation of an AI-human triage pipeline for moderation.
Findings indicate notable disparities across languages and engagement levels, suggesting institutional resource allocation drives moderation gaps.

The Enforcement and Feasibility of Hate Speech Moderation on Twitter

Overview and Study Design

This paper presents an unprecedented, large-scale, multilingual audit of hate speech moderation enforcement and feasibility on Twitter (now X), utilizing a globally representative 24-hour snapshot amounting to 540,000 manually annotated tweets. The analysis targets two foundational questions for platform governance: the real enforcement rate of hate speech policies, and the empirical limitations and resource requirements for large-scale hate speech moderation, especially through algorithmic and human–AI hybrid approaches.

Figure 1: Study design for auditing hate speech moderation, encompassing global sampling, large-scale annotation, longitudinal enforcement assessment, and feasibility simulation of moderation pipelines.

Empirical Characterization of Hate Speech and its Moderation

Prevalence and Exposure

Hate speech is quantitatively rare (0.42% of tweets labeled hateful; 0.03% violent hate speech), but the absolute scale is non-trivial: this translates into approximately 225,000 hateful and 11,000 violent hateful tweets per day globally. The average user, under a chronological feed, is likely to encounter one hateful tweet daily and one violent hateful tweet every 17 days, with higher exposure probable in algorithmically curated feeds.

Figure 2: Language-level prevalence, engagement share, and target composition of hate speech across sampled languages.

Figure 3: Country-level prevalence, engagement share, and target composition of hate speech.

Removal and Enforcement Patterns

Post-hoc querying revealed the remarkable persistence of hate speech: five months after posting, 80% of hateful tweets were still accessible, including those with explicit violent content. The aggregate tweet removal rate for hate speech was only ~20%, but, critically, this encompasses removals for any reason (including voluntary deletion or deactivation). Platform-attributable removals (i.e., user suspensions) occur in only 9.2% of cases, with 9.0% of users posting hate being suspended.

Notably, hateful tweets are not statistically more likely to be removed compared to non-hateful tweets ( $\beta = -0.02$ , $p=0.84$ ), and there is no observed increase in enforcement for content classified as violent hate speech. The effect sizes are trivial and lack statistical significance, diverging sharply from policy-mandated harm prioritization. Conversely, other violation categories such as scams and adult content saw substantially higher removal rates.

Enforcement showed systematic variation across language and geography, with lower rates for Arabic (–12 percentage points) and Nigerian users ( $p=0.03$ ), and higher rates for Indonesian users (+11 percentage points, $p=0.037$ ), relative to a US English baseline.

Figure 4: Low, severity-unresponsive hate speech enforcement and removal likelihood across content categories, with negative/insignificant correlation between engagement and removal probability.

Further, enforcement rates decrease with tweet engagement; more widely seen (higher engagement) hateful tweets are less likely to be removed ( $\beta = -0.05$ , $p=0.01$ ). No evidence supports systematic deployment of algorithmic visibility reduction ("downranking"), as measured by post-engagement metrics.

Technical Feasibility and Automation Constraints

AI Detection Capability

The study extensively benchmarks both traditional supervised models (including state-of-the-art transformers) and GPT-5.1 in a multilingual setting. All models were evaluated out-of-sample using the representative annotated dataset.

Supervised models and GPT-5.1 fail to provide both high recall and high precision on real-world data ([AP] maxes out at 0.28-0.31 for European languages; much lower in others), even with optimal thresholding.
Main source of false positives: Offensive but not policy-violating content (38% of the top-0.5% highest-scoring tweets on average are merely offensive, not hateful).
Contextually ambiguous, quoted, or referenced hate speech (not direct use) is a recurring error source.
Nevertheless, these models robustly rank actual hateful content near the top of the score distribution: a small percentile (top 3.2%) captures ≈50% of hateful tweets.
Figure 5: (A) Poor precision–recall trade-off for both supervised models and GPT-5.1 across languages; (B) Top 0.5% hate-score bins are dominated by non-hateful content; (C) Hateful tweets are consistently ranked near the scoring apex.

Human–AI Pipeline Simulation

Given the limitations of full automation, the study simulates a triage pipeline: AI surfaces top-scoring tweets for human moderator review, prioritizing by predicted hatefulness × engagement.

With current reported moderator staffing (from Twitter’s DSA filings) and best-in-class models, only 24.5% of engagement with hate speech could be avoided. Substantial disparities appear across languages, reflecting not only detection bottlenecks but also allocation misalignments—Indonesian and Arabic are especially under-resourced.

Cost modeling establishes that preventing 80% of engagement with hate speech is feasible at 2.9% of global revenue—well below statutory fine maxima (6–10% of revenue) under the EU DSA and UK Online Safety Act. Achieving 80% coverage of all hate content (as opposed to engagement-weighted) would be prohibitively expensive (19–61% of revenue) due to the extreme class imbalance and model errors.

Figure 6: (A) Pipeline schematic showing AI-driven triage for human review; (B) Avoided engagement with hate as a function of moderator count (log-scale) per language; (C) Avoided hate engagement versus moderation cost as a share of Twitter global revenue, compared to regulatory fine ceilings.

Implications, Limitations, and Future Directions

Theoretical and Practical Ramifications

The research empirically invalidates the hypothesis that platform-scale hate speech persistence can be explained by technical infeasibility alone. Instead, it evidences that institutional priorities and the allocation of moderation resources drive enforcement gaps. The data-driven audit illuminates, with statistical rigor, the limitations of both standalone AI moderation and current organizational configurations for multilingual, global platforms.

The robust finding that current models are incapable of high-fidelity discrimination between hate and mere offense in realistic, imbalanced, and linguistically diverse settings has direct consequences for the design of algorithmic content moderation systems, reinforcing the necessity for human-in-the-loop review for all but the most clear-cut violations.

Disparities in moderation allocation by language and geography raise serious governance and regulatory equity concerns, especially given evidence of underinvestment in high-need language contexts, corroborated by recent external studies [tonneau2025language, elswah2023does].

On the financial and regulatory compliance side, the demonstrated tractability (sub-3% revenue) of achieving policy-relevant hate speech exposure reduction with hybrid workflows directly informs platform risk modeling and provides actionable benchmarks for regulatory bodies calibrating fines or compliance metrics.

Methodological and Scope Limitations

Analysis is restricted to textual hate speech, excluding images, video, or multimodal contexts that may require distinct models and operational flows.
The sample represents a single day, and may not capture temporal variation due to evolving platform policies or external events.
Cost-benefit simulation assumes upper-bound resource allocation; practical requirements may fall lower due to moderator expertise, workflow optimization, or improved AI in the future.

Speculation on Future Developments

Continued advances in detection model architecture and pre-trained representation quality may gradually reduce (but not eliminate) the error floor observed in real-world hate/offense boundary detection, especially for under-resourced languages. Enhanced explainability and context modeling (e.g., multi-message thread context, user-level modeling) could marginally improve precision without a concomitant collapse in recall.

Policy and organizational changes that more closely align moderator allocation with exposure risk—potentially supported by real-time engagement-driven triage—could substantially improve real-world outcomes at sustainable cost. However, resolving governance and equity issues surrounding disparate resource allocation is likely to require regulatory intervention rather than technical advances alone.

Conclusion

This global audit substantiates that hate speech policies on Twitter are minimally and inconsistently enforced, with removals unresponsive to content severity, reach, or platform-stated priorities. Current AI systems are inadequate for reliable automated moderation, although their ranking capacity enables cost-effective human–AI triage capable of major exposure reductions within existing revenue constraints. Persistent hate speech exposure is thus a function of institutional choices rather than technical necessity; effective mitigation requires both investment in human oversight and organizational commitment to aligning resource allocation with harm reduction goals. Regulatory pressure, backed by cost-aware benchmarks such as those developed in this study, will be pivotal in correcting these systemic misalignments.

(2604.12289)

Markdown Report Issue