SoftHateBench: Framework for Soft Hate Speech
- SoftHateBench is a generative evaluation framework designed to assess moderation models’ ability to detect reasoning-driven soft hate speech that hides behind policy-compliant language.
- It integrates the Argumentum Model of Topics and Relevance Theory to generate adversarial soft hate instances by reversing hostile argument chains.
- The benchmark reveals significant detection gaps in today’s moderation systems, highlighting the need for improved methodologies to capture subtle, inferential hate speech.
SoftHateBench is a generative evaluation framework designed to assess moderation models’ ability to detect "soft hate speech"—hostility conveyed through reasoning-driven, policy-compliant language rather than explicit toxicity. It systematically contrasts conventional hard hate with nuanced, inferential variants, integrating argument structure modeling with pragmatic, relevance-driven cue selection. The benchmark provides a large, domain-diverse resource for precisely diagnosing the blind spots of modern hate speech detection systems, especially those based on LLMs or surface-level classifiers (Su et al., 28 Jan 2026).
1. Conceptual Distinction and Motivation
Hard hate speech denotes overt hostility: explicit slurs, threats, and aggressive euphemism, typically rich in lexical and surface signals. In contrast, soft hate speech—or "soft hate"—is articulated without obvious insult or threat. It frames hostility within plausible, even policy-oriented premises (e.g., appeals to public safety, tradition, or “common sense”), remaining superficially compliant with moderation policy while subtly steering the reader toward bias or exclusion against a target group. For example, language critiquing “public institutions banning clothing that promotes a particular ideology” implicitly targets groups while disguising exclusionary intent.
Prevailing moderation systems are trained primarily on detecting surface toxicity signals. Classifier ensembles, keyword-based blocks, and LLMs optimized for toxic language cues systematically fail when confronted with soft hate, where the antagonism is embedded in value-based reasoning rather than direct verbal aggression. SoftHateBench highlights this failure mode by providing adversarially generated soft-hate instances traceable directly to hard-hate standpoints, thus exposing the reasoning gap in existing moderation models (Su et al., 28 Jan 2026).
2. Underlying Theoretical Framework
SoftHateBench’s generation of soft hate employs two foundational models: the Argumentum Model of Topics (AMT) and Relevance Theory (RT).
Argumentum Model of Topics (AMT)
AMT formalizes an argument as a defeasible inference chain (E, D, P, L, M, S),
where
- E: Endoxon, a shared value or societal belief;
- D: Datum, application of E to the target group ([TG]);
- P: Premise arising from E and D;
- L: Locus, the abstract argumentative scheme (e.g., moral value);
- M: Maxim, an action rule instantiated from L;
- S: Standpoint, expressing the ultimate hostile position.
AMT encodes two defeasible inference steps: (E, D) ⇒ P and (P, M) ⇒ S.
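The chain and its two inference steps can be represented as a plain data structure; a minimal Python sketch (the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass


@dataclass
class AMTChain:
    """One AMT argument chain: (E, D) => P, L -> M, (P, M) => S."""
    endoxon: str     # E: shared value or societal belief
    datum: str       # D: application of E to the target group [TG]
    premise: str     # P: premise arising from E and D
    locus: str       # L: abstract argumentative scheme
    maxim: str       # M: action rule instantiated from L
    standpoint: str  # S: ultimate hostile position

    def first_step(self) -> tuple:
        # Defeasible step 1: (E, D) defeasibly entails P
        return (self.endoxon, self.datum), self.premise

    def second_step(self) -> tuple:
        # Defeasible step 2: (P, M) defeasibly entails S
        return (self.premise, self.maxim), self.standpoint
```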
Relevance Theory (RT)
RT models pragmatic interpretation as a balance of informational Effect against cognitive Cost:
- Effect is estimated via NLI entailment metrics (favoring high entailment and low contradiction);
- Cost is a composite of NLI-based resistance, surprisal, next-token entropy, and a redundancy penalty, with all dimensions computed from ensemble LLM scores.
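A hedged sketch of how such an Effect/Cost trade-off could be computed from NLI probabilities and language-model statistics; the function names, the equal weights, and the subtractive combination are assumptions, not the paper's exact formulation:

```python
def effect(entail_prob: float, contra_prob: float) -> float:
    """Informational Effect: favor high entailment, low contradiction."""
    return entail_prob - contra_prob


def cost(nli_resistance: float, surprisal: float,
         next_token_entropy: float, redundancy: float,
         weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Cognitive Cost: weighted composite of the four dimensions
    (equal weights here are purely illustrative)."""
    dims = (nli_resistance, surprisal, next_token_entropy, redundancy)
    return sum(w * d for w, d in zip(weights, dims))


def relevance(entail_prob, contra_prob, nli_resistance,
              surprisal, next_token_entropy, redundancy) -> float:
    """RT-style relevance: Effect traded off against Cost."""
    return effect(entail_prob, contra_prob) - cost(
        nli_resistance, surprisal, next_token_entropy, redundancy)
```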
AMT structures ensure logical coherence and plausible stance preservation, while RT provides a selection principle favoring argument chains most likely to be effective yet subtle in steering reader inference. The unified AMT–RT framework grounds the benchmark’s generative pipeline (Su et al., 28 Jan 2026).
3. Benchmark Generation and Structure
Benchmark creation proceeds in four stages:
Stage 1: Seed Extraction
- Collected ~266,000 posts from six public hate-speech corpora.
- Filtering (semantic deduplication, content filters, ensemble detector consensus) yields 16,426 hard-hate instances.
- Extraction of the hostile standpoint S and associated target group [TG] is performed via the DeepSeek-V3.1 LLM.
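The paper's semantic deduplication presumably relies on embedding similarity; as a self-contained stand-in, a greedy near-duplicate filter using plain string similarity (the method and threshold are illustrative only):

```python
from difflib import SequenceMatcher


def dedup(posts: list[str], threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate filter: keep a post only if it is
    sufficiently dissimilar from every post already kept."""
    kept: list[str] = []
    for post in posts:
        if all(SequenceMatcher(None, post.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(post)
    return kept
```

In the benchmark pipeline this step would run over the ~266,000 collected posts before the content filters and detector-consensus filtering.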
Stage 2: Reverse AMT Generation
- For each standpoint S, an RT-guided beam search inverts the AMT chain:
- Select a locus L.
- Generate candidate premises P_k and maxims M_k.
- Decompose each premise P_k into endoxon–datum pairs (E_j, D_j).
- Score each edge e with a relevance value r(e) that weighs informational Effect against cognitive Cost.
- Aggregate edge scores across the ensemble models and select the maximum-relevance chain.
Output is a logically coherent, relevant chain preserving the original hostility.
High-level pseudocode:
```
Initialize beam B_0 = {([S], score = 0)}
For t in {step1, step2}:
    For each state (chain, ψ) in B_{t−1}:
        If t == step1:
            Expand via generator G_P to candidates (P_k, M_k)
        Else:
            Expand via G_{E,D} to candidates (E_j, D_j)
        For each edge e in the expansions:
            Compute r(e)
            Update the new chain and its score
    Keep the top-B chains by score
Return the best complete chain (E, D, P, L, M, S)
```
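The pseudocode can be made concrete; a runnable Python sketch in which the generators and the edge scorer are stub parameters standing in for the paper's LLM ensemble (locus selection is omitted for brevity):

```python
def beam_search(standpoint, gen_pm, gen_ed, score, beam_width=3):
    """Two-step reverse-AMT beam search.
    gen_pm(S)   -> candidate (premise, maxim) pairs
    gen_ed(P)   -> candidate (endoxon, datum) pairs
    score(edge) -> relevance value r(e) for one expansion edge
    """
    # Step 1: expand the standpoint into scored (P, M) candidates.
    beam = [((p, m), score((p, m, standpoint)))
            for p, m in gen_pm(standpoint)]
    beam = sorted(beam, key=lambda s: s[1], reverse=True)[:beam_width]

    # Step 2: decompose each surviving premise into (E, D) candidates.
    full = []
    for (p, m), r in beam:
        for e, d in gen_ed(p):
            full.append(((e, d, p, m, standpoint), r + score((e, d, p))))

    # Return the highest-scoring complete chain.
    return max(full, key=lambda s: s[1])[0]
```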
Stage 3: Benchmark Selection
- For each of 28 Level-2 target groups across 7 Level-1 domains (race/ethnicity, religion, gender, socio-economic class, politics/ideology, sexual orientation, nationality/region), retain the top 300 high-relevance generated soft-hate variants.
- Manual verification ensures policy compliance and logical coherence.
- Finalized as 4,745 validated “base” soft-hate instances.
Stage 4: Difficulty Augmentation
- GroupVague (GV): Replace the explicit [TG] with coded descriptors (e.g., “women of that faith”) using best-of-N selection that preserves semantic/inferential equivalence.
- HostilityVague (HV): A naturalistic LLM post omits the explicit target but preserves the argument structure and stance, again selected by best-of-N.
- Final dataset: Each hard-hate seed yields three soft tiers (Soft, Soft-GV, Soft-HV) plus the original hard-hate item for comparison.
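Both augmentations use the same best-of-N pattern: sample several rewrites and keep the one scoring highest on semantic/inferential equivalence. A generic sketch (the generator and equivalence scorer are placeholders, not the paper's models):

```python
def best_of_n(seed: str, generate, equiv_score, n: int = 8) -> str:
    """Generate n rewrites of `seed` and keep the rewrite that best
    preserves semantic/inferential equivalence (highest score)."""
    candidates = [generate(seed) for _ in range(n)]
    return max(candidates, key=lambda c: equiv_score(seed, c))
```

In practice `generate` would be an LLM rewriting call and `equiv_score` an NLI- or embedding-based similarity check against the base soft-hate instance.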
Summary statistics:
| Domain count | Group count | Base soft-hate | Soft variants | Total examples |
|---|---|---|---|---|
| 7 | 28 | 4,745 | 14,235 | 18,980 |
4. Experimental Protocol and Evaluation Metrics
Models Assessed
- Encoder-based classifiers: HateBERT (IMSyPP), HateBERT (GroNLP), HateRoBERTa (off-the-shelf checkpoints).
- Proprietary LLMs: DeepSeek-V3.1, GPT5-mini.
- Open-source LLMs: GPT-OSS-20B (chain-of-thought), Gemma3-4B*, Llama3.2-3B*, Qwen3-4B* (all evaluated in zero-shot moderation format).
- Safety-specialized LLMs: ShieldGemma-2B, LlamaGuard3-1B, Qwen3Guard-4B.
All LLMs use deterministic decoding (temperature = 0) under a unified content moderation prompt.
Metric
- Hate Success Rate (HSR): Fraction of hostile items correctly labeled “hateful” by the model. Since all test items are hostile, HSR functions as a pure recall measure.
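Because every test item is hostile, HSR reduces to recall over a single class; a one-function sketch (the label string is an assumption):

```python
def hate_success_rate(predictions: list[str]) -> float:
    """HSR: fraction of (all-hostile) items the model labels 'hateful'.
    Equivalent to recall, since no benign items are present."""
    if not predictions:
        return 0.0
    return sum(p == "hateful" for p in predictions) / len(predictions)
```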
Evaluation Protocol
- Each model is evaluated in three runs for mean performance per tier and per domain.
- All tiers (Hard plus Soft, Soft-GV, Soft-HV) are included.
5. Empirical Results and Diagnostic Analyses
Core Findings (select HSR results):
| Model | Hard | Soft | Δ | Soft-GV | Δ | Soft-HV | Δ |
|---|---|---|---|---|---|---|---|
| Encoder avg | 57.3% | 16.8% | –40.5 | 11.3% | –46.0 | 6.8% | –50.5 |
| Proprietary avg | 87.3% | 53.1% | –34.2 | 33.9% | –53.4 | 37.2% | –50.1 |
| Open-source avg | 91.1% | 65.0% | –26.2 | 45.0% | –46.1 | 26.5% | –64.6 |
| Safety avg | 70.4% | 35.1% | –35.3 | 37.7% | –32.8 | 17.8% | –52.6 |
| Overall avg | 76.8% | 43.5% | –33.4 | 32.9% | –43.9 | 21.2% | –55.7 |
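Each Δ column is simply the hard-tier HSR minus the corresponding soft-tier HSR, in percentage points; for example, applied to the encoder row:

```python
def hsr_drops(hard: float, soft_tiers: dict[str, float]) -> dict[str, float]:
    """Percentage-point drop from the hard tier to each soft tier."""
    return {tier: round(hard - hsr, 1) for tier, hsr in soft_tiers.items()}
```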
Systematic drops of 20–60 percentage points from the hard tier to the soft tiers mark a consistent trend: all detector classes, including advanced LLMs and safety-specialized models, degrade precipitously when hostile content is wrapped in reasoning and plausible deniability. Encoders relying on surface cues fail almost completely on Soft-HV (6.8% mean HSR). The top proprietary model (GPT5-mini) drops from 91.6% (Hard) to 70.4% (Soft), with further declines on the obfuscated variants.
Domain-Specific Observations
- The socio-economic class and politics/ideology domains are notably recalcitrant under softening; “Elite” subclass examples become almost undetectable after softening.
- In contrast, “Working class” remains more detectable, indicating variability in cue preservation across social targets.
Ablation and Inferential Diagnostic
Explicitly providing intermediate AMT elements, the premise (P) and maxim (M), to instruction-tuned LLMs during moderation nearly fully restores recall. For instance, Qwen3-4B performance on soft-hate items rises from 23% when given the post alone to 92% with the full chain (E, D, P, L, M, S). This demonstrates that detection failures stem from an inability to reconstruct the implicit hostile reasoning, not from an incapacity to recognize the ultimate hostile stance.
6. Implications, Limitations, and Recommendations
The consistent deficit in model performance on soft hate is ascribed to three interconnected factors:
- Training regimes and architectures optimized for surface toxicity and keyword matching.
- Blindness to reasoning-driven hostility that proceeds through chains of plausible value judgments to hostile standpoints.
- Limitations of generic chain-of-thought prompting, unless the explicit structure of AMT is invoked.
Recommendations for improving model robustness:
- Integrate argument-structure detection, specifically training classifiers or LLMs to infer latent premises (P) and applied maxims (M), or applying direct AMT-style annotation supervision.
- Incorporate pragmatic markers: NLI-based entailment and cost/effect analysis to flag hostile reasoning chains.
- Expand training distributions to include soft-hate instances systematically generated using frameworks like SoftHateBench.
- Refine moderation policies to enumerate and encode reasoning patterns (e.g., cause–effect, appeals to tradition) that are commonly abused in soft hate.
All code, data, and moderation prompts are openly available at the provided Hugging Face repository (Shelly97/SoftHateBench), enabling reproducible and extensible research into reasoning-driven, policy-compliant hostility detection (Su et al., 28 Jan 2026).