Harmful Speech Detection by LLMs Exhibits Gender-Queer Dialect Bias
In the paper "Harmful Speech Detection by LLMs Exhibits Gender-Queer Dialect Bias," Dorn et al. examine how well LLMs classify harmful speech in gender-queer dialect. The paper highlights potential biases in content moderation systems and is built around a novel dataset called QueerReclaimLex.
Dataset and Methodology
QueerReclaimLex is central to the paper: it was created to probe LLM bias against non-derogatory uses of LGBTQ+ slurs. The dataset is derived from NB-TwitCorpus3M, which contains approximately 3 million tweets from users who list non-binary pronouns in their biographies. Each instance is formed by substituting slurs into curated templates drawn from real tweets authored by non-binary individuals, producing a collection of posts that exemplify linguistic reclamation of derogatory terms. Annotators, all identifying as gender-queer, labeled each instance for harmfulness based on whether the author is assumed to be part of the slur's in-group or out-group.
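To make the construction concrete, here is a minimal sketch of the template-substitution idea. The placeholder token, the example template, and the slur stand-ins are assumptions for illustration; this is not the authors' code or data.

```python
# Illustrative sketch of template substitution (not the authors' code).
SLUR_PLACEHOLDER = "[SLUR]"  # hypothetical placeholder token

templates = [
    "proud to be a [SLUR] and nobody can take that from me",  # invented example template
]
slurs = ["<slur_1>", "<slur_2>"]  # stand-ins; the real dataset uses actual LGBTQ+ slurs

def instantiate(templates, slurs):
    """Cross each curated template with each slur to create dataset instances."""
    instances = []
    for template in templates:
        for slur in slurs:
            instances.append({
                "text": template.replace(SLUR_PLACEHOLDER, slur),
                "slur": slur,
                "template": template,
            })
    return instances

print(len(instantiate(templates, slurs)))  # number of generated instances
```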
The paper evaluates five off-the-shelf models: two toxicity classifiers (Detoxify and Perspective) and three LLMs (GPT-3.5, LLaMA 2, and Mistral). Different prompting schemas are tested to give the models additional context about speaker identity: vanilla (zero-shot, post only), identity (the author's in-group/out-group status is stated), and identity-cot (the identity context combined with chain-of-thought prompting).
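The three schemas can be illustrated roughly as follows. The exact prompt wording used in the paper is not reproduced here, so the strings below are assumptions for illustration only.

```python
# Rough sketch of the three prompting schemas (prompt wording is assumed, not the paper's).
def build_prompt(post, schema, in_group=None):
    if schema == "vanilla":
        # Zero-shot: the model sees only the post.
        return f"Is the following post harmful? Answer yes or no.\nPost: {post}"
    status = "a member" if in_group else "not a member"
    if schema == "identity":
        # The author's relationship to the slur's target group is stated.
        return (
            f"The author of this post is {status} of the group targeted by the slur.\n"
            f"Is the following post harmful? Answer yes or no.\nPost: {post}"
        )
    if schema == "identity-cot":
        # Same identity context, but the model is asked to reason step by step.
        return (
            f"The author of this post is {status} of the group targeted by the slur.\n"
            f"Think step by step about whether the slur is used in a reclaimed or "
            f"derogatory way, then answer: is the post harmful?\nPost: {post}"
        )
    raise ValueError(f"unknown schema: {schema}")
```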
Key Findings
Annotator Agreement and Harm Assessment
Annotator agreement, measured by Cohen's kappa, was notably higher for posts assumed to be authored by in-group members than for out-group members (0.80 versus 0.60). In-group posts were labeled as harmful in 15.5% of instances, whereas out-group posts were labeled harmful in 82.4%, a large differential in harm judgments driven by whether the author belongs to the targeted group. Certain forms of slur use, such as 'Group Label' and 'Sarcasm,' were more likely to be judged harmful when used by in-group members, suggesting intra-group derogation. Conversely, slur uses embedded in quotes or discussions, particularly concerning identity, were less likely to be deemed harmful when used by out-group members.
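For readers unfamiliar with the agreement statistic, a toy computation of Cohen's kappa is shown below; the annotation vectors are fabricated for demonstration and are not the paper's data.

```python
# Toy illustration of inter-annotator agreement via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = [0, 0, 1, 0, 1, 0, 0, 1]  # 1 = harmful, 0 = not harmful
annotator_b = [0, 0, 1, 0, 0, 0, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # agreement corrected for chance
```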
LLMs and Speaker Identity
Toxicity classifiers like Detoxify and Perspective showed high false positive rates on gender-queer dialect authored by in-group members, with F1 scores not exceeding 0.25. This suggests that these classifiers over-rely on the mere presence of a slur rather than on nuanced contextual cues, inadvertently marginalizing gender-queer voices.
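The false-positive and F1 figures come from binary harm decisions. A minimal sketch of that evaluation, assuming placeholder scores, labels, and a 0.5 decision threshold (not values from the paper), is:

```python
# Threshold a continuous toxicity score and evaluate against in-group harm labels.
# Scores, labels, and the 0.5 threshold are placeholder assumptions.
from sklearn.metrics import f1_score, confusion_matrix

gold = [0, 0, 0, 1, 0, 0]                        # annotator harm labels for in-group posts
toxicity = [0.91, 0.78, 0.40, 0.95, 0.88, 0.67]  # classifier scores in [0, 1]
pred = [int(s >= 0.5) for s in toxicity]         # threshold into harmful / not harmful

tn, fp, fn, tp = confusion_matrix(gold, pred).ravel()
print("false positive rate:", fp / (fp + tn))
print("F1:", f1_score(gold, pred))
```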
The LLMs also struggled to identify non-derogatory uses of slurs, reflected in poor F1 scores under the vanilla schema (F1 = 0.36). Adding in-group/out-group identity context improved performance slightly, but false positive rates for in-group speech remained notably high. Chain-of-thought reasoning via the identity-cot schema brought further gains, yet the models still failed to achieve satisfactory precision.
An additional analysis of the subset of posts containing clear contextual indicators of in-group membership showed that model performance remained very low (F1 = 0.24), indicating that LLMs do not sufficiently leverage context to assess the harm of slur use in gender-queer dialect.
Slur-Specific Model Behavior
The paper also shows that models assign different levels of harm to different slurs: for example, 'fag,' 'shemale,' and 'tranny' were consistently scored as more harmful across models. This dependence of harm scores on the specific slur decreased when identity context and chain-of-thought prompting were used, suggesting that additional context can partially mitigate spurious correlations between particular slurs and predicted harm.
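One simple way to probe such slur dependence is to compare mean harm scores per slur over the same set of templates; the sketch below uses invented numbers and is not the paper's exact analysis.

```python
# Compare per-slur mean harm scores; a larger spread across slurs means scores
# track the slur itself more than the surrounding context. Values are invented.
from collections import defaultdict
from statistics import mean, pstdev

# (slur, template_id, model_harm_score) triples -- placeholder values
records = [
    ("slur_a", 0, 0.92), ("slur_a", 1, 0.88),
    ("slur_b", 0, 0.45), ("slur_b", 1, 0.51),
]

by_slur = defaultdict(list)
for slur, _, score in records:
    by_slur[slur].append(score)

per_slur_means = {slur: mean(scores) for slur, scores in by_slur.items()}
print(per_slur_means)
print("spread across slurs:", pstdev(per_slur_means.values()))
```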
Implications and Future Research
This work underscores the need for refining content moderation algorithms to fairly represent and protect gender-queer communities. The significant false positive rates in the detection of harmful speech by both toxicity classifiers and LLMs signal an urgent need for models that incorporate nuanced linguistic and contextual cues. This could entail the development of datasets rich in in-group speech or improved models that can dynamically integrate identity context.
Future research could expand on this work by covering a more diverse range of marginalized communities or by training models to align more closely with the linguistic norms of these groups. Additionally, supplementing LLMs with annotated data from a wider demographic could help calibrate more equitable moderation practices.
The findings presented by Dorn et al. are crucial for informing the development of more inclusive AI systems, enhancing fairness in digital discourse, and ensuring the responsible deployment of technology in online social spaces.