- The paper demonstrates that NLP models erroneously censor counterspeech by conflating the actual use of problematic language with its mention.
- It evaluates use-mention classification and downstream tasks using paired statement instances from established datasets.
- Mitigation strategies such as few-shot examples and chain-of-thought prompting show promise in reducing misclassification errors.
Addressing the Use-Mention Distinction in NLP Systems for More Accurate Counterspeech Identification
Introduction to the Use-Mention Distinction
In NLP, particularly in online content moderation, distinguishing between the use of offensive or problematic language and the mention of such language is critical but challenging. The distinction becomes even more vital for counterspeech, content that refutes hate speech or misinformation by mentioning it without endorsing it. Current NLP models, including state-of-the-art LLMs, often fail to make this distinction, leading to the erroneous censorship of counterspeech. This paper presents a comprehensive study of the issue, examining the consequences of failing to recognize the use-mention distinction and proposing mitigation strategies.
Technical Challenges and Methodology
The authors identify and formalize two key tasks: use-mention classification and downstream content classification. The former asks whether a text uses problematic language or merely mentions it; the latter is standard hate speech and misinformation detection with an added requirement: correctly handling mentions of problematic content. The paper uses pairs of statement instances (a genuine harmful use and a counterspeech mention of the same statement) drawn from existing datasets, highlighting how difficult the two are to distinguish.
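To make the setup concrete, here is a minimal sketch of how such a paired instance and the two tasks could be represented; the class, field names, and example texts are illustrative assumptions, not drawn from the paper's datasets or code.

```python
from dataclasses import dataclass

@dataclass
class PairedInstance:
    """One statement realized two ways: as a harmful use and as a counterspeech mention."""
    statement: str     # the underlying problematic claim
    use_text: str      # text that asserts (uses) the claim
    mention_text: str  # counterspeech that quotes (mentions) the claim in order to refute it

# Hypothetical pair for illustration only.
pair = PairedInstance(
    statement="The vaccine alters your DNA.",
    use_text="The vaccine alters your DNA, so don't take it.",
    mention_text='People keep claiming "the vaccine alters your DNA," but that claim is false.',
)

# Task 1 (use-mention classification): label use_text as USE and mention_text as MENTION.
# Task 2 (downstream classification): a misinformation detector should flag use_text
# but not mention_text, which only mentions the claim in order to refute it.
```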
Several models are evaluated on their ability to correctly identify use versus mention and on their performance in downstream tasks. The investigation reveals that LLMs exhibit high error rates in both tasks, compromising the reliability of NLP-based content moderation systems.
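An evaluation loop consistent with this paired setup might look roughly as follows; `classify` stands in for whatever model is under test, and the function is a sketch rather than the paper's evaluation code.

```python
def evaluate_pairs(pairs, classify):
    """Measure the two failure modes on paired instances.

    `classify(text)` is assumed to return "harmful" or "not_harmful".
    A correct system flags the use but leaves the counterspeech mention alone.
    """
    missed_uses = 0        # harmful uses the model failed to flag
    censored_mentions = 0  # counterspeech mentions the model wrongly flagged
    for p in pairs:
        if classify(p.use_text) != "harmful":
            missed_uses += 1
        if classify(p.mention_text) == "harmful":
            censored_mentions += 1
    n = len(pairs)
    return {
        "missed_use_rate": missed_uses / n,
        "censored_mention_rate": censored_mentions / n,
    }
```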
Findings and Analysis
The paper's results indicate significant shortcomings in current models' ability to distinguish between the use and mention of problematic language. These deficiencies not only impact the use-mention classification task but also carry over to downstream tasks, leading to the unwarranted censorship of counterspeech. The analysis further dissects the nature of these errors, identifying several factors that contribute to the misclassification of mentions as harmful uses—including the presence of specific identity terms, COVID-19-related terms, and the strength of stance expressed in the surrounding context.
Interestingly, the paper also highlights how the presence of quotation marks, often used to denote mentions, paradoxically increases the likelihood of counterspeech being misclassified as harmful, pointing to a surface-level reliance on textual features by NLP models.
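One way to probe the quotation-mark effect described above is to compare a classifier's prediction on the same counterspeech text with and without the quotes; the helper below is an illustrative probe under that assumption, not the paper's analysis code.

```python
def quotation_probe(mention_text, classify):
    """Compare predictions on a counterspeech mention with and without quotation marks.

    Surface-level reliance on quotes shows up when the two predictions disagree,
    e.g. the quoted version is flagged as harmful while the unquoted one is not.
    """
    # Remove straight and curly double quotes, leaving the wording otherwise unchanged.
    stripped = mention_text.replace('"', "").replace("\u201c", "").replace("\u201d", "")
    return {
        "with_quotes": classify(mention_text),
        "without_quotes": classify(stripped),
    }
```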
Implications and Future Directions
The inability of NLP systems to accurately recognize the use-mention distinction has profound implications for online discourse, particularly in content moderation practices. Erroneous classification of counterspeech as harmful can stifle vital discussions, undermine efforts to combat hate speech and misinformation, and disproportionately affect marginalized communities.
In response to these challenges, the paper explores several mitigation strategies, including the use of few-shot examples and chain-of-thought prompting, which show promise in reducing classification errors. These findings underscore the need for more nuanced approaches to training and deploying NLP models, emphasizing the importance of context and the intended use of language in classification tasks.
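These mitigations amount to changing how the model is prompted. The sketch below shows what a few-shot, chain-of-thought prompt for use-mention classification could look like; the wording and examples are illustrative assumptions, not the prompts evaluated in the paper.

```python
FEW_SHOT_COT_PROMPT = """\
Decide whether the text USES problematic language or merely MENTIONS it
(for example, quoting or reporting it in order to refute it).

Text: Group X are criminals and should be kept out.
Reasoning: The text directly asserts a hostile claim about a group, so it uses problematic language.
Label: USE

Text: My neighbor said "Group X are criminals," which is a baseless and harmful stereotype.
Reasoning: The hostile claim is quoted only to identify and reject it; the writer does not endorse it.
Label: MENTION

Text: {input_text}
Reasoning:"""


def build_prompt(input_text: str) -> str:
    # Fill the few-shot, chain-of-thought template with the text to classify.
    return FEW_SHOT_COT_PROMPT.format(input_text=input_text)
```

Eliciting the reasoning step before the label is what distinguishes the chain-of-thought variant from plain few-shot prompting.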
Conclusion
This paper contributes to our understanding of a key challenge in the field of NLP—accurately distinguishing between the use and mention of problematic language. By highlighting the limitations of current models and proposing effective mitigation strategies, the authors pave the way for more accurate and fair content moderation practices. As NLP continues to evolve, addressing the use-mention distinction will be crucial in ensuring that technology supports, rather than hinders, healthy online environments.