Toxicity in ChatGPT: Analyzing Persona-assigned Language Models (2304.05335v1)

Published 11 Apr 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have shown incredible capabilities and transcended the NLP community, with adoption throughout many services like healthcare, therapy, education, and customer service. Since users include people with critical information needs like students or patients engaging with chatbots, the safety of these systems is of prime importance. Therefore, a clear understanding of the capabilities and limitations of LLMs is necessary. To this end, we systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM. We find that setting the system parameter of ChatGPT by assigning it a persona, say that of the boxer Muhammad Ali, significantly increases the toxicity of generations. Depending on the persona assigned to ChatGPT, its toxicity can increase up to 6x, with outputs engaging in incorrect stereotypes, harmful dialogue, and hurtful opinions. This may be potentially defamatory to the persona and harmful to an unsuspecting user. Furthermore, we find concerning patterns where specific entities (e.g., certain races) are targeted more than others (3x more) irrespective of the assigned persona, that reflect inherent discriminatory biases in the model. We hope that our findings inspire the broader AI community to rethink the efficacy of current safety guardrails and develop better techniques that lead to robust, safe, and trustworthy AI systems.

PDF Abstract

Insights into Toxicity in ChatGPT with Persona Assignments

This paper presents a comprehensive analysis of how persona assignments affect the toxicity level in text generation by ChatGPT, a popular LLM. ChatGPT's use extends across various sectors, including healthcare, education, and customer service, highlighting the importance of ensuring that its outputs remain non-toxic and non-biased. This research provides a systematic evaluation of toxicity resulting from over half a million ChatGPT-generated texts.

Main Findings

The paper indicates a significant increase in toxicity when ChatGPT is assigned a specific persona. By assigning personas, such as historical figures like Muhammad Ali or contentious ones like Adolf Hitler, the toxicity of generated outputs can increase up to sixfold. The paper asserts that the model exhibits heightened bias and unfounded opinions, targeting specific entities more than others, regardless of the persona assigned. Here are major highlights of their findings:

Baseline Person Assignments: Assigning personas with a neutral or positive connotation, such as "a good person," resulted in relatively low toxicity scores. However, when personas suggesting negative traits were used, such as "a bad person," the model displayed significantly higher toxicity levels. This indicates that the persona assigned directly influences the nature of the generated content.
Persona-Specific Toxicity Variation: Different categories of personas, such as dictators, showed higher toxicity levels compared to others like businesspersons and sports figures. Within each persona category, the toxicity varied substantially, with some prominent political figures yielding up to three times more toxicity in outputs than others.
Entity-Specific Bias: The paper reveals that certain groups, races, and sexual orientations receive disproportionately higher toxicity in generated outputs. This bias is illustrated by significantly more toxic outputs aimed at entities related to countries with colonial histories and more toxicity towards non-binary genders compared to others.
Impact of Prompt Styles: The prompt style also influences the toxicity, with prompts explicitly soliciting negative output generating higher toxicity levels. This aspect indicates susceptibility to prompt engineering, which could be exploited if unchecked.

Methodologies Employed

The research utilized the perspectiveapi for toxicity measurement, generating multiple responses for each pair of persona and entity to compute the Probability of Responding (por). Variations in toxicity across multiple iterations of response generation from ChatGPT were examined, and the perspectiveapi’s confidence was proven via extensive manual verification.

Implications

The research highlights substantial implications for the practical deployment of LLMs like ChatGPT:

Safety and Trustworthiness: The findings prompt urgent narratives surrounding the reevaluation of current safety measures implemented in LLMs. This paper advocates for the development of more robust, consistent safety guardrails.
Specification Sheets for AI Models: Drawing from safety parallels in other industries, the paper suggests the introduction of AI 'specification sheets,' including insights on potential biases and limitations. This could guide businesses leveraging AI in forecasting possible adverse outputs.
Broader Impact Considerations: The research calls for a deeper engagement with socio-technical paradigms influencing LLM deployment in sensitive sectors—asserting that technical fixes alone may not address underlying biases within AI systems.

Conclusions and Future Directions

This paper provides a critical analysis of how persona assignments impact the toxicity levels in ChatGPT's outputs, motivating a re-examination of AI LLM deployment. The authors underscore the need for systematic safety assessments that consider contextual and prompt-dependent variations in toxicity. Future work should explore the integration of diverse stakeholder feedback in refining training processes and strengthening the ethical deployment of AI, especially in conversational settings. Addressing these issues could lead to more equitable, less biased AI systems, fostering broader societal trust in these transformative technologies.

PDF Markdown Bookmark Chat (Pro)

Authors (5)

Ameet Deshpande (28 papers)
Vishvak Murahari (14 papers)
Tanmay Rajpurohit (16 papers)
Ashwin Kalyan (26 papers)
Karthik Narasimhan (82 papers)

Citations (284)

View on Semantic Scholar