Realistic Evaluation of Toxicity in Large Language Models (2405.10659v2)

Published 17 May 2024 in cs.CL and cs.AI
Abstract: LLMs have become integral to our professional workflows and daily lives. Nevertheless, these machine companions of ours have a critical flaw: the huge amount of data which endows them with vast and diverse knowledge, also exposes them to the inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET in providing a rigorous benchmark for evaluation of toxicity awareness in several popular LLMs: it highlights the toxicity in the LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.

Realistic Evaluation of Toxicity in LLMs

The paper, "Realistic Evaluation of Toxicity in LLMs," provides an in-depth examination of the propensity of LLMs to generate toxic content, emphasizing the limitations of conventional toxicity assessments and introducing a novel dataset, the Thoroughly Engineered Toxicity (TET) dataset. The paper underscores the importance of realistic prompt scenarios for evaluating the safety of LLMs and examines the vulnerabilities of these models to engineered prompts intended to trigger toxic responses.

Key Contributions

The primary contribution of the paper is the introduction of the TET dataset, which aggregates prompts drawn from over one million real-world interactions with 25 distinct LLMs. These prompts are specifically crafted to assess models under realistic and jailbreak scenarios, thereby providing a more accurate representation of how these models might be exploited in practical settings. Unlike existing datasets such as RealToxicityPrompts and ToxiGen, TET covers a broader spectrum of interaction contexts, particularly ones designed to bypass the protective mechanisms commonly embedded in LLMs. A rough sketch of this kind of prompt curation follows.
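
To make the curation step concrete, the snippet below is a minimal, hypothetical sketch of sampling single-turn user prompts from the public LMSYS-Chat-1M release on Hugging Face. The actual TET pipeline involves manual crafting and jailbreak-style templates that are not reproduced here, and the dataset field names are assumptions based on the public release.

```python
# Hypothetical sketch: pulling candidate prompts from LMSYS-Chat-1M.
# The real TET curation (manual crafting, jailbreak templates) is not shown.
# Field names follow the public lmsys/lmsys-chat-1m release; access is gated,
# so a Hugging Face token with dataset access is required.
from datasets import load_dataset

ds = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)

candidate_prompts = []
for row in ds:
    # Keep only the first user turn of each conversation as a prompt seed.
    first_turn = row["conversation"][0]
    if first_turn["role"] == "user":
        candidate_prompts.append(first_turn["content"])
    if len(candidate_prompts) >= 10_000:  # small sample for illustration
        break
```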

Additionally, the authors conduct comparative analyses of various LLMs using the TET dataset, which is complemented by toxicity classification tools such as HateBERT and the Perspective API. This leads to a systematic assessment of how LLMs perform under exposure to prompts of varying toxicity levels.
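
As an illustration of the scoring setup, here is a minimal sketch of querying the Perspective API for the six attributes reported in the paper. The endpoint and attribute names come from the public Perspective API; the key handling and error treatment are simplified, and the `score_response` helper is an assumption rather than the authors' code.

```python
# Minimal sketch of scoring one model response with the Perspective API
# across the six dimensions used in the paper.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
ATTRIBUTES = ["TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
              "INSULT", "PROFANITY", "THREAT"]

def score_response(text: str, api_key: str) -> dict:
    """Return an {attribute: probability} dict for one model response."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {attr: {} for attr in ATTRIBUTES},
        "doNotStore": True,
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=payload)
    resp.raise_for_status()
    scores = resp.json()["attributeScores"]
    return {attr: scores[attr]["summaryScore"]["value"] for attr in ATTRIBUTES}
```

In practice the same call would be made once per generated response, with rate limiting as required by the API quota.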

Experimental Design and Findings

The experimental design involves evaluating responses from multiple LLMs (including ChatGPT, Llama2, and others) to the TET prompts using the Perspective API. Toxicity is measured across six dimensions: toxicity, severe toxicity, identity attack, insult, profanity, and threat. Among the evaluated models, Llama2-70B-Chat exhibits a notably lower overall toxicity score (17.901), highlighting its relative strength in minimizing toxic outputs. Conversely, models such as Mistral-7B-v0.1 and Zephyr-7B-β are more susceptible to generating toxic content, indicating areas for improvement in LLM development.
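
The exact aggregation behind the reported overall scores is not spelled out in this summary, so the sketch below simply assumes that each dimension is averaged over all responses, scaled to a 0-100 range, and that the per-dimension means are summed into a single overall figure; treat the `summarize` helper and this aggregation rule as assumptions.

```python
# Hedged sketch: aggregating per-response Perspective scores into a
# per-model summary. The "overall" rule (mean per dimension, scaled to
# 0-100, then summed) is an assumption, not the paper's stated method.
from statistics import mean

def summarize(per_response_scores: list[dict]) -> dict:
    """per_response_scores: one {attribute: score} dict per model response."""
    dims = per_response_scores[0].keys()
    per_dim = {d: 100 * mean(s[d] for s in per_response_scores) for d in dims}
    per_dim["overall"] = sum(per_dim.values())  # assumed aggregation
    return per_dim
```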

Furthermore, the paper provides quantitative evidence that TET prompts elicit significantly more toxic responses from the LLMs compared to the ToxiGen dataset, underlining the effectiveness of TET in revealing potential risks associated with LLM usage. Specifically, Table 2 delineates these results, showcasing higher toxicity scores from TET even when prompt toxicity levels are similar.

Implications and Future Directions

This research has pivotal implications for both the academic and practical development of AI models. Practically, it challenges developers to strengthen the robustness of LLMs against adversarial prompts that can bypass safeguards and elicit harmful content. Theoretically, the paper provides a benchmark for evaluating LLM safety, pushing beyond conventional toxicity detection approaches by incorporating realistic and creative prompt scenarios.

Future research could extend this work by incorporating a broader array of conversational contexts and expanding evaluations to include an even wider variety of model architectures. Moreover, further exploration is needed to better understand the nuanced interactions between different prompt templates and model responses, particularly in developing comprehensive defense mechanisms against both explicit and implicit bias or toxicity in generated content.

Limitations

While comprehensive, the paper recognizes certain limitations. It acknowledges that future work should more fully incorporate conversational context into safety assessments, and notes that computational constraints limited the benchmarking of larger, widely used models. The authors also recognize the need for ongoing updates to the findings as models evolve and as the landscape of real-world applications and interactions continues to change.

The introduction of the TET dataset marks an important advancement in the rigorous assessment of LLM safety, setting a new standard for future research dedicated to the responsible evolution of AI technologies. This work stands as a testament to the evolving nature of AI development and the ongoing need to address ethical considerations through empirical rigor and methodical evaluation processes.

References (21)
  1. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  2. HateBERT: Retraining BERT for abusive language detection in English. arXiv preprint arXiv:2010.12472.
  3. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109.
  4. Toxicity in ChatGPT: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335.
  5. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  6. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.
  7. Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus.
  8. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509.
  9. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
  10. Mistral 7B. arXiv preprint arXiv:2310.06825.
  11. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
  12. Orca 2: Teaching small language models how to reason.
  13. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
  14. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  15. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  16. Zephyr: Direct distillation of LM alignment.
  17. OpenChat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
  18. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087.
  19. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  20. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset.
  21. Exploring AI ethics of ChatGPT: A diagnostic analysis. arXiv preprint arXiv:2301.12867.
Authors (4)
  1. Tinh Son Luong (2 papers)
  2. Thanh-Thien Le (6 papers)
  3. Linh Ngo Van (12 papers)
  4. Thien Huu Nguyen (61 papers)