RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios? (2404.14397v2)
Abstract: Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is specifically designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when holistically scoring the toxicity of a prompt, and they have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g., microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and to improve their safe deployment.
- Adrian de Wynter (20 papers)
- Ishaan Watts (4 papers)
- Nektar Ege Altıntoprak (1 paper)
- Tua Wongsangaroonsri (1 paper)
- Minghui Zhang (42 papers)
- Noura Farra (6 papers)
- Lena Baur (1 paper)
- Samantha Claudet (1 paper)
- Pavel Gajdusek (1 paper)
- Can Gören (1 paper)
- Qilong Gu (8 papers)
- Anna Kaminska (15 papers)
- Ruby Kuo (1 paper)
- Akiko Kyuba (1 paper)
- Jongho Lee (38 papers)
- Kartik Mathur (1 paper)
- Petter Merok (1 paper)
- Nani Paananen (1 paper)
- Vesa-Matti Paananen (1 paper)
- Anna Pavlenko (4 papers)