From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models (2403.03893v3)

Published 6 Mar 2024 in cs.CL and cs.AI

Abstract: To date, toxicity mitigation in LLMs has almost entirely been focused on single-language settings. As LLMs embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation. We also explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study represents a broad array of linguistic families and levels of resource availability, ranging from high to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering valuable insights and paving the way for future research in this increasingly important field. Code and data are available at https://github.com/for-ai/goodtriever.

Exploring Multilingual Toxicity Mitigation in LLMs

Introduction

The rapid adoption of LLMs across diverse applications has highlighted the importance of their multilingual capabilities. However, serving linguistically diverse users amplifies the need for robust toxicity mitigation techniques that extend beyond English to ensure global usability and safety. The paper examines the complexities of implementing such multilingual toxicity mitigation: it evaluates the effectiveness of translated data versus in-language data, compares retrieval-augmented techniques against finetuning approaches, and investigates how these mitigation strategies scale across multiple languages.

Mitigation Techniques

The two primary toxicity mitigation techniques examined are DExperts, a finetuning-based method that steers generation with finetuned expert and anti-expert models, and Goodtriever, a retrieval-augmented approach. Both techniques build on a base mGPT model, with sizes ranging from 1.3B to 13B parameters, and are evaluated across nine languages. This linguistic range spans five distinct scripts and includes high-resource languages (English, Russian, Italian, French, Portuguese, and Spanish) as well as mid-resource languages (Arabic, Hindi, and Korean).
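To make the finetuning-based baseline concrete, the following is a minimal sketch of a DExperts-style decoding-time ensemble, in which a non-toxic expert and a toxic anti-expert steer the base LM's next-token logits; the toy logit values and the alpha setting are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def dexperts_next_token_logits(base_logits, expert_logits, antiexpert_logits, alpha=2.0):
    """DExperts-style decoding-time ensemble: push the base LM's distribution
    toward the non-toxic expert and away from the toxic anti-expert.
    alpha controls the strength of the steering."""
    return base_logits + alpha * (expert_logits - antiexpert_logits)

# Toy 5-token vocabulary; the logit values below are made up for illustration.
base = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0])         # base mGPT logits
expert = torch.tensor([2.2, 0.2, 0.6, -1.5, 0.1])       # expert finetuned on non-toxic text
antiexpert = torch.tensor([1.0, 2.5, 0.4, -0.5, 0.0])   # anti-expert finetuned on toxic text

probs = F.softmax(dexperts_next_token_logits(base, expert, antiexpert), dim=-1)
print(probs)  # token 1, favored by the anti-expert, is strongly down-weighted
```

Goodtriever applies a similar ensembling idea at decoding time, but derives its steering distributions from retrieval over datastores of toxic and non-toxic text rather than from finetuned expert models, which makes it straightforward to update with new data.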

Datasets and Evaluation

The paper extends established datasets with translated variants to address the scarcity of in-language toxicity annotations for many languages. For evaluation, it employs a set of standardized prompts derived from the HolisticBias dataset, translated into the languages of interest. This supports a consistent assessment of toxicity across languages despite inherent challenges such as cultural nuances and translation inaccuracies.
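As a minimal sketch of how such an evaluation is typically aggregated, the snippet below computes Expected Maximum Toxicity and Toxicity Probability, the two metrics commonly reported in this line of work, from per-continuation toxicity scores obtained from an external classifier such as Perspective API; the paper's exact scoring pipeline and number of sampled continuations may differ.

```python
import numpy as np

def toxicity_metrics(scores, threshold=0.5):
    """scores: (num_prompts, k) array of per-continuation toxicity scores in
    [0, 1], e.g. from an external classifier. Returns Expected Maximum
    Toxicity and Toxicity Probability over the prompt set."""
    scores = np.asarray(scores, dtype=float)
    max_per_prompt = scores.max(axis=1)                   # worst continuation per prompt
    expected_max_toxicity = max_per_prompt.mean()
    toxicity_probability = (max_per_prompt >= threshold).mean()
    return expected_max_toxicity, toxicity_probability

# Toy example: 3 translated prompts, 4 sampled continuations each (made-up scores).
scores = [[0.10, 0.20, 0.70, 0.30],
          [0.05, 0.10, 0.20, 0.15],
          [0.40, 0.60, 0.55, 0.20]]
emt, tox_prob = toxicity_metrics(scores)
print(f"Expected Max Toxicity: {emt:.2f}, Toxicity Probability: {tox_prob:.2f}")
```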

Findings

A key finding is the surprising efficacy of translated data in reducing toxicity, often surpassing results obtained with in-language datasets. This holds across both high- and mid-resource languages, suggesting that despite potential losses in translation, the core toxicity cues are preserved and can be mitigated effectively. Further, the retrieval-based Goodtriever method consistently outperforms the finetuning-based DExperts, especially for mid-resource languages and in more complex multilingual settings.
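One plausible reason the retrieval-based approach fares better in continual and multilingual settings is that extending it amounts to appending examples to a datastore rather than finetuning new experts. The toy sketch below illustrates that property only; the class, embeddings, and tokens are hypothetical stand-ins, not Goodtriever's actual interface, which builds datastores from a trained LM's hidden states.

```python
import numpy as np

class ToyDatastore:
    """Toy nearest-neighbour datastore holding (context embedding, next token)
    pairs. Illustrative only: real retrieval-augmented LMs key on a trained
    model's hidden states and use an ANN index such as FAISS."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim), dtype=np.float32)
        self.values = []

    def add(self, embeddings, next_tokens):
        # Supporting a new language (or newly surfaced toxic content) is just
        # an append; no gradient updates to the underlying model are needed.
        self.keys = np.vstack([self.keys, np.asarray(embeddings, dtype=np.float32)])
        self.values.extend(next_tokens)

    def retrieve(self, query, k=2):
        dists = np.linalg.norm(self.keys - np.asarray(query, dtype=np.float32), axis=1)
        return [self.values[i] for i in np.argsort(dists)[:k]]

store = ToyDatastore(dim=3)
store.add([[0.1, 0.2, 0.3]], ["tok_en"])   # initial (toy) English entries
store.add([[0.2, 0.1, 0.4]], ["tok_pt"])   # Portuguese added later, incrementally
print(store.retrieve([0.12, 0.18, 0.32], k=1))
```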

Future Directions

The paper sheds light on the importance of continually evolving toxicity mitigation techniques to accommodate the dynamic nature of language and the diversifying spectrum of users engaging with LLMs. It underscores the need for further research into developing more nuanced and culturally sensitive evaluation frameworks that better reflect the multilingual and multicultural reality of global LLM deployment.

Implications

This research marks a pivotal step towards understanding and implementing multilingual toxicity mitigation in LLMs. It paves the way for future explorations into scalable, effective methods that ensure safer, more inclusive language technologies. By demonstrating the potential of both translated data and retrieval-augmented techniques, the paper offers valuable insights for developers and researchers aiming to enhance the global usability and safety of LLMs.

Authors (4)
  1. Luiza Pozzobon
  2. Patrick Lewis
  3. Sara Hooker
  4. Beyza Ermis