MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector (2401.05060v2)
Abstract: Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels. The dataset comprises 20,000 audio utterances for English and Spanish, and 4,000 for the other 19 languages. To demonstrate the quality of this dataset, we trained the MuTox audio-based toxicity classifier, which enables zero-shot toxicity detection across a wide range of languages. This classifier outperforms existing text-based trainable classifiers by more than 1% AUC, while expanding the language coverage more than tenfold. When compared to a wordlist-based classifier that covers a similar number of languages, MuTox improves precision and recall by approximately 2.5 times. This significant improvement underscores the potential of MuTox in advancing the field of audio-based toxicity detection.
- Marta R. Costa-jussà
- Mariano Coria Meglioli
- Pierre Andrews
- David Dale
- Prangthip Hansanti
- Elahe Kalbassi
- Alex Mourachko
- Christophe Ropers
- Carleigh Wood