Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification (2311.13937v1)
Abstract: Text detoxification is the task of transferring the style of text from toxic to neutral. While there are approaches yielding promising results in the monolingual setup, e.g., (Dale et al., 2021; Hallinan et al., 2022), cross-lingual transfer for this task remains a challenging open problem (Moskovskiy et al., 2022). In this work, we present a large-scale study of strategies for cross-lingual text detoxification: given a parallel detoxification corpus for one language, the goal is to transfer the detoxification ability to another language for which no such corpus is available. Moreover, we are the first to explore a new task in which text translation and detoxification are performed simultaneously, providing several strong baselines for it. Finally, we introduce new automatic detoxification evaluation metrics that correlate with human judgments more strongly than previous benchmarks. We also assess the most promising approaches with manual annotation, identifying the best strategy for transferring text detoxification knowledge between languages.
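One natural baseline implied by this setup is a translate–detoxify–backtranslate pipeline: translate the input into the language that has a parallel detoxification corpus, apply the monolingual detoxification model there, and translate the result back. The sketch below illustrates only the pipeline structure; all three model calls are hypothetical stubs standing in for real MT and detoxification models (e.g., OPUS-MT or NLLB checkpoints for translation), not the paper's actual systems.

```python
# Sketch of a translate-detoxify-backtranslate baseline for
# cross-lingual text detoxification. The three functions below are
# hypothetical stand-ins: in practice each would wrap a trained model.

def translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for a machine-translation call (src -> tgt)."""
    return f"[{src}->{tgt}] {text}"

def detoxify_en(text: str) -> str:
    """Stand-in for a monolingual (English) detoxification model,
    here a trivial lexical substitution for illustration only."""
    return text.replace("toxic", "neutral")

def cross_lingual_detoxify(text: str, lang: str) -> str:
    """Detoxify `text` written in `lang` by pivoting through English."""
    english = translate(text, src=lang, tgt="en")
    cleaned = detoxify_en(english)
    return translate(cleaned, src="en", tgt=lang)

print(cross_lingual_detoxify("some toxic sentence", lang="ru"))
```

The pipeline's weakness, which motivates the joint translation-and-detoxification task described above, is that errors compound across the two translation steps and the detoxification model never sees the source language directly.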
- A large-scale computational study of content preservation measures for text style transfer and paraphrase generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 300–321, Dublin, Ireland. Association for Computational Linguistics.
- Olá, bonjour, salve! XFORMAL: A benchmark for multilingual formality style transfer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3199–3216, Online. Association for Computational Linguistics.
- No language left behind: Scaling human-centered machine translation. CoRR, abs/2207.04672.
- Mathias Creutz. 2018. Open subtitles paraphrase corpus for six languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).
- Text detoxification using large pre-trained neural models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7979–7996, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- RUSSE-2022: Findings of the first Russian detoxification task based on parallel corpora. In Computational Linguistics and Intellectual Technologies.
- Methods for detoxification of texts for the Russian language. Multimodal Technol. Interact., 5(9):54.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Fighting offensive language on social media with unsupervised text style transfer. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 189–194. Association for Computational Linguistics.
- A multilingual offensive language detection method based on transfer learning from transformer fine-tuning model. Journal of King Saud University-Computer and Information Sciences, 34(8):6048–6056.
- Beyond English-centric multilingual machine translation. J. Mach. Learn. Res., 22:107:1–107:48.
- Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 878–891. Association for Computational Linguistics.
- Automatically ranked Russian paraphrase corpus for text generation. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 54–59, Online. Association for Computational Linguistics.
- Ilya Gusev. 2022. Russian texts detoxification with Levenshtein editing. CoRR, abs/2204.13638.
- Detoxifying text with MaRCo: Controllable revision with experts and anti-experts. CoRR, abs/2212.10543.
- CrossSum: Beyond English-centric cross-lingual abstractive text summarization for 1500+ language pairs. CoRR, abs/2112.08804.
- Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.
- Jigsaw. 2018. Toxic comment classification challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge. Accessed: 2021-03-01.
- Jigsaw. 2020. Jigsaw multilingual toxic comment classification. https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification. Accessed: 2021-03-01.
- Disentangled representation learning for non-parallel text style transfer. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 424–434. Association for Computational Linguistics.
- Few-shot controllable style transfer for low-resource multilingual settings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7439–7468. Association for Computational Linguistics.
- Yuri Kuratov and Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for the Russian language. CoRR, abs/1905.07213.
- Human judgement as a compass to navigate automatic metrics for formality transfer. In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), pages 102–115, Dublin, Ireland. Association for Computational Linguistics.
- Multilingual pre-training with language and task adaptation for multilingual text style transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 262–271. Association for Computational Linguistics.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, May 23-28, 2016. European Language Resources Association (ELRA).
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
- A study on manual and automatic evaluation for text style transfer: The case of detoxification. In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), pages 90–101, Dublin, Ireland. Association for Computational Linguistics.
- ParaDetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804–6818, Dublin, Ireland. Association for Computational Linguistics.
- RuPAWS: A Russian adversarial dataset for paraphrase identification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, pages 5683–5691. European Language Resources Association.
- RuCoLA: Russian corpus of linguistic acceptability. CoRR, abs/2210.12814.
- BEEP! Korean corpus of online news comments for toxic speech detection. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media, SocialNLP@ACL 2020, Online, July 10, 2020, pages 25–31. Association for Computational Linguistics.
- Exploring cross-lingual text detoxification with large multilingual language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 346–354, Dublin, Ireland. Association for Computational Linguistics.
- Facebook FAIR’s WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, pages 314–319. Association for Computational Linguistics.
- A call for standardization and validation of text style transfer evaluation. CoRR, abs/2306.00539.
- Laura Perez-Beltrachini and Mirella Lapata. 2021. Models and datasets for cross-lingual summarisation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 9408–9423. Association for Computational Linguistics.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
- Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana. Association for Computational Linguistics.
- BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
- Aleksandr Semiletov. 2020. Toxic russian comments. https://www.kaggle.com/alexandersemiletov/toxic-russian-comments. Accessed: 2021-07-22.
- Multilingual translation with extensible multilingual pretraining and finetuning. CoRR, abs/2008.00401.
- Jörg Tiedemann. 2020. The tatoeba translation challenge - realistic data sets for low resource and multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020, pages 1174–1182. Association for Computational Linguistics.
- Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT - building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, EAMT 2020, Lisboa, Portugal, November 3-5, 2020, pages 479–480. European Association for Machine Translation.
- Towards a friendly online community: An unsupervised style transfer framework for profanity redaction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2107–2114, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Natalia Vanetik and Elisheva Mimoun. 2022. Detection of racist language in French tweets. Inf., 13(7):318.
- Deep-BERT: Transfer learning for classifying multilingual offensive texts on social media. Comput. Syst. Sci. Eng., 44(2):1775–1791.
- Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics, 7:625–641.
- Beyond BLEU: Training neural machine translation with semantic similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355, Florence, Italy. Association for Computational Linguistics.
- John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Melbourne, Australia. Association for Computational Linguistics.
- "Mask and infill": Applying masked language model to sentiment transfer. CoRR, abs/1908.08039.
- mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 483–498. Association for Computational Linguistics.
- Daryna Dementieva
- Daniil Moskovskiy
- David Dale
- Alexander Panchenko