Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges

Published 17 Jan 2024 in cs.CL | (2401.09244v1)

Abstract: The growing prevalence and rapid evolution of offensive language in social media amplify the complexities of detection, particularly highlighting the challenges in identifying such content across diverse languages. This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning (CLTL) techniques in offensive language detection in social media. Our study stands as the first holistic overview to focus exclusively on the cross-lingual scenario in this domain. We analyse 67 relevant papers and categorise these studies across various dimensions, including the characteristics of multilingual datasets used, the cross-lingual resources employed, and the specific CLTL strategies implemented. According to "what to transfer", we also summarise three main CLTL transfer approaches: instance, feature, and parameter transfer. Additionally, we shed light on the current challenges and future research opportunities in this field. Furthermore, we have made our survey resources available online, including two comprehensive tables that provide accessible references to the multilingual datasets and CLTL methods used in the reviewed literature.

Summary

The paper offers a systematic review of 67 studies, evaluating datasets, transfer techniques, and challenges in detecting offensive language across languages.
It categorizes transfer methods into instance, feature, and parameter levels, highlighting strategies like machine translation and zero-shot learning.
The survey identifies key challenges such as data scarcity, linguistic nuances, and annotation inconsistencies, suggesting paths for future research.

Introduction

Cross-Lingual Transfer Learning (CLTL) is an increasingly studied approach to offensive language detection on social media platforms. Detection is harder in the multilingual setting because offensive content varies significantly with linguistic nuance and cultural context. CLTL strategies help mitigate the data scarcity encountered in low-resource languages. The survey examines 67 studies on cross-lingual detection and organizes them by the datasets used, the cross-lingual resources employed, and the level of transfer: instance, feature, or parameter.

Existing Datasets and Cross-Lingual Resources

Multilingual datasets serve as the foundation for cross-lingual studies, but they are often limited by factors such as data scarcity, linguistic diversity, and annotation challenges. The review catalogues 82 multilingual datasets, noting their topics, source platforms, language families, sizes, availability, and label typologies. The datasets are predominantly in Indo-European languages, followed by Semitic languages such as Arabic. In addition to existing datasets, cross-lingual resources such as multilingual lexicons, parallel corpora, and machine translation tools are critical to aligning linguistic features and facilitating CLTL.

Transfer Approaches in CLTL

The study categorizes CLTL approaches into three main levels:

Instance Transfer: This focuses on transferring data elements, such as text or labels, across languages, using techniques like annotation projection, pseudo-labelling, machine translation, and text alignment.
Feature Transfer: This leverages cross-lingual word embeddings and contextualized representations to maintain a shared feature space across languages. Retrofitting and the integration of additional features also fall within this category.
Parameter Transfer: This level encompasses the transfer of model parameters or behaviors across languages. The paper breaks parameter transfer down into zero-shot, joint, and cascade learning scenarios, accompanied by hybrid strategies such as ensemble learning and meta-learning.
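
As a rough illustration of the instance-transfer level, pseudo-labelling can be sketched as below. This is a minimal sketch under stated assumptions, not the survey's method: the function names, the 0.9 confidence threshold, and the toy lexicon-based scorer standing in for a model trained on a source language are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical toy lexicon; a real system would use a trained classifier.
TOY_LEXICON = {"idiot", "idiota"}


def toy_source_model(text: str) -> float:
    """Stand-in for a source-language model: returns P(offensive)."""
    tokens = text.lower().split()
    if any(token in TOY_LEXICON for token in tokens):
        return 0.95  # confident "offensive"
    if len(tokens) < 2:
        return 0.5   # too short to judge, so stay uncertain
    return 0.05      # confident "not offensive"


def pseudo_label(
    unlabeled: List[str],
    predict_proba: Callable[[str], float],
    threshold: float = 0.9,
) -> List[Tuple[str, int]]:
    """Keep only the target-language texts the model labels confidently."""
    selected = []
    for text in unlabeled:
        p_offensive = predict_proba(text)
        if p_offensive >= threshold:
            selected.append((text, 1))
        elif p_offensive <= 1.0 - threshold:
            selected.append((text, 0))
        # Uncertain predictions are dropped rather than added as noisy labels.
    return selected


# Unlabelled target-language (here Spanish) texts.
target_texts = ["eres un idiota", "buenos días a todos", "hola"]
pseudo_labeled = pseudo_label(target_texts, toy_source_model)
print(pseudo_labeled)
```

The confidently labelled pairs can then be added to the training data for the target language. The threshold trades coverage against label noise: a higher value keeps fewer but cleaner instances.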

Challenges and Future Prospects

The challenges identified fall into three groups: linguistic structure, dataset limitations, and methodological hurdles. They include the adaptability problems posed by language-specific nuances, the shortage of labelled datasets, inconsistent definitions and annotations, dataset imbalance, and the limited generalization capabilities of CLTL models. Future research directions point towards creating balanced and comprehensive datasets, improving annotation strategies, integrating additional language-agnostic features, optimizing multilingual pre-trained language models (PLMs), and experimenting with advanced training strategies such as meta-learning and adversarial training.
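
To make the imbalance problem concrete, the sketch below computes inverse-frequency class weights, a common generic mitigation. It assumes that "imbalance" refers to class imbalance (few offensive examples relative to non-offensive ones). The survey summary does not prescribe this remedy, and the function name and the 90/10 label split are invented for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical dataset: 90 non-offensive (0) and 10 offensive (1) examples.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

With such weights, errors on the rarer class contribute more to a weighted loss, which offsets its lower frequency in the training data.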

This survey points to a need for a continued focus on CLTL in offensive language detection to bridge the language resource gap and enable robust moderation systems in multilingual online environments. It emphasizes the combined utility of multilingual datasets, cross-lingual resources, and innovative learning strategies while highlighting complexities that arise from cultural specificity and rapid linguistic evolution in digital communication.