Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges

Published 17 Jan 2024 in cs.CL | (2401.09244v1)

Abstract: The growing prevalence and rapid evolution of offensive language in social media amplify the complexities of detection, particularly highlighting the challenges in identifying such content across diverse languages. This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning (CLTL) techniques in offensive language detection in social media. Our study stands as the first holistic overview to focus exclusively on the cross-lingual scenario in this domain. We analyse 67 relevant papers and categorise these studies across various dimensions, including the characteristics of multilingual datasets used, the cross-lingual resources employed, and the specific CLTL strategies implemented. According to "what to transfer", we also summarise three main CLTL transfer approaches: instance, feature, and parameter transfer. Additionally, we shed light on the current challenges and future research opportunities in this field. Furthermore, we have made our survey resources available online, including two comprehensive tables that provide accessible references to the multilingual datasets and CLTL methods used in the reviewed literature.


Authors (2)

References (200)

Citations (2)


Summary

The paper offers a systematic review of 67 studies, evaluating datasets, transfer techniques, and challenges in detecting offensive language across languages.
It categorizes transfer methods into instance, feature, and parameter levels, highlighting strategies like machine translation and zero-shot learning.
The survey identifies key challenges such as data scarcity, linguistic nuances, and annotation inconsistencies, suggesting paths for future research.

Introduction

Cross-Lingual Transfer Learning (CLTL) is an evolving subfield within the domain of offensive language detection on social media platforms. The challenge in this area is amplified by the need to identify offensive content that can vary significantly with linguistic nuances and cultural contexts. CLTL strategies are crucial in mitigating data scarcity issues encountered in low-resource languages. This survey inspects the techniques of cross-lingual detection by examining 67 studies, dissecting them based on datasets leveraged, resources applied, and the dimensions of transfer—instance, feature, and parameter.

Existing Datasets and Cross-Lingual Resources

Multilingual datasets serve as the foundation for cross-lingual studies, but they are often limited by factors such as data scarcity, linguistic diversity, and annotation challenges. The review catalogues 82 multilingual datasets and compares their topics, source platforms, language families, sizes, availability, and label typologies. The datasets are predominantly in Indo-European languages, followed by Semitic languages such as Arabic. In addition to existing datasets, cross-lingual resources such as multilingual lexicons, parallel corpora, and machine translation tools are critical to aligning linguistic features and facilitating CLTL.

Transfer Approaches in CLTL

The study categorizes CLTL approaches into three main levels:

Instance Transfer: This focuses on transferring data elements, such as text or labels, across languages, using techniques like annotation projection, pseudo-labelling, machine translation, and text alignment.
Feature Transfer: This leverages cross-lingual word embeddings and contextualized representations to maintain a shared feature space across languages. Retrofitting and the integration of additional features also fall within this category.
Parameter Transfer: This level encompasses the transfer of model parameters or behaviors across languages. The paper breaks down parameter transfer into zero-shot, joint, and cascade learning scenarios accompanied by hybrid strategies like ensemble and meta-learning.
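To make the instance level concrete, here is a minimal sketch of label projection through translation. It is an illustration only, not the method of any reviewed paper: the word dictionary TOY_EN_ES and the helpers translate and project_instances are hypothetical stand-ins, and a real pipeline would call a machine translation system instead of a word-by-word lookup.

```python
# Toy stand-in for a machine translation system (assumption): a tiny
# English -> Spanish word dictionary. Real pipelines would use an MT model.
TOY_EN_ES = {
    "you": "tú",
    "are": "eres",
    "stupid": "estúpido",
    "very": "muy",
    "kind": "amable",
}


def translate(text, dictionary):
    """Word-by-word 'translation'; unknown words are kept unchanged."""
    return " ".join(dictionary.get(tok, tok) for tok in text.lower().split())


def project_instances(labelled_source, dictionary):
    """Instance transfer: translate each source text and reuse its label."""
    return [(translate(text, dictionary), label) for text, label in labelled_source]


# Labelled source-language examples: 1 = offensive, 0 = not offensive.
source_data = [("You are stupid", 1), ("You are very kind", 0)]
target_data = project_instances(source_data, TOY_EN_ES)
print(target_data)
```

The projected pairs can then serve as training data in the target language. The label is copied unchanged, which is exactly where translation errors can introduce label noise.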

Challenges and Future Prospects

The challenges identified fall into three groups: linguistic structure, dataset limitations, and methodological hurdles. They include the difficulty of adapting to language-specific nuances, the scarcity of labelled datasets, inconsistent definitions and annotations, imbalance in datasets, and the limited generalization capabilities of CLTL models. Future research directions include creating balanced and comprehensive datasets, improving annotation strategies, integrating additional language-agnostic features, optimizing multilingual PLMs, and experimenting with advanced training strategies such as meta-learning and adversarial training.

This survey points to a need for a continued focus on CLTL in offensive language detection to bridge the language resource gap and enable robust moderation systems in multilingual online environments. It emphasizes the combined utility of multilingual datasets, cross-lingual resources, and innovative learning strategies while highlighting complexities that arise from cultural specificity and rapid linguistic evolution in digital communication.


Explain it Like I'm 14

Cross‑lingual Offensive Language Detection: A Teen‑Friendly Guide

What is this paper about?

This paper is a big “map of the field” that looks at how computers can spot rude, harmful, or hateful messages on social media in many different languages—not just English. It reviews 67 research papers to explain what data people use, how they move knowledge from one language to another, what tools help, and what problems still need solving.

What questions were the authors trying to answer?

The authors set out to answer, in simple terms:

What datasets and languages are used to detect offensive or hateful posts across the world?
How can we “transfer” what machines learn in one language (like English) to help in another language (like Hindi or Arabic), especially when there isn’t much labeled data?
Which strategies work best for cross‑language transfer, and what are their pros and cons?
What are the biggest challenges (like slang, culture, and translation errors), and where should future research go?

How did they study it?

The authors did a systematic review—think of it like carefully collecting and organizing all the studies on this topic to see the big picture.

They searched scholarly databases (like Google Scholar and the ACL Anthology) using many keyword combinations about hate/offense and multi‑language learning.
They followed a structured process (called PRISMA) to include only relevant studies and ended up with 67 papers to analyze.
They summarized:
- The datasets used (82 in total), including which languages and platforms they came from.
- The tools and resources that help cross languages (like translation systems, dictionaries, and multilingual models).
- The main transfer learning approaches.

Everyday analogy: Imagine you’re learning to bake cookies. You read lots of recipes from different countries, compare ingredients (datasets), cooking tools (resources like translation), and techniques (transfer approaches). Then you write a guide that tells others what works best when baking in different kitchens (languages).

Key ideas explained simply

Before the findings, here are a few important terms:

Cross‑lingual: Using what we learned in one language to help in another. Like learning math in English and then solving problems in Spanish because the core ideas transfer.
Multilingual: Involving many languages at the same time.
Code‑mixing: When people mix languages in the same sentence (e.g., “Vamos to the park!”).
Transfer learning: Teaching a model a skill once and reusing that skill in a new setting, with little extra training.
Pre‑trained language models (PLMs): Very large language “brains” (like mBERT or XLM‑R) that have read lots of text and can be fine‑tuned for tasks like detecting hateful content.

The three main ways knowledge is transferred across languages

The paper groups cross‑language strategies into three easy‑to‑grasp types:

Instance transfer (data-level)
- You move examples across languages—like translating English tweets into Spanish so a Spanish model can learn from them.
- Helpful when a target language has almost no labeled examples.
Feature transfer (representation-level)
- You teach the model to see different languages in a shared “feature space,” so similar meanings end up close together.
- Analogy: A color wheel for languages—“red” in English and “rojo” in Spanish point to the same spot on the wheel.
Parameter transfer (model-level)
- You reuse parts of a trained model (its “settings” or “weights”) in a new language.
- Analogy: If you’ve learned to ride a bike, switching to a similar bike takes little adjustment; you keep your balance skills (model parameters) and just fine‑tune.

These approaches can be mixed—for example, using translated data (instance transfer) with a multilingual model (parameter transfer).
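The "shared feature space" idea can be sketched in a few lines. The 2-D vectors below are hand-made for illustration (an assumption, not real embeddings), and nearest_label is a hypothetical helper. The point is only that a Spanish word can be labelled using English examples when translations sit close together in the same space.

```python
import math

# Hypothetical, hand-made 2-D vectors standing in for aligned multilingual
# embeddings: a word and its translation are placed close together.
EMBEDDINGS = {
    "idiot": (0.90, 0.10),
    "idiota": (0.88, 0.12),
    "friend": (0.10, 0.90),
    "amigo": (0.12, 0.91),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_label(word, labelled_words):
    """Give a word the label of its most similar labelled neighbour."""
    best = max(
        labelled_words,
        key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]),
    )
    return labelled_words[best]


# Only English words carry labels; Spanish words are labelled via the space.
english_labels = {"idiot": "offensive", "friend": "not offensive"}
print(nearest_label("idiota", english_labels))
print(nearest_label("amigo", english_labels))
```

Real systems work with sentence or contextual embeddings in hundreds of dimensions, but the nearest-neighbour logic is the same.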

What did they find, and why does it matter?

Here are the key takeaways, kept simple:

A growing and global field:
- Research on cross‑lingual detection has grown fast since 2018.
- The review covers 67 papers and 82 datasets across 30+ languages.
Data often comes from social media:
- Twitter is the most common source (almost half of all datasets), followed by YouTube and Facebook.
- Many datasets focus on “offensive language” or “hate speech,” but some target specific issues like sexism or racism.
English dominates, but many languages are under‑served:
- English is heavily represented; many other languages (especially low‑resource ones) have far less labeled data.
- That’s why cross‑lingual transfer is so important: it helps build systems for languages with little data.
Labels vary:
- Most datasets use simple labels like “offensive” vs. “not offensive,” but some use finer categories (target of attack, severity levels).
- This variety makes it harder to combine datasets across languages.
Useful tools and resources:
- Multilingual dictionaries and parallel texts (the same sentence in two languages) help align meaning.
- Translation tools (e.g., Google Translate, DeepL) help create training data—but they can mess up slang, sarcasm, or cultural slurs.
- Multilingual models (like mBERT and XLM‑R) are powerful for parameter transfer.
Challenges to watch out for:
- Translation errors and cultural differences: A phrase that’s clearly hateful in one language might be mild or meaningless in another.
- Code‑mixing: Mixed‑language posts are harder to analyze.
- Bias: If training data is unbalanced, models can unfairly flag certain groups or dialects.
- Small datasets: Many datasets are tiny, which makes it tough to train reliable models—especially for low‑resource languages.
Practical contribution:
- The authors created online tables listing datasets and methods to help other researchers find and compare resources quickly.

Why it matters: Better cross‑lingual methods mean we can help keep online spaces safer in many languages—not just where there’s lots of data. That’s important for global platforms like YouTube, Instagram, and TikTok.

What’s the potential impact?

Fairer and wider coverage: Cross‑lingual transfer can bring strong moderation tools to languages that are often overlooked, helping protect more people online.
Faster responses to new trends: Harmful language evolves quickly. Transfer methods can adapt patterns learned in one language to others more rapidly.
Better tools and standards: By comparing methods and datasets, researchers can build fairer, more reliable systems and set better guidelines.
Future directions: The field needs better translations for slang/offense, more code‑mixing support, more balanced datasets across cultures, and careful attention to bias and ethics.

Short recap

Purpose: Review how machines learn to detect offensive/hateful language across different languages.
Method: A careful survey of 67 papers using a structured process.
Big idea: Use what works in one language to help another (three main transfer types: instance, feature, parameter).
Main findings: Lots of English data, heavy Twitter use, growing interest since 2018, strong multilingual models, and real challenges with culture, translation, and bias.
Impact: Safer global social media through better, fairer, and more widely usable detection tools.


Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored based on the survey’s own scope, evidence, and synthesis.

Language coverage skew
- Over-reliance on Indo-European (especially English) and Arabic; scarce work on African, Southeast Asian, and Indigenous languages and dialects.
- Limited study of morphologically rich languages and non-Latin scripts (e.g., Amharic, Hausa, Thai), and of how script and morphology affect cross-lingual transfer.
Dataset scarcity and imbalance
- Few large, high-quality, labeled datasets for low-resource languages; most datasets are 10^3–10^5 in size and platform-limited.
- Minimal availability of intensity/degree labels and fine-grained target annotations across languages; label ontologies are inconsistent.
Platform and domain diversity
- Heavy concentration on Twitter; limited evaluation across platforms (e.g., TikTok, WhatsApp, Telegram, VK, local forums) and community norms.
- Insufficient analysis of domain shift across platforms and genres in cross-lingual settings.
Code-mixing and transliteration
- Sparse and non-standardized evaluation of code-mixed text normalization and transliteration pipelines across language pairs and scripts.
- Lack of benchmarks and error analyses specifically designed for code-switching phenomena.
Translation-dependent pipelines
- Inadequate assessment of translation errors, cultural nuance loss, and label noise introduced by MT-based data augmentation and label projection.
- No standard protocols for quality-control of machine translation/transliteration in offensive language contexts (e.g., slur preservation tests).
Label schema alignment and ontology mapping
- No widely adopted cross-lingual ontology to reconcile binary vs. multi-label, target-vs.-type taxonomies across datasets and cultures.
- Unclear best practices for mapping heterogeneous labels to a shared cross-lingual schema without erasing culturally specific phenomena.
Evaluation standardization and reproducibility
- Heterogeneous train/dev/test splits, metrics, and sampling strategies impede fair cross-paper comparisons or meta-analysis.
- Limited release of preprocessing scripts, training configurations, seeds, and model checkpoints; some datasets remain inaccessible.
Benchmark design
- Absence of a community benchmark suite that jointly covers low-resource languages, code-mixing, multiple platforms, and unified label ontologies.
- Few longitudinal or temporal benchmarks to study drift and evolving slurs across languages.
Transfer strategy comparability
- Limited head-to-head comparisons of instance-, feature-, and parameter-transfer approaches under controlled settings (same data/metrics).
- Lack of guidance on when to prefer each transfer level, and how to combine them effectively.
Zero-/few-shot protocols
- No standardized zero-, few-shot, and active-learning protocols to quantify data efficiency across language pairs and families.
- Sparse study of negative transfer: when and why cross-lingual transfer harms performance, and mitigation strategies.
Multilingual representation gaps
- Under-explored tokenization and subword choices for agglutinative/compounding languages and their impact on hate speech detection.
- Limited comparison of word-level vs. sentence-level multilingual representations tailored to this task.
Pre-trained models and LLMs
- Little systematic evaluation of instruction-tuned or large multilingual models (post-2023) for cross-lingual hate/offense detection.
- Unclear best practices for adapter-based, prompt-based, or parameter-efficient tuning across diverse languages.
Cultural and legal contextualization
- Insufficient modeling of culture-specific pragmatics, norms, and legal definitions of offense/hate; models often rely on English-centric cues.
- Lack of region-specific calibration and thresholding to align with local moderation policies and legal frameworks.
Bias, fairness, and harms
- Sparse cross-lingual fairness evaluations (e.g., group-wise error rates by identity terms, dialects); no standardized fairness metrics for multilingual settings.
- Limited techniques to debias multilingual embeddings/models while preserving cross-lingual performance.
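One way to picture a group-wise error-rate check is the sketch below. It is an assumption-laden illustration: the group names, the toy records, and the helper false_positive_rate_by_group are all hypothetical, and the false positive rate is just one of several possible fairness measures.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, gold, pred); 1 = offensive, 0 = not.

    Returns, per group, the share of non-offensive items that were flagged.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, gold, pred in records:
        if gold == 0:
            negatives[group] += 1
            if pred == 1:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items()}


# Hypothetical predictions for two dialect groups.
records = [
    ("dialect_a", 0, 0), ("dialect_a", 0, 0), ("dialect_a", 0, 1), ("dialect_a", 1, 1),
    ("dialect_b", 0, 1), ("dialect_b", 0, 1), ("dialect_b", 0, 0), ("dialect_b", 1, 1),
]
rates = false_positive_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap between groups suggests that harmless posts from one group are flagged more often than those from another, which is the kind of disparity a multilingual fairness audit would need to surface per language.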
Implicit abuse and pragmatics
- Insufficient treatment of implicit hate, sarcasm, coded language, and euphemisms that vary across cultures and languages.
- Few datasets or methods capturing context beyond a single message (e.g., conversation threads) in cross-lingual evaluation.
Robustness and adversarial resilience
- Limited cross-lingual robustness tests for obfuscation (leetspeak), misspellings, adversarial paraphrases, and script-switching.
- No standardized stress tests based on perturbations of multilingual offensive language.
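A stress test of this kind could look roughly like the following sketch. Everything here is a toy assumption: the substitution table LEET, the three-word LEXICON, and the keyword detector stand in for a real perturbation suite and a real classifier.

```python
# Hypothetical character substitutions for a leetspeak perturbation.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}
# Toy lexicon standing in for a real detector (assumption).
LEXICON = {"idiot", "loser", "punk"}


def to_leet(text):
    """Replace selected lowercase letters with look-alike digits."""
    return "".join(LEET.get(ch, ch) for ch in text)


def keyword_detector(text):
    """Flag a text if any whitespace-separated token is in the lexicon."""
    return any(tok in LEXICON for tok in text.lower().split())


def miss_rate_after_perturbation(texts, detector, perturb):
    """Share of originally flagged texts that are missed once perturbed."""
    flagged = [t for t in texts if detector(t)]
    if not flagged:
        return 0.0
    missed = sum(1 for t in flagged if not detector(perturb(t)))
    return missed / len(flagged)


texts = ["you idiot", "what a loser", "you punk", "have a nice day"]
rate = miss_rate_after_perturbation(texts, keyword_detector, to_leet)
print(to_leet("you idiot"), rate)
```

Running the same measurement per language and per perturbation type (misspellings, script-switching) would give a comparable robustness score across languages.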
Continual and lifelong learning
- Little work on continual learning to track emerging slurs, memes, and shifting norms across languages without catastrophic forgetting.
- Few mechanisms for cross-lingual model updates that respect privacy and data-governance constraints.
Data governance and ethics
- Minimal discussion of privacy, consent, and data sovereignty in cross-border dataset sharing and annotation.
- Limited guidance on ethical deployment and localized human-in-the-loop moderation for cross-lingual systems.
Explainability and human factors
- Scarce research on cross-lingual explainability, user-facing rationales, and how explanations translate culturally across languages.
- Limited study of how annotator background and cultural context affect labels and model evaluation.
Language-agnostic signals
- Underuse of shared, language-agnostic signals beyond emojis/punctuation (e.g., network features, user histories, temporal patterns) for cross-lingual transfer.
- Lack of ablations showing the marginal utility of language-agnostic features across languages.
Cost-effective annotation
- Few investigations into cross-lingual active learning, weak supervision, or data programming to reduce labeling cost in low-resource languages.
- Under-explored crowdworker training and quality-control protocols tailored to culturally sensitive annotations.
Calibration and threshold transfer
- Little attention to multilingual probability calibration and how thresholds generalize (or not) across languages and prevalence rates.
- No standard practice for per-language calibration to reduce over-/under-flagging.
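As a rough illustration of why a single threshold may not transfer, the sketch below picks, for each language, the decision threshold that maximizes F1 on a small development set. The scores, labels, language codes, and candidate grid are invented for the example, and threshold tuning is only one simple alternative to full probability calibration.

```python
def f1_at(threshold, scores, golds):
    """F1 of the offensive class when flagging scores >= threshold."""
    tp = sum(1 for s, g in zip(scores, golds) if s >= threshold and g == 1)
    fp = sum(1 for s, g in zip(scores, golds) if s >= threshold and g == 0)
    fn = sum(1 for s, g in zip(scores, golds) if s < threshold and g == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, golds, candidates):
    """Return the first candidate threshold with the highest dev-set F1."""
    return max(candidates, key=lambda t: f1_at(t, scores, golds))


# Hypothetical dev-set scores: the model is less confident on "hi".
dev = {
    "en": ([0.90, 0.80, 0.45, 0.20], [1, 1, 0, 0]),
    "hi": ([0.45, 0.40, 0.20, 0.10], [1, 1, 0, 0]),
}
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
thresholds = {
    lang: best_threshold(scores, golds, candidates)
    for lang, (scores, golds) in dev.items()
}
print(thresholds)
```

In this toy data, reusing the English threshold for the second language would miss every offensive example there, which is the under-flagging risk the item above describes.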
Error analysis by linguistic typology
- Limited systematic analyses connecting error patterns to linguistic features (e.g., morphology, word order, honorifics) or language distance.
- Few studies leveraging typological databases (e.g., WALS) to predict or mitigate transfer gaps.
Downstream impact
- Lack of user-centric and societal impact studies evaluating false positives/negatives’ consequences across cultural contexts.
- Limited collaboration with moderators and local communities to validate model decisions and reduce harm.


Practical Applications

Immediate Applications

The following applications can be deployed now by leveraging the survey’s consolidated resources (datasets, tools, and methods) and the three transfer approaches (instance, feature, parameter transfer) it catalogues.

Multilingual content moderation for social platforms (industry: software, social media, gaming, live streaming)
- What: Rapidly expand hate/offense detection to under-supported languages using zero-shot/few-shot fine-tuning of multilingual PLMs (e.g., mBERT, XLM-R), plus code-mixed preprocessing (transliteration) for chats.
- How: Parameter transfer (fine-tune XLM-R on source language, adapt to target with few labels); instance transfer (MT/back-translation to augment scarce target data); feature transfer (LASER/LaBSE for semantic similarity).
- Tools/workflows: Moderation API with language-agnostic scoring; triage dashboards with severity and target-type labels; human-in-the-loop review queues.
- Assumptions/dependencies: Access to platform data and label taxonomies; calibration for per-language thresholds; bias audits; reliable MT/transliteration for slang and dialects; legal compliance per jurisdiction.
Brand safety and ad placement filtering across languages (industry: advertising, marketing tech)
- What: Flag/avoid toxic user-generated content near ads in multilingual markets.
- How: Parameter transfer for cross-lingual classifiers; feature transfer to cluster and expand risk terms; lexicon-assisted explainability (HurtLex).
- Tools/workflows: Real-time page-level toxicity scoring; campaign-level safety rules by market; explanation snippets highlighting offensive spans.
- Assumptions: Coverage of local dialects; acceptable false-positive rate for brand safety; publisher integration.
Customer support and trust & safety triage (industry: enterprise SaaS, customer experience)
- What: Filter abusive tickets/emails/chats and route for escalation; protect agents in global contact centers.
- How: Zero-shot/few-shot fine-tuning (parameter transfer); instance transfer for synthetic data augmentation; code-mixed normalization.
- Tools/workflows: Severity tagging, auto-responses, agent shield mode (redaction/blur).
- Assumptions: Data privacy, PII handling; domain adaptation for formal vs informal registers.
Community moderation bots for multilingual spaces (industry/NGO: Discord/Telegram/Forums; daily life: creators, volunteer mods)
- What: Drop-in bots that detect hate/abuse across languages and code-mixed text.
- How: Lightweight XLM-R distilled variants; transliteration preprocessor; lexicon-backed rules for transparent moderation.
- Tools/workflows: Rule + ML hybrid pipelines; configurable thresholds; appeal logging.
- Assumptions: Compute and latency constraints; community norms and local policy alignment.
Rapid dataset bootstrapping in low-resource languages (academia/industry: data science, annotation vendors)
- What: Jumpstart labeled corpora using transfer and active learning to minimize annotation cost.
- How: Instance transfer (MT of labeled source data; back-translation); parameter transfer (few-shot fine-tuning to seed high-uncertainty sampling); feature transfer for semantic sampling (LASER).
- Tools/workflows: Annotation studio with model-in-the-loop suggestions; code-mix-aware guidelines; quality control metrics.
- Assumptions: Access to bilingual annotators; MT quality checks to avoid label shift; annotation budget.
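The "high-uncertainty sampling" step can be sketched as follows. The pool of (id, probability) pairs and the helper select_for_annotation are hypothetical; the probabilities would come from whatever few-shot model seeds the process.

```python
def select_for_annotation(pool, k):
    """Pick the k items whose predicted probability is closest to 0.5.

    pool: list of (text_id, p_offensive) pairs from a seed model.
    """
    ranked = sorted(pool, key=lambda item: abs(item[1] - 0.5))
    return [text_id for text_id, _ in ranked[:k]]


# Hypothetical model outputs on unlabelled target-language posts.
pool = [("t1", 0.97), ("t2", 0.52), ("t3", 0.08), ("t4", 0.41), ("t5", 0.66)]
print(select_for_annotation(pool, 2))
```

Sending only the most uncertain posts to annotators is one common way to stretch a small labelling budget; the confident predictions are left for later rounds.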
Vendor/model benchmarking and procurement due diligence (industry/policy/academia)
- What: Evaluate moderation vendors’ cross-lingual performance using the survey’s dataset index and shared-task corpora.
- How: Assemble multilingual evaluation harness; report macro/micro-F1, robustness to code-mix, dialect shift, and translation artifacts.
- Tools/workflows: Continuous evaluation dashboard; red-teaming with lexicons and adversarial slang; bias and subgroup analyses (demographics/targets).
- Assumptions: License/access to datasets; legal/ethical approvals; standardized label taxonomies.
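A minimal evaluation harness for the macro/micro-F1 reporting mentioned above might look like this sketch. The language code and the gold/predicted labels are invented, and f1_scores is a hypothetical helper written with plain Python rather than any particular evaluation library.

```python
def f1_scores(golds, preds, labels=(0, 1)):
    """Return (macro_f1, micro_f1) for single-label classification."""
    per_class = []
    tp_all = fp_all = fn_all = 0
    for c in labels:
        tp = sum(1 for g, p in zip(golds, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(golds, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(golds, preds) if g == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_class) / len(per_class)  # unweighted mean over classes
    denom_all = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / denom_all if denom_all else 0.0  # pooled counts
    return macro, micro


# Hypothetical, imbalanced test set for one language (1 = offensive).
test_sets = {
    "es": ([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]),
}
report = {lang: f1_scores(g, p) for lang, (g, p) in test_sets.items()}
print(report)
```

On imbalanced data the micro score stays high while the macro score drops, because the missed offensive example weighs more heavily when classes are averaged equally. Reporting both per language makes that difference visible.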
Search, retrieval, and analytics for abusive content (industry/policy: risk intelligence, brand monitoring)
- What: Multilingual toxicity search and clustering to detect campaigns or trends.
- How: Feature transfer via sentence embeddings (LASER/LaBSE) for cross-lingual semantic search; lexicon seeding for targeted themes (e.g., sexism, racism).
- Tools/workflows: Topic clustering, timeline spikes, geolinguistic maps; alerting.
- Assumptions: Access to public streams/APIs; careful handling of domain drift.
Browser extensions/parental controls to mask harmful content (daily life)
- What: Client-side blur/redact of offensive text across languages on social sites and forums.
- How: Distilled multilingual PLMs; lexicon + ML hybrid for low-latency; on-device inference where feasible.
- Tools/workflows: User controls for sensitivity; reasons/explanations; allowlists.
- Assumptions: Performance on consumer hardware; false-positive UX impact.
Multilingual lexicon expansion and quality control pipelines (academia/industry)
- What: Maintain hate/offense lexicons across languages and dialects.
- How: Instance transfer (translation/back-translation); feature transfer (nearest-neighbor in multilingual embeddings) to propose candidates; human vetting.
- Tools/workflows: Versioned lexicon repositories; change logs; severity ratings.
- Assumptions: Cultural nuance review; avoiding overbroad terms.
Elections and crisis monitoring across languages (policy/NGO/industry)
- What: Track surges in hateful narratives in multiple languages for early intervention.
- How: Cross-lingual classifiers; code-mix handling; target-type attribution (e.g., migrants, religion).
- Tools/workflows: Risk dashboards; escalation playbooks to platforms/partners.
- Assumptions: Data access and consent; legal guardrails; mitigation partnerships.
Teaching and reproducibility support (academia)
- What: Course modules and labs on cross-lingual transfer using the survey’s GitHub tables (datasets + CLTL methods).
- How: Replicate parameter-transfer baselines; compare instance vs feature transfer on new languages.
- Tools/workflows: Starter notebooks; standardized evaluation sheets.
- Assumptions: Dataset licensing; compute availability.

Long-Term Applications

The following applications are promising but typically require further research, scaling, or infrastructure development to be reliable and equitable at global scale.

Culture-aware severity calibration and harmonized taxonomies (industry/policy/academia)
- What: Align labels and severity across languages with culturally grounded guidelines to reduce over-/under-blocking.
- How: Cross-lingual annotation frameworks; hierarchical labels; adaptive thresholds per locale.
- Dependencies: Large, diverse, expert-annotated multilingual corpora; participatory design with local communities.
Robust code-mixed and dialectal coverage (industry/academia)
- What: High-accuracy detection in heavy code-switching and regional dialects (incl. transliteration variance).
- How: Unsupervised/adversarial alignment; continual learning from live streams; subword/phonetic modeling.
- Dependencies: Streaming data infrastructure; privacy-preserving learning; reliable evaluation sets.
Multimodal, cross-lingual moderation (industry)
- What: Jointly reason over text + images/memes/videos across languages.
- How: Vision-language models aligned with multilingual text encoders; contrastive training with meme datasets.
- Dependencies: Multimodal, multilingual datasets with fine-grained labels; safety-aligned training.
Privacy-preserving, federated CLTL (industry/policy)
- What: Train/update cross-lingual moderation models without centralizing sensitive data.
- How: Federated learning; differential privacy; secure aggregation across regions.
- Dependencies: Platform coordination; regulatory alignment; robustness vs. privacy trade-offs.
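The aggregation step of federated learning can be sketched as an example-weighted average of client parameters, in the style of federated averaging. The two-element weight vectors and client sizes below are toy values, and the sketch deliberately leaves out differential privacy and secure aggregation, which a real deployment would add.

```python
def federated_average(client_updates):
    """Example-weighted mean of client weight vectors.

    client_updates: list of (weights, num_examples) pairs, where weights
    is a list of floats of the same length for every client.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total
        for i in range(dim)
    ]


# Hypothetical updates from two regions; raw text never leaves the client.
updates = [([1.0, 0.0], 100), ([0.0, 1.0], 300)]
global_weights = federated_average(updates)
print(global_weights)
```

Only the parameter vectors and example counts are shared with the server in this scheme, so the region with more data pulls the global model further toward its own update.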
Model governance and regulatory audits across languages (policy/industry)
- What: Standardized, third-party audits of moderation quality, fairness, and errors in each language/market.
- How: Shared multilingual benchmarks; impact assessments; public reporting templates.
- Dependencies: Consensus on metrics; access to representative test data; legal frameworks.
On-device or edge cross-lingual moderation (industry/daily life)
- What: Real-time, offline filtering in messaging apps or parental controls.
- How: Distillation, quantization, and sparse architectures for multilingual PLMs.
- Dependencies: Efficient model research; hardware capabilities; update channels for new slang.
Early-warning systems for social conflict and targeted harassment (policy/NGO)
- What: Spatiotemporal monitoring of escalating hate narratives across languages.
- How: Cross-lingual event detection + toxicity scoring; graph-based campaign detection.
- Dependencies: Data-sharing agreements; false-positive risk management; response protocols.
Continual learning pipelines for evolving slang and evasion (industry/academia)
- What: Keep models current against obfuscation and new slurs across languages.
- How: Self-training, weak supervision, and human-in-the-loop refresh cycles.
- Dependencies: Drift detection; annotation ops; safe deployment thresholds.
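One round of self-training can be sketched as below. The score table TOY_SCORES stands in for a real model's predicted probabilities, and both the helper pseudo_label and the 0.9 confidence threshold are assumptions chosen for illustration.

```python
def pseudo_label(unlabelled, predict_proba, threshold=0.9):
    """Split texts into confidently pseudo-labelled pairs and the rest.

    predict_proba(text) must return the probability of the offensive class.
    """
    confident, remaining = [], []
    for text in unlabelled:
        p = predict_proba(text)
        if p >= threshold:
            confident.append((text, 1))  # confidently offensive
        elif p <= 1 - threshold:
            confident.append((text, 0))  # confidently not offensive
        else:
            remaining.append(text)  # left for human review
    return confident, remaining


# Hypothetical model scores for three new posts.
TOY_SCORES = {"post_a": 0.97, "post_b": 0.55, "post_c": 0.03}
confident, remaining = pseudo_label(list(TOY_SCORES), TOY_SCORES.get)
print(confident, remaining)
```

The confident pairs would be added to the training set for the next refresh cycle, while the uncertain posts are natural candidates for the human-in-the-loop queue.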
Cross-lingual explainability and user-facing transparency (industry/policy)
- What: Provide localized, understandable rationales (e.g., highlighted spans, category/target) to users and moderators.
- How: Rationale extraction tied to multilingual lexicons; counterfactual explanations via translation-aware perturbations.
- Dependencies: Reliable rationale methods; UI/UX research; misinterpretation safeguards.
Open, community-maintained multilingual corpus and lexicon hub (academia/NGO/industry)
- What: A sustained, centralized, versioned repository building on the survey’s index, covering more languages and domains.
- How: Data contribution standards; quality governance; ethical licensing.
- Dependencies: Funding and stewardship; contributor incentives; legal vetting.
Cross-lingual counter-speech and assistance tools (industry/NGO)
- What: Generate or suggest context-aware, de-escalatory responses in users’ languages.
- How: Safety-aligned generative models with multilingual grounding; toxicity-aware retrieval augmentation.
- Dependencies: Guardrails to prevent misuse; evaluation of efficacy and harms; cultural tailoring.


Glossary

Auto-encoder architecture: A neural model trained to reconstruct inputs via a compressed latent representation, used to learn language-agnostic features. "auto-encoder architecture"
Code-mixed datasets: Corpora in which multiple languages are intermixed within the same content, often at token or sentence level. "code-mixed datasets"
Code-mixing: The practice of alternating between languages within a single utterance or text. "code-mixed content"
Cross-lingual word embeddings: Word vector spaces aligned across languages so that semantically similar words are close regardless of language. "Cross-lingual word embeddings are also created by utilising these multilingual resources."
Cross-Lingual Transfer Learning (CLTL): Techniques that transfer knowledge from a resource-rich source language to improve performance in a resource-scarce target language. "Cross-Lingual Transfer Learning (CLTL) emerges as a promising direction"
Feature Transfer: A transfer approach that shares or aligns representational features across languages. "Feature Transfer: Linguistic knowledge is shared or transferred"
Few-shot learning: Adapting a model to a target language using only a small number of labeled examples. "few shots of $D_t$"
Fine-tuning: Further training a pre-trained model on task- or language-specific data to adapt it to a target setting. "fine-tune $f_t$"
HurtLex: A multilingual lexicon of offensive and hurtful terms used to support abusive language detection. "HurtLex."
Instance Transfer: A transfer approach that moves data instances (e.g., texts or labels) across languages to aid training. "Instance Transfer: Instances are transferred on the data level"
Language Agnostic BERT Sentence Embeddings (LaBSE): Sentence-level multilingual embeddings designed to be language-agnostic. "Language Agnostic BERT Sentence Embeddings (LabSE)"
Language Agnostic Sentence Representations (LASER): Multilingual sentence embeddings aimed at producing language-agnostic representations. "Language Agnostic Sentence Representations (LASER)"
Language-agnostic resources: Signals (e.g., emojis, punctuation) not tied to a specific language that can support cross-lingual modeling. "Language-agnostic resources"
Low-resource languages: Languages with limited labeled data or computational resources for NLP tasks. "low-resource languages"
Machine transliteration tools: Systems that convert text from one script to another while preserving pronunciation. "machine transliteration tools"
Machine translation tools: Systems that automatically translate text between languages. "Machine translation tools"
mBART: A multilingual sequence-to-sequence pre-trained model often used for machine translation and generation. "mBART"
mT5: A multilingual text-to-text Transformer pre-trained model for a variety of sequence tasks. "mT5"
Multilingual BERT (mBERT): A version of BERT pre-trained on many languages, enabling cross-lingual transfer. "Multilingual BERT (mBERT)"
Multilingual contextualised representations: Context-sensitive sentence-level embeddings aligned across languages. "Multilingual contextualised representations"
Multilingual distributional representations: Word-level embeddings trained across languages that capture distributional semantics in a shared space. "Multilingual distributional representations"
Multilingual Lexicons (word-aligned): Dictionaries mapping words or phrases across languages for direct translation or equivalence. "Multilingual Lexicons (word-aligned)"
Multilingual PLMs: Pre-trained language models trained on multiple languages that enable zero-/few-shot cross-lingual transfer. "Multilingual PLMs"
Multilingual representations: Language-independent vector representations shared across languages to facilitate transfer. "Multilingual representations, as language independent representations"
Parallel Corpora (sentence-aligned): Collections of texts with aligned translations across languages at the sentence level. "Parallel Corpora (sentence-aligned)"
Parameter Transfer: A transfer approach that reuses model parameters learned in one language for another. "Parameter Transfer: Parameter values are transferred"
Pre-trained Language Models (PLMs): Models trained on large corpora to provide general-purpose language representations for downstream tasks. "Pre-trained Language Models (PLMs)"
SOTA: An abbreviation for state-of-the-art, referring to the best-performing techniques at a given time. "a SOTA technique in NLP"
XLM-RoBERTa (XLM-R): A strong multilingual Transformer model pre-trained on large-scale multilingual data. "XLM-RoBERTa (XLM-R)"
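
To make the glossary entries for Instance Transfer and Multilingual Lexicons (word-aligned) more concrete, the following is a minimal sketch, not a method taken from the surveyed papers. It assumes a tiny hypothetical English-to-Spanish lexicon (`EN_TO_ES`), three invented labelled examples (`SOURCE_DATA`), and a deliberately naive keyword classifier. It shows labelled source-language instances being projected word by word into the target language and then used as training data there.

```python
from collections import Counter

# Hypothetical word-aligned lexicon (English -> Spanish), for illustration only.
# A real system might rely on a bilingual dictionary or a resource such as HurtLex.
EN_TO_ES = {
    "you": "tú",
    "are": "eres",
    "an": "un",
    "a": "un",
    "idiot": "idiota",
    "stupid": "estúpido",
    "good": "buen",
    "friend": "amigo",
}

# Invented labelled source-language data (1 = offensive, 0 = not offensive).
SOURCE_DATA = [
    ("you are an idiot", 1),
    ("you are stupid", 1),
    ("you are a good friend", 0),
]


def project_instance(text, lexicon):
    """Project a text word by word; words missing from the lexicon are kept as-is."""
    return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())


def project_dataset(data, lexicon):
    """Transfer instances to the target language while keeping their labels."""
    return [(project_instance(text, lexicon), label) for text, label in data]


def train_keyword_model(data):
    """Return the tokens that occur only in offensive training texts."""
    offensive, neutral = Counter(), Counter()
    for text, label in data:
        (offensive if label == 1 else neutral).update(text.split())
    return {tok for tok in offensive if tok not in neutral}


def predict(text, offensive_tokens):
    """Label a text as offensive (1) if it contains any offensive-only token."""
    return int(any(tok in offensive_tokens for tok in text.lower().split()))


# Instance transfer: build target-language training data from source instances.
target_train = project_dataset(SOURCE_DATA, EN_TO_ES)
model = train_keyword_model(target_train)

print(target_train)
print(predict("eres un idiota", model), predict("eres un buen amigo", model))
```

Word-by-word projection ignores word order, morphology and cultural context, which is one reason the survey also covers machine translation tools and multilingual representations as alternative cross-lingual resources.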

GitHub

aggiejiang/crosslingual-offensive-language-survey: Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges
