Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
80 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer (2008.01354v1)

Published 4 Aug 2020 in cs.CL

Abstract: This paper describes our approach to the task of identifying offensive languages in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset resulted in performance improvements compared to the baseline trained solely with the manually-annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored for social media text along with methods to fine-tune the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Hwijeen Ahn (5 papers)
  2. Jimin Sun (9 papers)
  3. Chan Young Park (20 papers)
  4. Jungyun Seo (2 papers)
Citations (24)