
Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi (2109.03552v1)

Published 8 Sep 2021 in cs.CL, cs.AI, cs.LG, cs.NE, and cs.SI

Abstract: The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

Authors (4)
  1. Saurabh Gaikwad (2 papers)
  2. Tharindu Ranasinghe (52 papers)
  3. Marcos Zampieri (94 papers)
  4. Christopher M. Homan (22 papers)
Citations (62)