On Importance of Code-Mixed Embeddings for Hate Speech Identification (2411.18577v1)

Published 27 Nov 2024 in cs.CL and cs.LG

Abstract: Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, face challenges when dealing with code-mixed data. Extracting meaningful information from sentences containing multiple languages becomes difficult, particularly in tasks like hate speech detection, due to linguistic variation, cultural nuances, and data sparsity. To address this, we aim to analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech detection. Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform BERT models when tested on hate speech text datasets. We also found that code-mixed Hing-FastText performs better than standard English FastText and vanilla BERT models.

Authors (5)
  1. Shruti Jagdale (2 papers)
  2. Omkar Khade (2 papers)
  3. Gauri Takalikar (2 papers)
  4. Mihir Inamdar (1 paper)
  5. Raviraj Joshi (76 papers)
