
Language Identification of Hindi-English tweets using code-mixed BERT (2107.01202v1)

Published 2 Jul 2021 in cs.CL and cs.LG

Abstract: Language identification of social media text has been an interesting problem of study in recent years. In non-English-speaking regions, social media messages are predominantly code-mixed. Prior knowledge from pre-trained contextual embeddings has shown state-of-the-art results on a range of downstream tasks. Recently, models such as BERT have shown that pre-training language models on large amounts of unlabeled data yields representations that are even more beneficial for learning common language features. This paper presents extensive experiments exploiting transfer learning and fine-tuning of BERT models to identify language on Twitter. The work uses a Hindi-English-Urdu code-mixed corpus for language-model pre-training and a Hindi-English code-mixed corpus for the subsequent word-level language classification task. The results show that representations pre-trained on code-mixed data outperform their monolingual counterparts.
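The paper frames word-level language identification as a token-classification task on top of BERT. One practical detail such a setup requires (not spelled out in the abstract) is aligning the word-level language labels to BERT's subword tokens, typically by labeling only the first subword of each word and masking continuation pieces out of the loss. The sketch below illustrates that alignment step with a toy stand-in tokenizer; the function names and the 4-character split rule are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch of word-to-subword label alignment for token-classification
# fine-tuning. IGNORE_INDEX marks continuation subwords so the loss
# only counts the first piece of each word (a common convention).
IGNORE_INDEX = -100

def toy_subword_tokenize(word):
    """Stand-in for a WordPiece tokenizer: words longer than 4 chars
    are split into 4-char pieces, continuations prefixed with '##'."""
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def align_labels(words, labels):
    """Expand per-word language tags to per-subword labels."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = toy_subword_tokenize(word)
        tokens.extend(pieces)
        token_labels.append(label)                           # first subword keeps the tag
        token_labels.extend([IGNORE_INDEX] * (len(pieces) - 1))  # rest are masked
    return tokens, token_labels

# Hindi-English code-mixed example with word-level language tags
words = ["yaar", "tomorrow", "exam", "hai"]
labels = ["hi", "en", "en", "hi"]
tokens, token_labels = align_labels(words, labels)
# tokens:       ['yaar', 'tomo', '##rrow', 'exam', 'hai']
# token_labels: ['hi', 'en', -100, 'en', 'hi']
```

With a real tokenizer (e.g. one trained on the Hindi-English-Urdu corpus the paper pre-trains on), the same alignment logic applies; only the subword splitting changes.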

Authors (5)
  1. Mohd Zeeshan Ansari (6 papers)
  2. M M Sufyan Beg (8 papers)
  3. Tanvir Ahmad (17 papers)
  4. Mohd Jazib Khan (1 paper)
  5. Ghazali Wasim (1 paper)
Citations (14)