
Part-of-Speech Tagger for Bodo Language using Deep Learning approach (2401.03175v1)

Published 6 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Language processing systems such as part-of-speech (POS) tagging, named entity recognition, machine translation, speech recognition, and language modeling (LM) are well studied in high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language models play a vital role in the downstream tasks of modern NLP, and extensive studies have been carried out on LMs for high-resource languages. Languages such as Bodo, Rabha, and Mising, however, continue to lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this work is the first effort to develop a language model for Bodo. Secondly, we present an ensemble deep learning (DL) based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with CRF and a stacked embedding of BodoBERT with BytePairEmbeddings. We cover several language models in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.
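
The "stacked embedding" mentioned in the abstract can be read as per-token concatenation of vectors from several embedding sources. The sketch below illustrates only that idea in plain Python. It is not the authors' implementation: the lookup tables, dimensions, token strings, and the names `make_lookup_embedder` and `stack_embeddings` are hypothetical stand-ins, and a real system would take the vectors from BodoBERT and byte-pair embeddings and feed the result to a BiLSTM-CRF tagger.

```python
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[str], Vector]


def make_lookup_embedder(table: Dict[str, Vector], dim: int) -> Embedder:
    """Build a toy embedder: known tokens map to their vector, unknown tokens to zeros."""
    def embed(token: str) -> Vector:
        return list(table.get(token, [0.0] * dim))
    return embed


def stack_embeddings(tokens: List[str], embedders: List[Embedder]) -> List[Vector]:
    """Concatenate the outputs of every embedder for each token, in order."""
    stacked: List[Vector] = []
    for token in tokens:
        vector: Vector = []
        for embed in embedders:
            vector.extend(embed(token))
        stacked.append(vector)
    return stacked


# Hypothetical stand-ins for a contextual model and a subword embedding.
contextual = make_lookup_embedder({"tok1": [0.1, 0.2, 0.3]}, dim=3)
subword = make_lookup_embedder({"tok1": [0.5, 0.6]}, dim=2)

# "tok2" is unknown to both toy tables, so it receives an all-zero vector.
features = stack_embeddings(["tok1", "tok2"], [contextual, subword])
```

Each token ends up with a single vector whose length is the sum of the individual embedding dimensions (here 3 + 2 = 5), which is the input a sequence tagger would consume.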

Authors (4)
  1. Dhrubajyoti Pathak (3 papers)
  2. Sanjib Narzary (2 papers)
  3. Sukumar Nandi (13 papers)
  4. Bidisha Som (2 papers)