
HinFlair: pre-trained contextual string embeddings for pos tagging and text classification in the Hindi language (2101.06949v1)

Published 18 Jan 2021 in cs.CL

Abstract: Recent advancements in LLMs based on recurrent neural networks and transformer architectures have achieved state-of-the-art results on a wide range of natural language processing tasks such as POS tagging, named entity recognition, and text classification. However, most of these LLMs are pre-trained on high-resource languages such as English, German, and Spanish. Multilingual LLMs include Indian languages such as Hindi, Telugu, and Bengali in their training corpora, but they often fail to represent the linguistic features of these languages because these are not the primary languages of study. We introduce HinFlair, a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus. Experiments were conducted on six text classification datasets and a Hindi dependency treebank to analyze the performance of these contextualized string embeddings for the Hindi language. Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings on downstream tasks such as text classification and POS tagging. In addition, HinFlair combined with FastText embeddings outperforms many transformer-based LLMs trained specifically for the Hindi language.

Authors (1)
  1. Harsh Patel (17 papers)
Citations (1)