iNLTK: Natural Language Toolkit for Indic Languages
The paper discusses iNLTK, an open-source NLP library that addresses significant challenges in natural language processing for Indic languages. The toolkit bundles pre-trained models with out-of-the-box support for essential NLP tasks such as Data Augmentation, Textual Similarity, Sentence and Word Embeddings, Tokenization, and Text Generation in 13 languages.
Core Contributions
The development of iNLTK directly addresses the scarcity of pre-trained deep language models for Indic languages, which are spoken by a very large population but for which labeled data and computational resources are often scarce. The library offers:
- Pre-trained Deep Language Models: iNLTK gives downstream tasks a head start through transfer learning, working around the shortage of labeled data for Indic languages.
- Comprehensive NLP Support: Tasks like Text Classification, Textual Similarity, Sentence Embeddings, and Data Augmentation are available out of the box (see the quickstart sketch below), reducing entry barriers for linguists and developers working with these languages.
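A minimal quickstart is sketched below, using function names and the `hi` language code as documented in the iNLTK API; the example sentences are arbitrary and the outputs are illustrative:

```python
# Minimal iNLTK quickstart. setup() downloads a language's pretrained
# model and tokenizer, and needs to run only the first time that
# language is used.
from inltk.inltk import setup, predict_next_words, get_sentence_similarity

setup('hi')  # one-time download of the Hindi model

# Text Generation: continue a prompt with 5 predicted words.
generated = predict_next_words('भारत एक विशाल', 5, 'hi')

# Textual Similarity: score two sentences against each other.
score = get_sentence_similarity('मुझे अपने देश से प्यार है',
                                'मैं अपने देश से प्रेम करता हूँ', 'hi')

print(generated)
print(score)
```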
The library supports 13 languages: Hindi, Bengali, Gujarati, Malayalam, Marathi, Tamil, Punjabi, Kannada, Oriya, Sanskrit, Nepali, Urdu, and English. This expansive coverage is fundamental given the diversity of languages in India and surrounding regions.
Technical Implementation
iNLTK employs the ULMFiT and TransformerXL architectures to train language models from scratch on monolingual corpora sourced from Wikipedia. Notably, the resulting models substantially outperform existing baselines, such as FastText and IndicNLP embeddings, on text classification tasks.
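For a concrete sense of how the pre-trained models are consumed downstream, the sketch below uses the library's documented `get_embedding_vectors` call; the sentence is arbitrary:

```python
# Subword embeddings from the pretrained language model: these are the
# vectors the paper compares against FastText and IndicNLP embeddings
# on classification benchmarks.
from inltk.inltk import get_embedding_vectors

vectors = get_embedding_vectors('भारत एक विशाल देश है', 'hi')
print(len(vectors), len(vectors[0]))  # one vector per subword token
```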
Tokenization: The paper details the creation of subword vocabularies using a SentencePiece tokenization model. SentencePiece tokenization is reversible: the original text can be reconstructed exactly from its subword pieces, which is crucial for generating valid text from the language models. Reported vocabulary sizes vary by language, reflecting how much Wikipedia data is available for each corpus.
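The reversibility property can be checked directly: SentencePiece keeps word boundaries as the `▁` marker inside the pieces, so joining the pieces recovers the original string. A sketch using iNLTK's documented `tokenize` call (the sentence is arbitrary, and the decoding step follows SentencePiece's standard convention):

```python
# Reversible subword tokenization: word boundaries survive as the '▁'
# marker, so decode(encode(x)) == x.
from inltk.inltk import tokenize

sentence = 'भारत एक विशाल देश है'
pieces = tokenize(sentence, 'hi')  # e.g. ['▁भारत', '▁एक', ...] (illustrative)

# Standard SentencePiece decoding: concatenate, turn '▁' back into spaces.
restored = ''.join(pieces).replace('▁', ' ').strip()
assert restored == sentence
```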
Language Model Training: The models were trained using PyTorch and fastai, with TransformerXL achieving lower (better) perplexity across languages, indicating its robustness for language modeling.
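For reference, perplexity is the exponential of the average next-token cross-entropy, so lower is better. A self-contained PyTorch sketch with stand-in tensors:

```python
# Perplexity = exp(mean cross-entropy over predicted tokens).
# The tensors here are random stand-ins for a language model's output.
import torch
import torch.nn.functional as F

vocab_size = 30000
logits = torch.randn(4, 10, vocab_size)          # (batch, seq_len, vocab)
targets = torch.randint(0, vocab_size, (4, 10))  # next-token ids

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
perplexity = torch.exp(loss)
print(f'perplexity: {perplexity.item():.1f}')    # very large for random logits
```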
Results and Evaluation
The paper presents strong numerical results: iNLTK models significantly outperform prior models on text classification across multiple datasets, reaching, for example, 90.71% accuracy on the Bengali Soham Articles dataset. Moreover, the paper shows that iNLTK's data augmentation preserves performance as training data shrinks: using less than 10% of the training data plus augmentation, models retain over 95% of the accuracy achieved with the full dataset (see the sketch below).
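The augmentation entry point is the library's documented `get_similar_sentences` call, which generates paraphrase-like variants of an input sentence; the sentence and variant count below are arbitrary:

```python
# Data Augmentation: generate variants of a sentence using the
# pretrained models, to enlarge a small labeled training set.
from inltk.inltk import get_similar_sentences

variants = get_similar_sentences('मैं आज बहुत खुश हूँ', 5, 'hi')
for v in variants:
    print(v)
```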
Practical Implications and Future Research
The immediate implication of iNLTK is a lowered barrier to entry for applied research and product development in low-resource settings. The pre-trained models can be fine-tuned further (sketched below), extending their applicability across domains where Indic languages are prevalent.
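As a sketch of what such fine-tuning looks like, the ULMFiT-style recipe below uses the fastai v1 text API that iNLTK builds on; the CSV layout and all file names (`train.csv`, `lm_weights`, `itos`, `ft_enc`) are hypothetical placeholders:

```python
# ULMFiT-style fine-tuning sketch (fastai v1 API). All file names and
# the CSV layout are hypothetical.
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# 1) Fine-tune the pretrained language model on in-domain text.
data_lm = TextLMDataBunch.from_csv('.', 'train.csv', text_cols='text')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3,
                            pretrained_fnames=['lm_weights', 'itos'])
lm.fit_one_cycle(1, 1e-2)
lm.save_encoder('ft_enc')  # keep the fine-tuned encoder

# 2) Train a classifier on top of the fine-tuned encoder.
data_clas = TextClasDataBunch.from_csv('.', 'train.csv',
                                       text_cols='text', label_cols='label',
                                       vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('ft_enc')
clf.fit_one_cycle(1, 1e-2)
```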
Work is underway to expand coverage to other Indic languages, including code-mixed variants, and to include models such as BERT. Furthermore, the authors intend to address potential biases within the pre-trained models, advocating for equitable NLP solutions.
Conclusion
iNLTK represents a substantial contribution to NLP for Indic languages, offering robust tools that meet the technical and practical demands of linguistic research and application. Its development path indicates a clear commitment to refining and expanding the capabilities of NLP tools for a linguistically diverse audience. The paper provides valuable insights and a practical framework for advancing natural language processing in a region that includes a significant portion of the world’s population.