L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages (2401.02254v2)
Abstract: In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We center our work on 10 prominent Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to different document lengths: the Short Headlines Classification (SHC) dataset, containing the news headline and news category; the Long Document Classification (LDC) dataset, containing the whole news article and the news category; and the Long Paragraph Classification (LPC) dataset, containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets to enable in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models, including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research significantly expands the pool of available text classification datasets and makes it possible to develop topic classification models for Indian regional languages. Owing to the high overlap of labels among languages, it also serves as an excellent resource for cross-lingual analysis. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp.
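The consistent label scheme across the SHC, LPC, and LDC splits is what enables the length-based and cross-lingual analyses described in the abstract. A minimal sketch in Python illustrating this shared schema (the records, texts, and labels below are invented for illustration and are not the dataset's actual rows or official field layout):

```python
# Hypothetical examples illustrating the shared schema of the three splits.
# The real data is available at https://github.com/l3cube-pune/indic-nlp;
# each record pairs a text of a different length with a news-category label.
shc = [  # Short Headlines Classification: headline + category
    ("Headline about an election result", "politics"),
    ("Headline about a cricket match", "sports"),
]
lpc = [  # Long Paragraph Classification: sub-article paragraph + category
    ("A paragraph-length sub-article about the election ...", "politics"),
]
ldc = [  # Long Document Classification: full article + category
    ("The complete news article about the election ...", "politics"),
]

def labels(split):
    """Collect the set of category labels used in a split."""
    return {label for _, label in split}

# Consistent labeling: every split draws from the same category inventory,
# so a classifier trained on headlines can be evaluated on paragraphs or
# full articles without remapping labels.
assert labels(lpc) <= labels(shc) and labels(ldc) <= labels(shc)
print(sorted(labels(shc)))  # → ['politics', 'sports']
```

Because the label sets also overlap heavily across the 10 languages, the same alignment idea extends across languages, which is what makes the corpus useful for cross-lingual experiments.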
- Vārta: A large-scale headline-generation dataset for Indic languages. arXiv preprint arXiv:2305.05858.
- Gaurav Arora. 2020. iNLTK: Natural language toolkit for Indic languages. arXiv preprint arXiv:2009.12534.
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT. arXiv preprint arXiv:2304.11434.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- L3Cube-MahaSBERT and HindSBERT: Sentence BERT models and benchmarking BERT sentence representations for Hindi and Marathi. In Science and Information Conference, pages 1184–1199. Springer.
- Raviraj Joshi. 2022a. L3Cube-HindBERT and DevBERT: Pre-trained BERT transformer models for Devanagari based Hindi and Marathi languages. arXiv preprint arXiv:2211.11418.
- Raviraj Joshi. 2022b. L3Cube-MahaCorpus and MahaBERT: Marathi monolingual corpus, Marathi BERT language models, and resources. In LREC 2022 Workshop: Language Resources and Evaluation Conference, 20–25 June 2022, page 97.
- IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961.
- MuRIL: Multilingual representations for Indian languages. arXiv preprint arXiv:2103.10730.
- Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille.
- Experimental evaluation of deep learning models for Marathi text classification. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2021, pages 605–613. Springer.
- AI4Bharat-IndicNLP corpus: Monolingual corpora and word embeddings for Indic languages. arXiv preprint arXiv:2005.00085.
- Sandeep Sricharan Mukku and Radhika Mamidi. 2017. ACTSA: Annotated corpus for Telugu sentiment analysis. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pages 54–58.
- Harshali B. Patil and Ajay S. Patil. 2017. MarS: A rule-based stemmer for morphologically rich language Marathi. In 2017 International Conference on Computer, Communications and Electronics (Comptelix), pages 580–584. IEEE.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
- Aishwarya Mirashi
- Srushti Sonavane
- Purva Lingayat
- Tejas Padhiyar
- Raviraj Joshi