IndiText Boost: Text Augmentation for Low Resource India Languages (2401.13085v1)
Abstract: Text augmentation is an important task for low-resource languages because it helps address data scarcity. Over the years, much work has been done on data augmentation for English. In contrast, very little has been done for Indian languages, even though augmentation exists precisely to deal with data scarcity. In this work, we implement Easy Data Augmentation, Back Translation, Paraphrasing, Text Generation using LLMs, and Text Expansion using LLMs for text classification in several languages. We focus on 6 Indian languages: Sindhi, Marathi, Hindi, Gujarati, Telugu, and Sanskrit. To the best of our knowledge, no such work exists for text augmentation on Indian languages. We carry out binary as well as multi-class text classification to make our results more comparable. Surprisingly, basic data augmentation techniques outperform the LLM-based ones.
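To make the "basic" end of that comparison concrete, here is a minimal sketch of two of the four Easy Data Augmentation operations described by Wei and Zou (2019): random swap and random deletion. These two need no thesaurus, which is why they carry over to languages without synonym resources. This is an illustration, not the authors' implementation. The function names, the whitespace tokenization, the alternating schedule, and the default `alpha` value are all assumptions made for the example.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability p; never return an empty list."""
    words = list(words)
    if len(words) <= 1:
        return words
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(sentence, n_aug=4, alpha=0.1, seed=0):
    """Return n_aug augmented variants, alternating swap and deletion.

    Assumption: tokens are separated by whitespace, which is only a rough
    approximation for some scripts and languages.
    """
    rng = random.Random(seed)
    words = sentence.split()
    n_swaps = max(1, int(alpha * len(words)))
    variants = []
    for k in range(n_aug):
        if k % 2 == 0:
            new_words = random_swap(words, n_swaps, rng)
        else:
            new_words = random_deletion(words, alpha, rng)
        variants.append(" ".join(new_words))
    return variants


if __name__ == "__main__":
    for variant in augment("यह फिल्म बहुत अच्छी है", n_aug=4, alpha=0.2, seed=1):
        print(variant)
```

Each variant keeps the label of the sentence it was derived from, so the augmented set can be appended directly to the training data of a classifier. The other two EDA operations, synonym replacement and random insertion, would additionally require a synonym source for the target language.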
Authors: Onkar Litake, Niraj Yagnik, Shreyas Labhsetwar