IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages (2403.06350v2)

Published 11 Mar 2024 in cs.CL

Abstract: Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.

IndicLLMSuite: Empowering Indic LLMs with Rich Resources

Introduction

The monumental growth of research and development in LLMs has primarily benefited English, owing to the abundance of resources available for it. In contrast, the languages of the Indian subcontinent, spoken by over 1.4 billion people, lag behind due to the dearth of comparable datasets and tailored resources. This research introduces IndicLLMSuite, a comprehensive suite aimed at bridging this gap by providing tools, datasets, and resources tailor-made for the 22 constitutionally recognized Indian languages. With a total of 251B tokens for pre-training and 74.8M instruction-response pairs for fine-tuning, the suite is a significant step towards equitable AI advancements across languages.

Sangraha: A Multifaceted Pre-training Dataset

Sangraha is distinguished by its unique composition of manually verified data, unverified data, and synthetic data, aggregating a total of 251B tokens. The dataset comprises diverse sources including web content, PDFs, and videos. A notable feature of Sangraha is its emphasis on quality through human verification, alongside leveraging synthetic data to enhance dataset diversity. This approach offers a balanced representation of different content types, ensuring that the dataset is not only vast but also rich in quality and variety.
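
To get a feel for how such a corpus might be consumed, the snippet below streams a small slice of Sangraha with the Hugging Face `datasets` library. This is a minimal sketch: the repository id, the subset layout, and the field names are assumptions made for illustration, not identifiers confirmed by the paper.

```python
# Minimal sketch: stream a small slice of the Sangraha corpus with the
# Hugging Face `datasets` library. The repository id, data_dir layout, and
# the "text" field name are assumptions for illustration, not identifiers
# confirmed by the paper.
from datasets import load_dataset

sangraha = load_dataset(
    "ai4bharat/sangraha",      # assumed Hub id of the released corpus
    data_dir="verified/hin",   # assumed layout: <component>/<language code>
    split="train",
    streaming=True,            # avoid materialising the full 251B-token corpus
)

for i, doc in enumerate(sangraha):
    print(doc["text"][:200])   # assumed document-text field
    if i == 2:                 # peek at the first three documents only
        break
```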

Setu: A Robust Curation Pipeline

The curation of Sangraha is facilitated by Setu, a Spark-based distributed pipeline customized for Indian languages. This pipeline addresses several critical steps in data processing, including extraction, cleaning, flagging, and deduplication. Setu's comprehensive architecture ensures the sanitization and refinement of data, making Sangraha a reliable source for training robust LLMs.
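
The sketch below illustrates, in PySpark, the kind of length filtering, quality flagging, and exact deduplication such a pipeline performs. It is a simplified stand-in for the stages described above, not the actual Setu implementation; the thresholds and column names are assumptions.

```python
# Minimal PySpark sketch of a Setu-style curation pass: length filtering,
# a crude repetition flag, and exact deduplication by content hash. This
# mirrors the stages described above, not the actual Setu code; thresholds
# and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("setu-sketch").getOrCreate()

docs = spark.read.json("raw_docs.jsonl")           # expects a "text" column

cleaned = (
    docs
    .withColumn("text", F.trim(F.col("text")))
    .withColumn("n_chars", F.length("text"))
    .filter(F.col("n_chars") > 200)                # drop very short documents
    .withColumn(                                   # flag highly repetitive text
        "flag_repetitive",
        F.size(F.array_distinct(F.split("text", " ")))
        / F.size(F.split("text", " ")) < 0.3,
    )
    .withColumn("doc_hash", F.sha2("text", 256))
    .dropDuplicates(["doc_hash"])                  # exact dedup across shards
)

cleaned.write.mode("overwrite").parquet("curated_docs/")
```

A real pipeline would add language identification, boilerplate removal, and fuzzy (MinHash-style) deduplication on top of this skeleton.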

IndicAlign: Enriching Instruction Fine-Tuning Data

IndicAlign, part of IndicLLMSuite, offers a wide array of prompt-response pairs across 20 languages. It merges existing datasets, translates English datasets, and employs both human and synthetic generation methods to create context-grounded conversations. This diversity enriches the suite with culturally and contextually relevant datasets, aiding in comprehensive model training.
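
As an illustration of the synthetic, context-grounded portion of IndicAlign, the sketch below prompts an instruction-tuned chat model to produce a question-answer pair grounded in a supplied article. The prompt wording and the helper function are hypothetical; the paper uses LLaMa2 and Mixtral with its own prompt templates and target languages.

```python
# Hedged sketch of IndicAlign-style grounded generation: ask an aligned chat
# model for an instruction-response pair grounded in a given article. The
# prompt text and helper below are illustrative only, not the paper's
# templates.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # one of the model families named in the paper
    device_map="auto",
)

def grounded_pair(article_text: str, language: str = "Hindi") -> str:
    """Generate one question-answer pair grounded only in the article."""
    prompt = (
        f"Read the following {language} article and write one question a "
        f"reader might ask, followed by a faithful answer based only on the "
        f"article.\n\nArticle:\n{article_text}\n\nQuestion and answer:"
    )
    out = generator(prompt, max_new_tokens=256, do_sample=True)
    return out[0]["generated_text"]

print(grounded_pair("ताज महल आगरा में स्थित एक विश्व धरोहर स्मारक है ..."))
```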

Theoretical and Practical Implications

Theoretically, this work demonstrates the viability of synthetic data generation for supporting low-resource languages. Practically, the release of IndicLLMSuite paves the way for advanced research and development of LLMs in Indian languages. It serves as a blueprint for extending similar efforts to other languages, advocating for a global approach toward equitable AI development.

Future Directions

This research invites collaboration for training high-quality Indian language LLMs through community-driven initiatives. By pooling resources, the AI community can achieve significant milestones in developing models that are not only linguistically inclusive but also culturally nuanced.

IndicLLMSuite represents a pivotal step towards closing the linguistic divide in AI, supporting the growth of LLMs across Indian languages. It encourages diversity and inclusivity in the field, fostering developments that resonate with a broader spectrum of the global population.

Authors (12)
  1. Mohammed Safi Ur Rahman Khan (8 papers)
  2. Priyam Mehta (2 papers)
  3. Ananth Sankar (5 papers)
  4. Umashankar Kumaravelan (2 papers)
  5. Sumanth Doddapaneni (16 papers)
  6. Varun Balan G (1 paper)
  7. Sparsh Jain (7 papers)
  8. Anoop Kunchukuttan (45 papers)
  9. Pratyush Kumar (44 papers)
  10. Raj Dabre (65 papers)
  11. Mitesh M. Khapra (79 papers)
  12. Suriyaprasaad B (1 paper)
Citations (11)