IndicLLMsuite: Empowering Indic LLMs with Rich Resources
Introduction
The rapid growth of research and development in LLMs has primarily benefited English, owing to its abundance of resources. In contrast, languages of the Indian subcontinent, spoken by over 1.4 billion people, lag behind due to the dearth of comparable datasets and tailored resources. This research introduces IndicLLMsuite, a comprehensive suite of tools, datasets, and resources for the 22 constitutionally recognized Indian languages, aimed at bridging this gap. With 251B tokens for pre-training and 74.7M instruction-response pairs for fine-tuning, the suite is a significant step towards equitable AI advancement across languages.
Sangraha: A Multifaceted Pre-training Dataset
Sangraha combines manually verified data, unverified data, and synthetic data, totaling 251B tokens. The data is drawn from diverse sources, including web content, PDFs, and videos. A notable feature of Sangraha is its emphasis on quality through human verification, complemented by synthetic data that broadens coverage. The result is a corpus that is not only large but also varied in content type and high in quality.
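For readers who want to inspect the corpus directly, the sketch below shows one way to stream a small slice of Sangraha with the Hugging Face datasets library. It assumes the dataset is published on the Hub under an id like "ai4bharat/sangraha"; the subset path and field name are illustrative placeholders, so consult the dataset card for the exact configuration names.

```python
# Minimal sketch: stream a few Sangraha documents for inspection.
# The repo id, data_dir, and "text" field are assumptions for illustration.
from datasets import load_dataset

sangraha_verified_hi = load_dataset(
    "ai4bharat/sangraha",      # Hub repo id (assumed)
    data_dir="verified/hin",   # hypothetical path for verified Hindi data
    split="train",
    streaming=True,            # stream to avoid downloading the full corpus
)

for i, doc in enumerate(sangraha_verified_hi):
    print(doc["text"][:200])   # "text" is the assumed document field name
    if i == 2:
        break
```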
Setu: A Robust Curation Pipeline
The curation of Sangraha is handled by Setu, a Spark-based distributed pipeline customized for Indian languages. The pipeline covers the critical stages of data processing: extraction, cleaning, flagging, and deduplication. Together, these stages sanitize and refine the raw data, making Sangraha a reliable source for training robust LLMs.
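To make the flow concrete, here is a minimal, hypothetical sketch of one Setu-style stage, exact deduplication by content hash, written in plain PySpark. The field names and storage paths are placeholders, and the real pipeline does considerably more (extraction, language-aware cleaning, flagging, and fuzzy deduplication).

```python
# Illustrative Setu-style stage: exact document deduplication with PySpark.
# Input path, output path, and the "text" field are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("setu-style-dedup-sketch").getOrCreate()

# Hypothetical input: one JSON document per line with at least a "text" field.
docs = spark.read.json("s3://my-bucket/raw_crawl/*.jsonl")

deduped = (
    docs
    .filter(F.length("text") > 200)                      # drop very short documents
    .withColumn("doc_hash", F.sha2(F.col("text"), 256))  # content hash for exact dedup
    .dropDuplicates(["doc_hash"])                        # keep one copy per hash
    .drop("doc_hash")
)

deduped.write.mode("overwrite").parquet("s3://my-bucket/deduped/")
```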
IndicAlign: Enriching Instruction Fine-Tuning Data
IndicAlign, the instruction fine-tuning component of IndicLLMsuite, offers a wide array of prompt-response pairs across 20 languages. It aggregates existing datasets, translates English datasets, and uses both human and synthetic generation to create context-grounded conversations. This mix yields culturally and contextually relevant data for comprehensive instruction tuning.
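As a rough illustration of the translation-based portion of this process, the toy sketch below converts an English prompt-response pair into a target Indic language. The translate helper is a stand-in for whatever machine-translation system is used (for example, an IndicTrans2-style model) and is not a real API.

```python
# Toy sketch of building a translated instruction-response pair.
# `translate` is a hypothetical placeholder, not an actual library call.
from dataclasses import dataclass

@dataclass
class InstructionPair:
    prompt: str
    response: str
    language: str

def translate(text: str, tgt_lang: str) -> str:
    """Placeholder for a machine-translation call; plug in your own MT system."""
    raise NotImplementedError

def to_indic(pair: InstructionPair, tgt_lang: str) -> InstructionPair:
    """Translate an English prompt-response pair into the target language."""
    return InstructionPair(
        prompt=translate(pair.prompt, tgt_lang),
        response=translate(pair.response, tgt_lang),
        language=tgt_lang,
    )

english = InstructionPair("Summarize the paragraph below.", "Here is a summary...", "eng")
# hindi = to_indic(english, "hin")  # would yield a Hindi prompt-response pair
```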
Theoretical and Practical Implications
Theoretically, this work demonstrates the viability of synthetic data generation for supporting low-resource languages. Practically, the release of IndicLLMsuite paves the way for further research and development of LLMs in Indian languages, and it serves as a blueprint for extending similar efforts to other languages, advocating a more equitable, global approach to AI development.
Future Directions
This research invites collaboration for training high-quality Indian language LLMs through community-driven initiatives. By pooling resources, the AI community can achieve significant milestones in developing models that are not only linguistically inclusive but also culturally nuanced.
IndicLLMsuite represents a pivotal step towards closing the linguistic divide in AI, supporting the growth of LLMs across Indian languages. By embracing linguistic diversity and inclusivity, it fosters developments that serve a far broader share of the global population.