
Pretraining Data and Tokenizer for Indic LLM (2407.12481v1)

Published 17 Jul 2024 in cs.CL

Abstract: We present a novel approach to data preparation for developing a multilingual Indic LLM. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data and optimizing tokenization for our multilingual dataset, targeting Indic LLMs with 3B and 7B parameters engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy and demonstrate that our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.
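The abstract's headline metric is the token-to-word ratio (often called tokenizer fertility): fewer tokens per word means Indic text is encoded more compactly. The sketch below shows one way such a ratio can be measured, assuming whitespace-delimited words and using OpenAI's Tiktoken cl100k_base encoding as the baseline the abstract mentions; the sample sentences and the idea of a custom tokenizer path are illustrative placeholders, not artifacts from the paper.

```python
# Minimal sketch (not the paper's released code): measuring the
# token-to-word ratio of a tokenizer on a small Indic sample.
import tiktoken


def token_to_word_ratio(encode, texts):
    """Average number of tokens produced per whitespace-delimited word."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


# Illustrative sample sentences (Hindi and Kannada); a real evaluation
# would use a held-out corpus per Indic language.
samples = [
    "भारत एक विशाल और विविधतापूर्ण देश है।",
    "ಭಾರತ ಒಂದು ವೈವಿಧ್ಯಮಯ ದೇಶ.",
]

# Baseline: OpenAI Tiktoken (cl100k_base), as referenced in the abstract.
baseline = tiktoken.get_encoding("cl100k_base")
print("tiktoken cl100k_base:", token_to_word_ratio(baseline.encode, samples))

# A custom BPE/SentencePiece Indic tokenizer would be scored the same way,
# e.g. loaded via Hugging Face AutoTokenizer.from_pretrained(<placeholder>)
# and passed through token_to_word_ratio; that path is hypothetical here.
```

A lower ratio on the same sample indicates the tokenizer splits Indic words into fewer pieces, which translates into shorter sequences and cheaper training and inference for the same text.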

Authors (7)
  1. Rahul Kumar (169 papers)
  2. Shubham Kakde (1 paper)
  3. Divyansh Rajput (1 paper)
  4. Daud Ibrahim (1 paper)
  5. Rishabh Nahata (1 paper)
  6. Pidathala Sowjanya (1 paper)
  7. Deepak Kumar (104 papers)
Citations (1)