
Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese (2403.13638v2)

Published 20 Mar 2024 in cs.CL

Abstract: In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding and generative tasks is only 3.56% poorer on NLU tasks and 1.51% on NLG tasks than LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pretraining on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for LLMs.


Authors

Meet Doshi
Raj Dabre
Pushpak Bhattacharyya

Summary

Leveraging Translationese for Pre-training LLMs in Low-Resource Languages

Introduction to Synthetic Data in LLM Training

State-of-the-art language models (LMs) rely on vast amounts of monolingual data, which presents a significant challenge for low-resource languages. LMs for languages with abundant data, such as English, have traditionally far outperformed LMs for languages with scarcer resources. This disparity raises the question of how to train LMs effectively for underrepresented languages. One approach is to use synthetic data, specifically "translationese", created by machine-translating text from high-resource languages. The paper assesses the viability and performance implications of pre-training LMs on translationese, focusing on English and Indic languages.

Methodological Approach to Synthetic Data Utilization

The procedure for synthetic data utilization includes several stages:

Data Generation: Web-crawled monolingual documents in English are translated into target Indic languages using state-of-the-art machine translation models. The output is referred to as translationese, or synthetic data.
LM Training: Two sets of LMs, with 28M and 85M parameters, are trained on this synthetic data.
TinyLMs for Data Filtering: To address the variance in quality of synthetic data, tiny language models (TinyLMs) pre-trained on high-quality clean data are used to filter the synthetic data based on perplexity scores. This step improves the quality of the data used to train the larger models.
Performance Evaluation: LMs trained on synthetic and filtered synthetic data are compared against LMs trained on clean data across a range of natural language understanding (NLU) and natural language generation (NLG) tasks.
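
The filtering stage can be illustrated with a minimal sketch. It is not the authors' implementation: a toy add-one-smoothed unigram model stands in for a TinyLM, and the names `UnigramLM`, `filter_by_perplexity`, and the threshold value are hypothetical. The idea it shows is the one described above: fit a small model on clean text, score each synthetic document by perplexity, and keep only documents below a cutoff.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy stand-in for a TinyLM (assumption): an add-one-smoothed unigram model."""

    def __init__(self, clean_docs):
        tokens = [tok for doc in clean_docs for tok in doc.lower().split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        # Reserve one extra vocabulary slot for unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def perplexity(self, doc):
        tokens = doc.lower().split()
        if not tokens:
            return float("inf")
        denom = self.total + self.vocab_size
        log_prob = sum(math.log((self.counts[tok] + 1) / denom) for tok in tokens)
        return math.exp(-log_prob / len(tokens))


def filter_by_perplexity(lm, synthetic_docs, threshold):
    """Keep synthetic documents whose perplexity under the clean-data model
    does not exceed the (hypothetical) threshold."""
    return [doc for doc in synthetic_docs if lm.perplexity(doc) <= threshold]


clean = ["the cat sat on the mat", "the dog sat on the rug"]
synthetic = ["the cat sat on the rug", "zzz qqq xxx yyy"]

tiny_lm = UnigramLM(clean)
kept = filter_by_perplexity(tiny_lm, synthetic, threshold=10.0)
print(kept)
```

In practice the scoring model would be a small neural LM and the threshold would be tuned on held-out data; both choices here are placeholders.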

Key Findings

The paper reports three main findings:

LMs trained on synthetic data, even before filtering, perform competitively, lagging by only 3.56% on NLU tasks and 1.51% on NLG tasks compared to those trained on clean data.
Filtering the synthetic data with TinyLMs significantly improves performance, narrowing the gap between models trained on synthetic data and models trained on clean data.
Further pre-training the synthetic-data LMs on a small fraction (10%) of clean data substantially improves their performance across tasks.
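
The extended pre-training setup can be sketched as a two-stage data schedule. This is a rough illustration, not the paper's procedure: the function name `build_two_stage_schedule`, the uniform random sampling, and the reading of "10%" as a share of the clean corpus by document count are all assumptions.

```python
import random


def build_two_stage_schedule(synthetic_docs, clean_docs, clean_fraction=0.10, seed=0):
    """Return the data for each stage: all synthetic documents first,
    then a random sample of the clean documents (assumed 10% by count)."""
    rng = random.Random(seed)
    n_clean = max(1, round(len(clean_docs) * clean_fraction))
    return {
        "stage1_synthetic": list(synthetic_docs),
        "stage2_clean": rng.sample(list(clean_docs), n_clean),
    }


synthetic_docs = [f"synthetic-{i}" for i in range(1000)]
clean_docs = [f"clean-{i}" for i in range(100)]

schedule = build_two_stage_schedule(synthetic_docs, clean_docs)
print(len(schedule["stage1_synthetic"]), len(schedule["stage2_clean"]))
```

A training loop would consume `stage1_synthetic` to completion and then continue from the same checkpoint on `stage2_clean`.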

Bridging the Data Gap with IndicMonoDoc

The authors also release IndicMonoDoc, a collection of monolingual document-level corpora totaling 39.5 billion tokens across 22 scheduled languages plus English. The dataset is significantly larger than its predecessors and is intended to give researchers and practitioners the data needed to train more capable and inclusive LLMs.

Implications and Future Directions

This work underscores the potential of synthetic data, particularly translationese, as a way around the data scarcity that limits LM training for low-resource languages. The findings suggest that, with effective filtering and supplementary training on clean data, models trained on synthetic data can come close to those trained on large volumes of clean data. These insights open avenues for future research on synthetic data generation and filtering techniques, with the aim of extending language technology to a broader range of languages.

Furthermore, the release of IndicMonoDoc offers a valuable resource that could catalyze advances in language modeling for a wide array of languages, marking a significant step toward bridging the performance gap between high-resource and low-resource languages. As the field progresses, exploring how these techniques scale to larger models and apply to more diverse language families will be crucial to achieving genuinely global language technology.
