Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese (2403.13638v2)
Abstract: In this paper, we explore the utility of Translationese, synthetic data created using machine translation, for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train LMs containing 28M and 85M parameters on this translationese data (synthetic). We show that their downstream performance is only 3.56% poorer on natural language understanding (NLU) tasks and 1.51% poorer on natural language generation (NLG) tasks than that of LMs pre-trained on clean data. Further, we propose using lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently, which significantly improves the performance of our models. We also find that LMs trained on synthetic data benefit strongly from extended pre-training on a tiny fraction (10%) of clean data. As part of this work, we release the data we collected and created, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for LLMs.
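The filtering step mentioned in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how a TinyLM scores a document, so the sketch assumes perplexity under a model fit on clean text, and it swaps in a toy add-one-smoothed unigram model in place of an actual TinyLM. The names `train_unigram`, `perplexity`, `filter_synthetic`, and the threshold `max_ppl` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(clean_docs):
    """Fit a toy add-one-smoothed unigram model on clean text.

    Stand-in for a TinyLM pre-trained on clean data (assumption).
    Returns a function mapping a token to its log-probability.
    """
    counts = Counter(tok for doc in clean_docs for tok in doc.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen tokens

    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))

    return logprob


def perplexity(logprob, doc):
    """Per-token perplexity of a whitespace-tokenized document."""
    toks = doc.split()
    if not toks:
        return float("inf")  # treat empty documents as unscorable
    return math.exp(-sum(logprob(t) for t in toks) / len(toks))


def filter_synthetic(clean_docs, synthetic_docs, max_ppl):
    """Keep synthetic documents whose perplexity is at most max_ppl.

    max_ppl is a hypothetical threshold, not a value from the paper.
    """
    logprob = train_unigram(clean_docs)
    return [d for d in synthetic_docs if perplexity(logprob, d) <= max_ppl]


if __name__ == "__main__":
    clean = ["the cat sat on the mat", "the dog sat on the rug"]
    synthetic = ["the cat sat on the rug", "zxq vvv qqq"]
    print(filter_synthetic(clean, synthetic, max_ppl=10.0))
```

In this toy setup, a synthetic document that resembles the clean text scores a low perplexity and is kept, while one made entirely of unseen tokens scores high and is dropped. A real pipeline would replace the unigram model with a small neural LM and choose the threshold empirically.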
- Meet Doshi
- Raj Dabre
- Pushpak Bhattacharyya