
Poro 34B and the Blessing of Multilinguality

(2404.01856)
Published Apr 2, 2024 in cs.CL

Abstract

The pretraining of state-of-the-art LLMs now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possible to substantially improve over the capabilities of monolingual models for small languages through multilingual training. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that not only substantially advances over the capabilities of existing models for Finnish, but also excels in translation and is competitive in its class in generating English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.

Figure: Poro 34B's performance on FIN-bench compared to the FinGPT and BLUUMI models.

Overview

  • Poro 34B is a multilingual generative language model with 34 billion parameters designed to enhance Finnish language processing and achieve competitive performance in English and programming languages.

  • The model was trained on a diverse pretraining corpus of 1 trillion tokens, including Finnish, English, programming languages, and Finnish-English translation pairs to facilitate cross-lingual capabilities.

  • Poro 34B leverages a decoder-only architecture and was evaluated on tasks across Finnish, English, and code generation, showcasing superior performance in Finnish and competitive capabilities in other languages.

  • The study highlights the benefits of multilingual training for languages with limited resources and suggests future research directions for scaling this approach and further exploring its advantages in translation and generative tasks.

Poro 34B: Advancing Language Model Capabilities for Finnish through Multilingual Training

Introduction

The development of large-scale generative language models has increasingly faced the challenge of data scarcity, particularly for languages other than English. This study introduces Poro 34B, a model that leverages multilingual training not only to enhance capabilities in Finnish, a relatively low-resource language in the context of LLMs, but also to demonstrate competitiveness in English and programming language tasks. By training across Finnish, English, and various programming languages, Poro 34B addresses both the theoretical and practical aspects of multilingual training, challenging the prevailing view of multilinguality as detrimental to performance in individual languages.

Pretraining Data

The pretraining corpus for Poro 34B spans 1 trillion tokens of Finnish, English, and programming language data, with an emphasis on high-quality, deduplicated, and filtered datasets to maximize the model's learning potential. The Finnish data, totaling 32 billion tokens, was sourced from web crawls and a range of curated datasets. English and programming languages account for the majority of the training data, with code included to add syntactic diversity. An explicit cross-lingual signal was incorporated through Finnish-English translation pairs, supporting the model's capabilities in both languages as well as in translation tasks.
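To make the composition concrete, the following is a minimal back-of-the-envelope sketch of how such a mixture can be reasoned about. Only the 1 trillion total and the 32 billion Finnish tokens come from the text above; the English, code, and translation-pair counts and the number of Finnish repetitions are illustrative placeholders, not the paper's exact figures.

```python
# Back-of-the-envelope view of a multilingual pretraining mixture.
# Only TOTAL_TOKENS and the Finnish count are taken from the text;
# the remaining figures are illustrative placeholders.
TOTAL_TOKENS = 1_000_000_000_000  # 1 trillion tokens seen during training

unique_tokens = {
    "finnish": 32e9,             # from the text
    "english": 500e9,            # placeholder
    "code": 300e9,               # placeholder
    "translation_pairs": 8e9,    # placeholder
}

# When a low-resource subset is repeated (oversampled) during training,
# its share of the training schedule exceeds its share of unique data.
finnish_epochs = 4  # illustrative; repeating scarce data a few times is common practice
finnish_tokens_seen = unique_tokens["finnish"] * finnish_epochs

print(f"Finnish share of unique data: "
      f"{unique_tokens['finnish'] / sum(unique_tokens.values()):.1%}")
print(f"Finnish share of the 1T-token schedule at {finnish_epochs} epochs: "
      f"{finnish_tokens_seen / TOTAL_TOKENS:.1%}")
```

This illustrates the central point of the data strategy: a language with only tens of billions of tokens cannot fill a trillion-token budget on its own, so the remainder must come from other languages and from code.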

Methodology

Poro 34B is a 34-billion-parameter model with a decoder-only architecture, mirroring recent advances in generative model design. Training covered 1 trillion tokens, well beyond current recommendations for compute-optimal training, a deliberate choice that trades additional training compute for better inference efficiency. A custom tokenizer was built for the multilingual corpus with the goal of low fertility (few subword tokens per word) across the languages of interest, which is fundamental for efficient and effective generation. Training hyperparameters and architectural details otherwise follow established best practices, adapted to the multilingual training setup.
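As a rough sanity check on the over-training point: the commonly cited compute-optimal rule of thumb of roughly 20 tokens per parameter would put a 34B-parameter model at about 680 billion tokens, so a 1-trillion-token run is comfortably past that mark. Tokenizer fertility can likewise be measured directly against the released checkpoint; the snippet below is a minimal sketch using the Hugging Face transformers API, with illustrative sample sentences.

```python
# Minimal sketch: estimating tokenizer fertility (tokens per whitespace word)
# for the released Poro 34B tokenizer. Sample sentences are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LumiOpen/Poro-34B")

samples = {
    "en": "The pretraining of state-of-the-art language models requires trillions of words.",
    "fi": "Suurten kielimallien esikoulutus vaatii biljoonia sanoja tekstiä.",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.tokenize(text))
    n_words = len(text.split())
    # Fertility: lower values mean fewer subword splits per word.
    print(f"{lang}: {n_tokens / n_words:.2f} tokens/word")
```

Lower fertility for Finnish relative to an English-centric tokenizer means shorter token sequences for the same text, and therefore a longer effective context and cheaper generation for Finnish.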

Model Evaluation

Poro 34B underwent comprehensive evaluation across Finnish, English, and code generation tasks, demonstrating clearly superior performance in Finnish and competitive results in English and programming languages. Its results on Finnish benchmarks illustrate the substantial advantage multilingual training can offer lower-resource languages. In English and code generation, the model remained competitive within its class, supporting the case for incorporating diverse linguistic data when training broad-spectrum generative models. Poro 34B also showed strong translation performance between English and Finnish, underscoring the benefit of including translation pairs in pretraining.
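The translation capability is straightforward to probe with a few-shot prompt against the released checkpoint. The sketch below uses the Hugging Face transformers API; the prompt template is an assumption for illustration and not necessarily the format used in the paper's evaluation.

```python
# Minimal sketch: few-shot English-to-Finnish translation with the released
# Poro 34B checkpoint. The prompt format is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LumiOpen/Poro-34B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "English: Good morning.\nFinnish: Hyvää huomenta.\n"
    "English: Thank you very much.\nFinnish: Kiitos paljon.\n"
    "English: The weather is beautiful today.\nFinnish:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Decode only the newly generated continuation.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion.strip())
```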

Practical Implications and Future Directions

The successful development of Poro 34B through multilingual training points the way toward robust large-scale language models for languages with relatively limited resources. The model's strong performance across languages and generative tasks highlights the potential of multilingual data to work around the data scarcity that constrains individual languages. Future research could examine how this approach scales to additional languages and investigate more deeply which factors drive the model's performance in translation and cross-lingual generative tasks.

Conclusion

Poro 34B represents a significant advancement in the utilization of multilingual training to enhance language model capabilities beyond high-resource languages. By effectively leveraging a diverse pretraining corpus, the model has set new benchmarks in Finnish language processing, while also achieving noteworthy performance in English and programming languages. This study not only underscores the potential of multilingual training to expand the horizons of language model capabilities but also provides a blueprint for future explorations in leveraging multilingual data for language model development.

