The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments (2404.07982v4)
Abstract: Multilinguality is crucial for extending recent advancements in language modelling to diverse linguistic communities. To maintain high performance while representing multiple languages, multilingual models ideally align representations, allowing what is learned in one language to generalise to others. Prior research has emphasised the importance of parallel data and shared vocabulary elements as key factors for such alignment. In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance. In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages. Furthermore, we find that this trend is amplified with scale: with large enough models or long enough training, we observe that bilingual training data with a 90/10 language split yields better performance on both languages than a balanced 50/50 split. Building on these insights, we design training schemes that can improve performance in all cloned languages, even without altering the training data. As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation in this setting remains inconclusive.
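To make the cloned-language setup concrete, the sketch below shows one plausible way such a corpus could be built: every token of the base language is mapped into a disjoint "clone" ID range, so the two languages are perfectly equivalent but share no surface tokens, and each document is assigned to the main or cloned language according to the desired split (e.g. 90/10). This is a minimal illustration, not the paper's released code; the vocabulary size, the offset scheme, and the helper names (`to_clone`, `make_bilingual_corpus`) are assumptions made for the example.

```python
# Minimal sketch of a cloned-language corpus (illustrative assumptions,
# not the authors' implementation).
import random

VOCAB_SIZE = 50_000          # assumed base vocabulary size
CLONE_OFFSET = VOCAB_SIZE    # clone tokens live in [VOCAB_SIZE, 2 * VOCAB_SIZE)

def to_clone(token_ids: list[int]) -> list[int]:
    """Map a tokenised document in the base language to its cloned-language twin."""
    return [t + CLONE_OFFSET for t in token_ids]

def make_bilingual_corpus(corpus: list[list[int]],
                          main_fraction: float = 0.9,
                          seed: int = 0) -> list[list[int]]:
    """Assign each document to the main language with probability `main_fraction`
    (e.g. 0.9 for a 90/10 split) and to the cloned language otherwise."""
    rng = random.Random(seed)
    mixed = []
    for doc in corpus:
        if rng.random() < main_fraction:
            mixed.append(doc)            # main (frequent) language
        else:
            mixed.append(to_clone(doc))  # cloned (infrequent) language
    return mixed
```

Under this construction, varying `main_fraction` (0.5 for the balanced split, 0.9 for the imbalanced one) changes only the language mix, leaving the underlying text identical, which is what allows language imbalance to be isolated as the experimental variable.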
Authors: Anton Schäfer, Shauli Ravfogel, Thomas Hofmann, Tiago Pimentel, Imanol Schlag