The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments (2404.07982v4)

Published 11 Apr 2024 in cs.CL and cs.LG

Abstract: Multilinguality is crucial for extending recent advancements in language modelling to diverse linguistic communities. To maintain high performance while representing multiple languages, multilingual models ideally align representations, allowing what is learned in one language to generalise to others. Prior research has emphasised the importance of parallel data and shared vocabulary elements as key factors for such alignment. In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance. In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages. Furthermore, we find that this trend is amplified with scale: with large enough models or long enough training, we observe that bilingual training data with a 90/10 language split yields better performance on both languages than a balanced 50/50 split. Building on these insights, we design training schemes that can improve performance in all cloned languages, even without altering the training data. As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there is not conclusive.

Authors (5)
  1. Anton Schäfer (3 papers)
  2. Shauli Ravfogel (38 papers)
  3. Thomas Hofmann (121 papers)
  4. Tiago Pimentel (55 papers)
  5. Imanol Schlag (20 papers)
Citations (1)

Summary

Unveiling the Impact of Language Imbalance on Cross-Lingual Generalisation in Multilingual Models

Introduction to Cross-Lingual Generalisation in LLMs

Recent advances in language model (LM) development have significantly improved performance across a wide range of natural language processing tasks. However, the multilingual capabilities of these models remain a critical concern, especially for ensuring that advancements benefit users across diverse linguistic backgrounds. The crux of enhancing multilingual models lies in their ability to exhibit cross-lingual generalisation, where knowledge gleaned from one language aids understanding or task performance in another. This paper explores the effects of language imbalance on cross-lingual generalisation, presenting novel insights into how the distribution of languages during training influences generalisation and representation alignment across languages.

Exploring Cross-Lingual Generalisation

Cross-lingual generalisation hinges on the model's ability to leverage linguistic similarities and shared structures between languages. Previous research has highlighted the role of parallel data and shared vocabulary elements in promoting such generalisation. This paper, however, introduces an often-overlooked factor, language imbalance, as a potent driver of cross-lingual learning. Through experimental analyses on cloned and real language pairs, the paper demonstrates that a predominant language during training can enhance the performance of less frequently represented languages, suggesting a complex interplay between language sampling strategies and model learning dynamics.
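
In practice, the degree of language imbalance seen during training is often controlled through the sampling distribution over languages. The sketch below shows a common temperature-style heuristic for deriving language sampling probabilities from corpus sizes; it is not code from this paper, and the corpus sizes and alpha values are illustrative assumptions.

```python
# Minimal sketch: temperature-based language sampling, a common way to
# interpolate between a corpus's natural (imbalanced) language distribution
# and a uniform (balanced) one. Not taken from the paper under discussion.

def sampling_probs(corpus_sizes, alpha=0.3):
    """Raise each language's corpus share to the power `alpha`, then renormalise.

    alpha = 1.0 keeps the natural (imbalanced) distribution;
    alpha = 0.0 forces a uniform (balanced) split.
    """
    total = sum(corpus_sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Example with an assumed 90/10 token split between two languages:
print(sampling_probs({"en": 900_000, "fr": 100_000}, alpha=1.0))  # ~{'en': 0.9, 'fr': 0.1}
print(sampling_probs({"en": 900_000, "fr": 100_000}, alpha=0.0))  # {'en': 0.5, 'fr': 0.5}
```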

Investigating Language Imbalance in Cloned Languages

In a controlled setting using cloned languages (i.e., languages artificially created to have identical grammar and semantics but disjoint vocabularies), the paper observes that language imbalance, that is, having a dominant language during training, improves generalisation to less frequent languages. Notably, this effect is amplified as model size and training duration increase, challenging the conventional wisdom advocating balanced multilingual training sets. Furthermore, by introducing specific training schemes, the paper shows that performance can be improved across all languages without modifying the training data, shedding light on the strategic implications of language imbalance for model training.
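
To make the cloned-language setup concrete, the following sketch simulates it by giving every token a disjoint duplicate ID and mixing original and cloned sequences at a chosen ratio. This is a simplified illustration consistent with the setup described above, not the authors' implementation; VOCAB_SIZE, the helper names, and the 90/10 default are assumptions.

```python
import random

VOCAB_SIZE = 32_000  # assumed size of the base vocabulary

def clone_tokens(token_ids):
    """Map a tokenised sequence into the 'cloned' language by offsetting IDs.

    The clone is statistically identical to the original but occupies a fully
    disjoint vocabulary range [VOCAB_SIZE, 2 * VOCAB_SIZE).
    """
    return [tid + VOCAB_SIZE for tid in token_ids]

def sample_bilingual_batch(sequences, p_main=0.9):
    """Assemble a batch with an imbalanced split (default 90/10) between the
    main language and its clone."""
    batch = []
    for seq in sequences:
        if random.random() < p_main:
            batch.append(seq)                # main language: original IDs
        else:
            batch.append(clone_tokens(seq))  # cloned language: offset IDs
    return batch
```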

Extension to Real Languages

Transitioning from cloned to real languages, the paper examines whether these phenomena extend to natural language pairs, such as English and French. Although infrequent languages still appear to benefit from the presence of a high-resource language, the direct causal impact of language imbalance on cross-lingual generalisation becomes less clear. This discrepancy suggests the influence of additional factors inherent to natural languages, such as linguistic diversity and complexity, which may moderate the effect of language imbalance on model performance.

Implications and Future Directions

This research underscores the multifaceted nature of language learning dynamics within multilingual models, particularly the unintuitive role of language imbalance in enhancing cross-lingual generalisation. From a practical standpoint, these findings prompt a reevaluation of existing training paradigms, suggesting that imbalanced language representation may, under certain conditions, be beneficial. Looking ahead, this opens avenues for further inquiry into tailored training regimes that exploit language imbalance, with the potential to refine model performance across a broader spectrum of languages. Moreover, the nuanced understanding of generalisation mechanisms invites deeper exploration into model architecture and training strategies that can accommodate the intricate demands of multilingual representation and learning.

Conclusion

The paper presents compelling evidence that language imbalance during training can serve as a catalyst for cross-lingual generalisation in LLMs. By charting the territory beyond conventional training strategies, this research offers valuable insights into the complex dynamics of multilingual learning, setting the stage for future innovations in the development of more inclusive and effective language technologies.
