LEIA: Facilitating Cross-lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation (2402.11485v2)

Published 18 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Adapting English-based LLMs to other languages has become increasingly popular due to the efficiency and potential of cross-lingual transfer. However, existing language adaptation methods often overlook the benefits of cross-lingual supervision. In this study, we introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages. This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling. We assess LEIA on diverse question answering datasets using 7B-parameter LLMs, demonstrating significant performance gains across various non-English languages. The source code is available at https://github.com/studio-ousia/leia.
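To make the augmentation step concrete, the sketch below shows one way target-language text annotated with Wikipedia entity mentions could be augmented with aligned English entity names before left-to-right language-model training. This is a minimal illustration, not the authors' implementation (see the repository linked above); the special tokens, data structures, and function names are assumptions made for the example.

# Hedged sketch of entity-based data augmentation in the spirit of LEIA.
# The "<translate>"/"</translate>" tokens and all helper names below are
# illustrative assumptions, not taken from the paper or its codebase.

from dataclasses import dataclass

@dataclass
class EntityMention:
    start: int           # character offset where the mention begins
    end: int             # character offset one past the mention
    target_title: str    # Wikipedia title in the target language

def augment_with_english_entities(
    text: str,
    mentions: list[EntityMention],
    interlanguage_links: dict[str, str],   # target-language title -> English title
    open_tok: str = "<translate>",
    close_tok: str = "</translate>",
) -> str:
    """Insert the aligned English entity name right after each mention."""
    out, cursor = [], 0
    for m in sorted(mentions, key=lambda m: m.start):
        english_title = interlanguage_links.get(m.target_title)
        out.append(text[cursor:m.end])
        if english_title is not None:
            out.append(f" {open_tok}{english_title}{close_tok}")
        cursor = m.end
    out.append(text[cursor:])
    return "".join(out)

# Example: a Japanese sentence whose mention "東京" is aligned to the English title "Tokyo".
sentence = "東京は日本の首都です。"
mentions = [EntityMention(start=0, end=2, target_title="東京")]
links = {"東京": "Tokyo"}
print(augment_with_english_entities(sentence, mentions, links))
# -> 東京 <translate>Tokyo</translate>は日本の首都です。

Wrapping the inserted English name in dedicated tokens keeps it distinguishable from the surrounding target-language text during training; the resulting augmented corpus can then be fed to a standard left-to-right language-modeling objective.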
