
A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers (2405.10936v1)

Published 17 May 2024 in cs.CL and cs.AI

Abstract: The rapid development of LLMs demonstrates remarkable multilingual capabilities in natural language processing, attracting global attention in both academia and industry. To mitigate potential discrimination and enhance usability and accessibility for diverse language user groups, the development of language-fair technology is important. Despite breakthroughs in LLMs, the investigation into the multilingual scenario remains insufficient, and a comprehensive survey summarizing recent approaches, developments, limitations, and potential solutions is desirable. To this end, we provide a survey with multiple perspectives on the utilization of LLMs in the multilingual scenario. We first rethink the transitions between previous and current research on pre-trained LLMs. Then we introduce several perspectives on the multilingualism of LLMs, including training and inference methods, model security, multi-domain with language culture, and usage of datasets. We also discuss the major challenges that arise in these aspects, along with possible solutions. In addition, we highlight future research directions that aim to further enhance LLMs with multilingualism. The survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.

Understanding Multilingual Capabilities in LLMs

LLMs like GPT-3, GPT-4, and LLaMA have transformed NLP in ways we could not have imagined a few years ago. Yet, despite their profound impact, there’s a significant gap when it comes to these models’ performance across different languages. A recent comprehensive survey aims to tackle this issue by exploring the multilingual capabilities of LLMs from various perspectives, such as training paradigms, inference strategies, security, and multi-domain applications.

Training Paradigms

From Scratch

Training LLMs from scratch with multilingual data involves incorporating diverse languages from the outset. A notable example is XLM, which uses Translation Language Modeling (TLM) to enhance cross-lingual capabilities. Similarly, PolyLM employs curriculum learning to balance language data during pre-training. This approach, however, underscores a crucial challenge: obtaining vast, high-quality multilingual datasets, especially for low-resource languages.
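
To make the TLM objective concrete, here is a minimal sketch (plain Python; MASK_ID and SEP_ID are hypothetical special-token ids) of how one training example concatenates a parallel sentence pair and masks tokens in both halves, so the model can attend across languages to recover the masks:

```python
import random

MASK_ID, SEP_ID = 0, 1  # hypothetical special-token ids, for illustration only

def make_tlm_example(src_ids, tgt_ids, mask_prob=0.15, seed=None):
    """Build one Translation Language Modeling (TLM) example, XLM-style:
    concatenate a parallel sentence pair and mask tokens in both halves,
    so the model can attend across languages to recover the masks."""
    rng = random.Random(seed)
    tokens = src_ids + [SEP_ID] + tgt_ids
    inputs, labels = [], []
    for tok in tokens:
        if tok != SEP_ID and rng.random() < mask_prob:
            inputs.append(MASK_ID)  # masked position the model must predict
            labels.append(tok)      # supervision target
        else:
            inputs.append(tok)
            labels.append(-100)     # conventional "ignore" index for the loss
    return inputs, labels

# Toy usage: the ids stand in for a tokenized English/French sentence pair.
print(make_tlm_example([5, 6, 7, 8], [9, 10, 11], mask_prob=0.4, seed=42))
```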

Continual Training

An efficient alternative to training from scratch is continual training, which builds on a foundational model with new multilingual data. This method leverages existing knowledge while updating the model with additional language data. For instance, BigTrans and Chinese-LLaMA extend pre-trained models, improving their multilingual abilities without incurring the enormous costs of retraining. Nevertheless, this approach must contend with catastrophic forgetting, where new knowledge overwrites previously learned information, and with data scarcity in low-resource language settings.
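
One common mitigation for catastrophic forgetting is to replay a fraction of the original pre-training distribution alongside the new multilingual data. A minimal sketch of that data-mixing idea, where the replay ratio is an illustrative choice rather than a value from the survey:

```python
import random

def replay_mixture(new_lang_batches, replay_batches, replay_ratio=0.25, seed=0):
    """Interleave batches of new multilingual data with 'replay' batches drawn
    from the original pre-training distribution. Revisiting old data is one
    common mitigation for catastrophic forgetting; the ratio is illustrative."""
    rng = random.Random(seed)
    for batch in new_lang_batches:
        if rng.random() < replay_ratio:
            yield rng.choice(replay_batches)  # revisit the old distribution
        yield batch  # learn from the new languages
```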

Inference Strategies

Direct Inference

Direct inference, where models process text natively in multiple languages, is becoming more viable with advances in LLMs. Models like GPT-4 and PaLM-2 show promising results. Direct inference preserves linguistic nuances and ensures efficient processing by eliminating the translation step, but performance can still suffer in low-resource languages.

Pre-Translation

Pre-translation approaches convert input text into a high-resource language, typically English, before processing. While this lets models leverage their strongest language proficiency, it introduces a dependency on high-quality translation tools, and translation errors can distort meaning.
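
The pipeline itself is simple; the risk is that errors in either translation step compound. A minimal sketch, where llm and translate are hypothetical callables standing in for any LLM endpoint and MT system:

```python
def pre_translation_inference(text, src_lang, llm, translate, pivot="en"):
    """Pivot the input through a high-resource language before querying the
    model, then translate the answer back. `llm` and `translate` are
    hypothetical callables standing in for any LLM endpoint and MT system;
    errors introduced in either translation step propagate to the output."""
    pivoted = translate(text, src=src_lang, tgt=pivot)  # 1. to the pivot language
    answer = llm(pivoted)                               # 2. infer in the pivot language
    return translate(answer, src=pivot, tgt=src_lang)   # 3. back to the source language

# Stub usage (identity functions) just to show the control flow:
print(pre_translation_inference("Hola", "es",
                                llm=lambda p: p,
                                translate=lambda t, src, tgt: t))
```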

Multilingual CoT

Chain-of-Thought (CoT) prompting, initially successful in monolingual settings, has been adapted to multilingual contexts by issuing the reasoning instruction either in the input's native language or in English (e.g., "Let's think step by step"). Effectiveness varies, with better results generally reported when the instruction is in English.
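
A sketch of how such prompts might be assembled, defaulting to the English trigger since that tends to perform better; the native-language triggers shown are illustrative translations, not values taken from the survey:

```python
COT_TRIGGERS = {
    "en": "Let's think step by step.",
    # Native-language triggers; these translations are illustrative.
    "de": "Denken wir Schritt für Schritt nach.",
    "zh": "让我们一步一步思考。",
}

def build_cot_prompt(question, lang, instruct_in_english=True):
    """Compose a multilingual chain-of-thought prompt. The question stays in
    its original language; the reasoning trigger defaults to English, which
    the surveyed results generally favor over native-language instructions."""
    key = "en" if instruct_in_english else lang
    trigger = COT_TRIGGERS.get(key, COT_TRIGGERS["en"])
    return f"{question}\n{trigger}"

print(build_cot_prompt("Wie viele Beine haben drei Hunde?", "de"))
```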

Retrieval Augmented Generation (RAG)

RAG enhances LLMs by integrating external knowledge during text generation. This approach shows significant promise, especially for low-resource languages, where models are more prone to hallucinations and factual inaccuracies.
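
A minimal sketch of the retrieval loop, assuming hypothetical embed and llm callables; in a multilingual setting, embed would be a cross-lingual encoder so a query in one language can match passages written in another:

```python
def rag_answer(question, passages, embed, llm, k=3):
    """Minimal retrieval-augmented generation loop: score every passage
    against the question and ground the prompt in the top-k hits. `embed`
    and `llm` are hypothetical callables. Real systems replace this linear
    scan with an approximate-nearest-neighbor index."""
    q = embed(question)
    ranked = sorted(passages,
                    key=lambda p: sum(a * b for a, b in zip(q, embed(p))),
                    reverse=True)
    context = "\n".join(ranked[:k])
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```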

Code-Switching

Handling code-switching in multilingual dialogue settings, where speakers switch between languages, remains challenging for LLMs. Recent work shows that even powerful models struggle without tailored fine-tuning.

Security

Attack Methods

LLMs are vulnerable to various attacks, including jailbreaks, which trick them into bypassing safety protocols. Prompt-based methods, gradient-based methods like Greedy Coordinate Gradient (GCG), and multilingual-specific attacks expose these vulnerabilities. Prompts written in certain languages, particularly low-resource ones, often slip past safety checks because safety fine-tuning is less extensive for those languages.

Defense Methods

Defense strategies range from enhanced training protocols to real-time input analysis, but no foolproof method exists yet. Methods like SmoothLLM show promise, perturbing copies of the input prompt and aggregating the model's responses to avoid producing unsafe outputs.
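
A sketch in the spirit of SmoothLLM, with hypothetical llm and is_unsafe callables: query several randomly perturbed copies of the prompt and refuse when a majority of responses are flagged, exploiting the fact that adversarial suffixes tend to be brittle under character-level noise. The parameter values are illustrative:

```python
import random

def smoothllm_style_defense(prompt, llm, is_unsafe, n_copies=5, swap_frac=0.1, seed=0):
    """Defense in the spirit of SmoothLLM: query several randomly perturbed
    copies of the prompt and refuse when a majority of responses are flagged
    unsafe. Adversarial suffixes tend to be brittle under character-level
    noise, which the random swaps exploit. `llm` and `is_unsafe` are
    hypothetical callables."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_copies):
        chars = list(prompt)
        n_swaps = max(1, int(swap_frac * len(chars)))
        for i in rng.sample(range(len(chars)), n_swaps):
            chars[i] = chr(rng.randrange(32, 127))  # random printable character
        if is_unsafe(llm("".join(chars))):
            flagged += 1
    return "Request refused." if flagged > n_copies // 2 else llm(prompt)
```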

Multi-Domain Applications

In specialized fields like medicine and law, building effective multilingual models poses challenges that go beyond general LLM tasks. Models like MMedLM2 and BioMistral adapt LLMs to medical contexts across multiple languages, showing significant improvement within their domains. However, acquiring high-quality multilingual data remains a major hurdle, exacerbated by cultural and contextual intricacies unique to each language.

Data Resources and Benchmarking

The scarcity of large, high-quality multilingual datasets is a significant bottleneck. Datasets like MultiLegalPile and XMedBench offer initial steps towards bridging this gap. Comprehensive benchmarks that account for cultural and contextual factors across languages need to be developed to accurately reflect LLM performance in multilingual environments.

Bias and Fairness

Addressing biases in multilingual LLMs involves understanding both language-specific and demographic biases. While techniques like up-sampling low-resource data and adversarial training yield incremental improvements, the field still lacks robust tools and datasets to fully mitigate these biases.
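
Up-sampling is often implemented as exponentiated ("temperature") sampling over per-language data proportions: raising each proportion to a power below one shifts probability mass toward low-resource languages. A minimal sketch; the alpha value is illustrative, in the range used by multilingual pre-training recipes such as XLM-R:

```python
def temperature_sampling_probs(token_counts, alpha=0.3):
    """Exponentiated ('temperature') sampling over per-language data sizes:
    raising each language's share to a power alpha < 1 up-samples
    low-resource languages relative to their natural frequency. alpha = 0.3
    is in the range used by multilingual pre-training recipes such as
    XLM-R, but treat the value here as illustrative."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# A low-resource language (sw) gains probability mass under alpha = 0.3:
print(temperature_sampling_probs({"en": 1_000_000, "sw": 10_000}))
```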

Conclusion

Despite the impressive strides made in multilingual capabilities of LLMs, much work remains. Researchers must continue exploring advanced training strategies and inference methods while developing robust evaluation benchmarks and addressing biases to truly achieve language-fair AI. For both academia and industry, fostering collaboration and sharing resources will be crucial in overcoming these challenges and unlocking the full potential of LLMs in multilingual contexts.

  328. Fingpt: Open-source financial large language models. arXiv preprint arXiv:2306.06031, 2023d.
  329. Fingpt: Democratizing internet-scale data for financial large language models. arXiv preprint arXiv:2307.10485, 2023e.
  330. Disc-finllm: A chinese financial large language model based on multiple experts fine-tuning. arXiv preprint arXiv:2310.15205, 2023b.
  331. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023a.
  332. Investlm: A large language model for investment using financial domain instruction tuning. arXiv preprint arXiv:2309.13064, 2023e.
  333. Chimed-gpt: A chinese medical large language model with full training regime and better alignment to human preferences. arXiv preprint arXiv:2311.06025, 2023.
  334. Alpacare:instruction-tuned large language models for medical application, 2023h.
  335. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. Journal of the American Medical Informatics Association, page ocae037, 2024.
  336. Mentalllama: Interpretable mental health analysis on social media with large language models. arXiv preprint arXiv:2309.13567, 2023f.
  337. Chatcounselor: A large language models for mental health support, 2023f.
  338. Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075, 2023i.
  339. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097, 2023.
  340. Disc-lawllm: Fine-tuning large language models for intelligent legal services, 2023.
  341. Sailer: Structure-aware pre-trained language model for legal case retrieval, 2023f.
  342. Lawyer llama technical report. ArXiv, abs/2305.15062, 2023e.
  343. Hanfei-1.0. https://github.com/siat-nlp/HanFei, 2023b.
  344. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092, 2023b.
  345. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv preprint arXiv:2308.02773, 2023.
  346. Taoli llama. https://github.com/blcuicall/taoli, 2023c.
  347. Transgpt: Multi-modal generative pre-trained transformer for transportation. arXiv preprint arXiv:2402.07233, 2024b.
  348. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023c.
  349. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
  350. Clinicalgpt: large language models finetuned with diverse medical data and comprehensive evaluation. arXiv preprint arXiv:2306.09968, 2023e.
  351. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint arXiv:2304.14454, 2023b.
  352. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247, 2023.
  353. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070, 2023.
  354. Huatuogpt-ii, one-stage training for medical adaption of llms. arXiv preprint arXiv:2311.09774, 2023d.
  355. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020b.
  356. Lawyer llama technical report. arXiv preprint arXiv:2305.15062, 2023f.
  357. Saullm-7b: A pioneering large language model for law. arXiv preprint arXiv:2403.03883, 2024.
  358. Legal-bert: The muppets straight out of law school. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, 2020.
  359. A review on the application of deep learning in legal domain. In Artificial Intelligence Applications and Innovations: 15th IFIP WG 12.5 International Conference, AIAI 2019, Hersonissos, Crete, Greece, May 24–26, 2019, Proceedings 15, pages 374–381. Springer, 2019.
  360. Rules and norms: Requirements for rule interchange languages in the legal domain. In International Workshop on Rules and Rule Markup Languages for the Semantic Web, pages 282–296. Springer, 2009.
  361. Gpts and language barrier: A cross-lingual legal qa examination. arXiv preprint arXiv:2403.18098, 2024.
  362. A summary of the coliee 2019 competition. In New Frontiers in Artificial Intelligence: JSAI-isAI International Workshops, JURISIN, AI-Biz, LENLS, Kansei-AI, Yokohama, Japan, November 10–12, 2019, Revised Selected Papers 10, pages 34–49. Springer, 2020.
  363. Overview and discussion of the competition on legal information extraction/entailment (coliee) 2021. The Review of Socionetwork Strategies, 16(1):111–133, 2022.
  364. Lexglue: A benchmark dataset for legal language understanding in english. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, 2022b.
  365. Lost in translation: Large language models in non-english content analysis. arXiv preprint arXiv:2306.07377, 2023.
  366. Asya Pereltsvaig. Languages of the World. Cambridge University Press, 2020.
  367. Edgar W Schneider. English and colonialism. In The Routledge handbook of English language studies, pages 42–58. Routledge, 2018.
  368. Alastair Pennycook. English and the discourses of colonialism. Routledge, 2002.
  369. The state and fate of linguistic diversity and inclusion in the NLP world. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.560. URL https://aclanthology.org/2020.acl-main.560.
  370. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, 2021.
  371. Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus. In Donia Scott, Nuria Bel, and Chengqing Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6588–6608, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.579. URL https://aclanthology.org/2020.coling-main.579.
  372. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72, 2022.
  373. Participatory research for low-resourced machine translation: A case study in African languages. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.195. URL https://aclanthology.org/2020.findings-emnlp.195.
  374. Cultural bias in wikipedia content on famous persons. Journal of the American society for information science and technology, 62(10):1899–1915, 2011.
  375. Privacy in the time of language models. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 1291–1292, 2023.
  376. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, page 100211, 2024.
  377. On protecting the data privacy of large language models (llms): A survey. arXiv preprint arXiv:2403.05156, 2024.
  378. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697–10707. PMLR, 2022.
  379. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
  380. Fineweb, 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb.
  381. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  382. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024.
  383. Don’t trust chatgpt when your question is not in english: A study of multilingual abilities and types of llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7915–7927, 2023j.
  384. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems, 36, 2024c.
  385. Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738, 2023.
  386. Assessing multilingual fairness in pre-trained multimodal representations. In Proceedings of Annual Meeting of Association for Computational Linguistics, 2022c.
  387. Comparing biases and the impact of multilingual training across multiple languages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10260–10280, 2023.
  388. Evaluating interfaced llm bias. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), pages 292–299, 2023.
  389. Social biases in nlp models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, 2020.
  390. Stereoset: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, 2021.
  391. “kelly is a warm person, joseph is a role model”: Gender biases in llm-generated reference letters. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3730–3748, 2023.
  392. Rtp-lx: Can llms evaluate toxicity in multilingual scenarios? arXiv preprint arXiv:2404.14397, 2024.
  393. What is your favorite gender, mlm? gender bias evaluation in multilingual masked language models. arXiv preprint arXiv:2404.06621, 2024.
  394. Gender bias in multilingual embeddings and cross-lingual transfer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2896–2907, 2020.
  395. Are pretrained multilingual models equally fair across languages? In Proceedings of the 29th International Conference on Computational Linguistics, pages 3597–3605, 2022.
  396. On evaluating and mitigating gender biases in multilingual settings. arXiv preprint arXiv:2307.01503, 2023.
  397. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  398. Investigating bias in multilingual language models: Cross-lingual transfer of debiasing techniques. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2887–2896, 2023.
  399. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–876, 2018.
  400. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24, 2018.
  401. Richard W Brislin. Back-translation for cross-cultural research. Journal of cross-cultural psychology, 1:185–216, 1970.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors
1. Kaiyu Huang
2. Fengran Mo
3. Hongliang Li
4. You Li
5. Yuanchi Zhang
6. Weijian Yi
7. Yulong Mao
8. Jinchen Liu
9. Yuzhuang Xu
10. Jinan Xu
11. Jian-Yun Nie
12. Yang Liu