Exploring Multi-Lingual Bias of Large Code Models in Code Generation (2404.19368v1)

Published 30 Apr 2024 in cs.SE

Abstract: Code generation aims to synthesize code that fulfills functional requirements based on natural language (NL) specifications, which can greatly improve development efficiency. In the era of LLMs, large code models (LCMs) have recently been proposed to generate source code. LCMs can generate highly feasible solutions for programming problems described in natural language. Despite their effectiveness, we observe a noticeable multi-lingual bias in the generation performance of LCMs. Specifically, LCMs demonstrate proficiency in generating solutions when provided with instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese. Moreover, the ability of LCMs to generate code varies across different programming languages (PLs), such as Python and C++. The observed phenomenon indicates the presence of multi-lingual bias within the generative capabilities of LCMs, which has remained unexplored. In this paper, we aim to investigate the multi-lingual bias that exists in current LCMs. First, we construct the first multi-lingual evaluation benchmark, X-HumanEval-X, enabling us to systematically evaluate the extent of multi-lingual bias in current LCMs. In our large-scale experiments on nine popular LCMs, we observe a pronounced multi-lingual bias in code generation, comprising both multi-NL and multi-PL bias. Specifically, when using Chinese instructions, the code generation capabilities of LCMs decrease by at least 13% in terms of the Pass@1 metric. Furthermore, LCMs perform differently across programming languages, e.g., the performance gap between Python and C++ reaches as high as 20.9%. ...
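
For reference, Pass@1 (the metric cited in the abstract) is typically computed with the standard unbiased Pass@k estimator introduced alongside the HumanEval benchmark. The Python sketch below illustrates that computation; the sample counts and the results list are illustrative assumptions, not data from the paper.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased Pass@k estimator: n samples generated per problem,
        # c of which pass all unit tests, evaluated at budget k.
        if n - c < k:
            return 1.0  # every size-k subset contains at least one correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Hypothetical per-problem results: (samples generated, samples passing tests).
    results = [(10, 4), (10, 0), (10, 9)]
    # Benchmark-level Pass@1 is the mean of the per-problem estimates.
    print(sum(pass_at_k(n, c, 1) for n, c in results) / len(results))

Under this metric, the reported multi-NL bias means that benchmark-level Pass@1 drops by at least 13% when the same problems are posed with Chinese rather than English instructions.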

Authors (9)
  1. Chaozheng Wang (28 papers)
  2. Zongjie Li (29 papers)
  3. Cuiyun Gao (97 papers)
  4. Wenxuan Wang (128 papers)
  5. Ting Peng (14 papers)
  6. Hailiang Huang (21 papers)
  7. Yuetang Deng (10 papers)
  8. Shuai Wang (466 papers)
  9. Michael R. Lyu (176 papers)