CodeGeeX: A Multilingual Pre-Trained Model for Code Generation
The paper introduces CodeGeeX, a 13-billion-parameter multilingual model for code generation covering 23 programming languages. Evaluations on the HumanEval-X benchmark show that it outperforms open multilingual code models of comparable scale on both code generation and code translation.
Model Architecture and Training
CodeGeeX is built on the transformer architecture, using a 39-layer decoder-only design in the GPT style for autoregressive language modeling. It has a hidden size of 5,120, a vocabulary of 52,224 tokens, and supports sequence lengths of up to 2,048 tokens.
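As a rough illustration of these hyperparameters, the sketch below shows a minimal GPT-style decoder configuration and block; the class names, head count, and module composition are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of a decoder-only configuration matching the reported
# CodeGeeX-13B hyperparameters; module structure is illustrative only.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class CodeGeeXConfig:
    n_layers: int = 39          # transformer decoder layers
    hidden_size: int = 5120     # model dimension
    vocab_size: int = 52224     # BPE vocabulary over code and natural language
    max_seq_len: int = 2048     # maximum context length
    n_heads: int = 40           # assumed head count (hidden_size / 128)


class DecoderBlock(nn.Module):
    """One pre-norm decoder block with causal self-attention and an MLP."""

    def __init__(self, cfg: CodeGeeXConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.hidden_size)
        self.attn = nn.MultiheadAttention(cfg.hidden_size, cfg.n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(cfg.hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.hidden_size, 4 * cfg.hidden_size),
            nn.GELU(),
            nn.Linear(4 * cfg.hidden_size, cfg.hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```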
CodeGeeX was pre-trained on 850 billion tokens drawn from a corpus spanning 23 programming languages, using a cluster of 1,536 Ascend 910 AI processors. The corpus combines existing open-source code datasets with additional code collected directly from GitHub, giving the model diverse and comprehensive coverage during pre-training.
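A minimal sketch of how such a multilingual corpus might be packed into fixed-length training sequences is shown below; the language-tag prefix, the `encode` tokenizer callable, and the helper names are assumptions for illustration rather than the paper's actual data pipeline.

```python
# Illustrative packing of a multilingual code corpus into fixed-length
# training sequences; tag format and tokenizer interface are assumed.
from typing import Iterable, Iterator, List, Tuple

SEQ_LEN = 2048  # context length reported for CodeGeeX


def tag_source(language: str, source: str) -> str:
    """Prepend a language tag comment so the model can condition on the language."""
    return f"# language: {language}\n{source}"


def pack_sequences(
    files: Iterable[Tuple[str, str]],   # (language, source code) pairs
    encode,                             # callable: str -> List[int], e.g. a BPE tokenizer
    eos_id: int,
) -> Iterator[List[int]]:
    """Concatenate tokenized files and cut the stream into SEQ_LEN-token chunks."""
    buffer: List[int] = []
    for language, source in files:
        buffer.extend(encode(tag_source(language, source)))
        buffer.append(eos_id)
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]
            buffer = buffer[SEQ_LEN:]
```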
Multilingual Capabilities and HumanEval-X
To rigorously evaluate CodeGeeX in multilingual settings, the researchers developed HumanEval-X, an extension of the Python-only HumanEval benchmark to C++, Java, JavaScript, and Go. Its 164 problems are translated into each of these languages, so the benchmark supports both code generation and code translation, using functional correctness rather than textual similarity as the primary metric. CodeGeeX achieves favorable results, outperforming contemporaries such as GPT-J-6B, GPT-NeoX-20B, and InCoder-6.7B on multilingual code generation.
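Functional correctness in HumanEval-style benchmarks is typically reported as pass@k: generate n samples per problem, count the c that pass the unit tests, and apply the unbiased estimator introduced with Codex. The sketch below illustrates that estimator; the function names are conventional and not taken from the CodeGeeX release.

```python
# Unbiased pass@k estimator used for functional-correctness evaluation.
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) given c of n samples passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over all problems; each result is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```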
CodeGeeX Applications and User Interaction
CodeGeeX has been integrated into several development environments, including Visual Studio Code and JetBrains IDEs, through user-friendly extensions. These extensions provide code generation, translation, and explanation features to assist programmers; in user surveys, a large majority of respondents reported improved coding efficiency.
The model's practical utility is evident from its rapid adoption: it generates billions of tokens per week for an active user base, demonstrating that CodeGeeX is reliable and adaptable enough for real-world programming tasks.
Conclusion and Future Perspectives
CodeGeeX's multilingual approach highlights the potential of generating solutions across many formalized languages, but the paper also points to open questions, including how much model capacity multilinguality requires and how to improve transfer and mutual understanding between languages. As the research community continues to explore techniques such as chain-of-thought prompting, CodeGeeX provides a robust foundation for both academic inquiry and practical improvements in AI-driven code generation.