CodeGeeX: Multilingual Code Generation Model
- CodeGeeX is a large-scale multilingual model that generates and translates code across 23 programming languages using a decoder-only Transformer architecture.
- It is pre-trained on 850 billion tokens from diverse code repositories and evaluated with the HumanEval-X benchmark for functional code accuracy.
- Its open-source release and integration into major IDEs have driven high adoption, significantly enhancing coding efficiency for a global developer community.
CodeGeeX is a large-scale, multilingual code generation and translation model based on the Transformer architecture. Designed to generate and translate source code across 23 programming languages, CodeGeeX leverages 13 billion parameters and is pre-trained on 850 billion tokens from diverse code repositories. It introduces HumanEval-X, a multilingual extension of the HumanEval benchmark, for comprehensive evaluation of code generation and translation capabilities. CodeGeeX demonstrates competitive performance with state-of-the-art code models of similar or larger scale and is fully open-sourced, supporting integration into major IDEs and widespread use among professional developers (Zheng et al., 2023).
1. Model Architecture and Pre-Training Objective
CodeGeeX employs a decoder-only Transformer with 39 layers, each comprising 40 attention heads and a hidden size of 5,120. The model uses a feed-forward network of dimension 20,480, FastGELU activations, and layer normalization with ε = 1e-5. Positional information is encoded using learnable positional embeddings, supporting sequences of up to 2,048 tokens. Following the decoder stack, a “query” layer aggregates all outputs, succeeded by a linear projection tied to the input token embeddings, yielding a vocabulary of 52,224 tokens.
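As a sanity check, these hyperparameters can be packed into a small configuration sketch whose rough parameter-count heuristic (12·h² per transformer block plus embeddings; the class and function names are mine, not the paper's) lands at the reported 13B scale:

```python
from dataclasses import dataclass

@dataclass
class CodeGeeXConfig:
    # Hyperparameters reported for CodeGeeX-13B (Zheng et al., 2023).
    num_layers: int = 39          # decoder layers; a top "query" layer follows
    num_heads: int = 40
    hidden_size: int = 5120
    ffn_size: int = 20480         # 4 x hidden_size
    vocab_size: int = 52224
    max_seq_len: int = 2048

def approx_param_count(cfg: CodeGeeXConfig) -> float:
    """Rough parameter count in billions: embeddings + ~12*h^2 per block."""
    h = cfg.hidden_size
    embed = cfg.vocab_size * h + cfg.max_seq_len * h  # token + positional
    per_block = 12 * h * h                    # attention (4h^2) + FFN (8h^2)
    blocks = (cfg.num_layers + 1) * per_block # +1 for the top query layer
    return (embed + blocks) / 1e9

print(f"~{approx_param_count(CodeGeeXConfig()):.1f}B parameters")  # ~12.9B
```

The heuristic ignores biases and layer-norm parameters, which contribute well under 1% of the total.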
The model is trained with the standard autoregressive next-token prediction objective:

$$\mathcal{L} = -\sum_{n=1}^{N-1} y_{n+1} \log P(x_{n+1} \mid x_{1:n}; \Theta),$$

where $x_{1:N}$ are the tokenized inputs, $y_{n+1}$ is the one-hot target for the next token, and $\Theta$ denotes the model parameters. This setup directly optimizes for syntactically correct, coherent code generation.
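A minimal, framework-free sketch of this objective (the toy `next_token_loss` helper is illustrative, not the paper's training code) averages the negative log-likelihood of each observed next token:

```python
import math

def next_token_loss(logprobs, targets):
    """Average negative log-likelihood of the observed next tokens.

    logprobs: list of dicts mapping token -> log P(token | prefix)
    targets:  the actual next token at each position
    """
    return -sum(lp[t] for lp, t in zip(logprobs, targets)) / len(targets)

# Toy example: a "model" that puts probability 0.9 on the correct next token.
logprobs = [{"def": math.log(0.9), "if": math.log(0.1)},
            {"foo": math.log(0.9), "bar": math.log(0.1)}]
loss = next_token_loss(logprobs, ["def", "foo"])
print(f"{loss:.4f}")  # 0.1054, i.e. -log(0.9)
```

In practice the same quantity is computed as a cross-entropy over the model's softmax logits, batched over sequences.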
2. Pre-Training Corpus and Tokenization
CodeGeeX is pre-trained on 850 billion tokens, corresponding to more than five epochs over a corpus aggregated from several large-scale sources: The Pile (GitHub repositories with at least 100 stars, 23 languages), CodeParrot (a public Python dataset), and additional targeted GitHub scrapes for Python, Java, and C++ (repositories with ≥1 star; files of 1 KB–100 KB with average line length ≤100 characters, ≥40% letters, and not autogenerated).
A language breakdown (of 158 billion annotated tokens) allocates the majority to C++ (28.5%), Python (26.7%), Java (16.0%), JavaScript (7.1%), C (6.7%), Go (4.7%), and HTML (3.1%). Additional languages—Shell, PHP, CSS, TypeScript, SQL, TeX, Rust, Objective-C, Scala, Kotlin, Pascal, Fortran, R, CUDA, C#, Objective-C++—each constitute less than 2% of the total.
Tokenization adopts GPT-2 BPE with an initial vocabulary of 50,000 tokens, expanded to 52,224 by incorporating “extra-whitespace” tokens <|extratoken_X|>, where X encodes the length of a run of consecutive spaces, enabling compact representation of code indentation. CodeGeeX treats code and natural-language comments identically, ensuring the preservation of meaningful variable and function names.
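A toy illustration of the idea (the run-length-to-token mapping below is an assumption for demonstration; only the token name pattern and the 52,224-token vocabulary figure come from the paper):

```python
import re

def encode_whitespace(text: str, min_run: int = 2) -> str:
    """Replace each run of >= min_run spaces with a single marker token.

    The mapping from run length to token used here is illustrative; the
    actual CodeGeeX vocabulary reserves the id range above 50,000 for
    such <|extratoken_X|> tokens.
    """
    pattern = " {%d,}" % min_run
    return re.sub(pattern, lambda m: f"<|extratoken_{len(m.group(0))}|>", text)

print(encode_whitespace("a" + " " * 4 + "b"))  # a<|extratoken_4|>b
```

Collapsing indentation runs into single tokens shortens tokenized programs considerably, since deeply nested code would otherwise spend many tokens on leading spaces.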
3. HumanEval-X: Multilingual Benchmark Design
To evaluate multilingual code generation and translation, CodeGeeX introduces HumanEval-X, which extends the Python-centric HumanEval dataset to include C++, Java, JavaScript, and Go. Each of the 164 original HumanEval Python problems is manually rewritten in these four additional languages, resulting in 820 problem-solution pairs.
Tasks in HumanEval-X are defined as follows:
- Code Generation: The model receives a function declaration and docstring; the target output is the function body.
- Code Translation: The input is the function declaration in the target language and the canonical solution in the source language (without the docstring); the target is the equivalent solution in the target language.
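These two input formats can be sketched as simple prompt builders; the exact template strings here are illustrative assumptions, not CodeGeeX's verbatim prompts:

```python
def generation_prompt(declaration: str, docstring: str) -> str:
    """Code generation: declaration + docstring; the model completes the body."""
    return f'{declaration}\n    """{docstring}"""\n'

def translation_prompt(src_lang: str, src_solution: str,
                       tgt_lang: str, tgt_declaration: str) -> str:
    """Code translation: source-language solution (no docstring) followed by
    the target-language declaration; the model completes the target body."""
    return (f"code translation\n{src_lang}:\n{src_solution}\n"
            f"{tgt_lang}:\n{tgt_declaration}\n")

prompt = translation_prompt("Python", "def add(a, b):\n    return a + b",
                            "Java", "int add(int a, int b) {")
print(prompt)
```

The generated completion is then appended to the prompt and executed against the problem's test suite to score functional correctness.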
Evaluation utilizes exact functional correctness, measured by the unbiased pass@k estimator:

$$\text{pass@}k = \mathbb{E}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],$$

where $n$ is the number of generated samples, $c$ is the count of passing samples, and $k \le n$ (Zheng et al., 2023).
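The estimator can be computed directly with binomial coefficients; this is the standard unbiased pass@k calculation (the helper name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: samples that pass all tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples, 20 of which pass: pass@1 is simply the pass rate.
print(f"{pass_at_k(200, 20, 1):.2f}")  # 0.10
```

Averaging this quantity over all problems gives the benchmark-level pass@k figures reported in the tables below.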
4. Quantitative Performance and Multilingual Competitiveness
CodeGeeX demonstrates strong performance on code generation tasks when benchmarked against GPT-J-6B, GPT-NeoX-20B, InCoder-6.7B, and CodeGen-Multi-6B/16B. In aggregate, CodeGeeX achieves higher pass@1 and comparable or better pass@100 metrics compared to CodeGen-Multi-16B.
| Language | pass@1 (%) | pass@10 (%) | pass@100 (%) |
|---|---|---|---|
| Python | 22.89 | 39.57 | 60.92 |
| C++ | 17.06 | 32.21 | 51.00 |
| Java | 20.04 | 36.70 | 58.42 |
| JavaScript | 17.59 | 32.28 | 56.33 |
| Go | 14.43 | 25.68 | 47.14 |
| Average | 18.40 | 33.29 | 54.76 |
Allocating samples across languages in proportion to their corpus share (“multilingual budget allocation”) further increases pass@100 on Python from 60.92% to 62.95%. This suggests that spreading a fixed sampling budget across languages, rather than concentrating it in one, improves overall correctness in code generation tasks.
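A sketch of such a proportional allocation (the `allocate_budget` helper and its largest-remainder tie-breaking are illustrative assumptions; the shares come from the corpus breakdown in Section 2):

```python
def allocate_budget(total: int, shares: dict) -> dict:
    """Split a fixed sampling budget across languages in proportion to their
    (normalized) training-corpus share; leftover samples from integer
    truncation go to the languages with the largest shares."""
    norm = sum(shares.values())
    alloc = {lang: int(total * s / norm) for lang, s in shares.items()}
    leftover = total - sum(alloc.values())
    for lang in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        alloc[lang] += 1
    return alloc

# Corpus shares (percent of tokens) from the paper's language breakdown.
shares = {"C++": 28.5, "Python": 26.7, "Java": 16.0,
          "JavaScript": 7.1, "Go": 4.7}
print(allocate_budget(100, shares))
```

The full budget is always spent, and higher-resource languages receive proportionally more samples.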
In code translation, the fine-tuned variant CodeGeeX-13B-FT (trained on XLCoST plus additional Go data) surpasses CodeGen-Multi-16B in 11 of 20 translation pairs. For example: Java→Python achieves pass@100 of 95.13%, Python→Java 85.84%, JavaScript→C++ 89.30%, and Go→Python 93.57%.
5. Integration, User Adoption, and Empirical Utility
CodeGeeX is accessible through extensions for Visual Studio Code, JetBrains IDEs, and Tencent Cloud Studio. Supported functionalities include code completion, code generation, code translation, code explanation, and customizable prompts. Usage statistics since the model’s public release indicate tens of thousands of active users per week, with an average of 250+ API calls per user per workday and a total throughput of approximately 4.7 billion generated tokens per week.
A user survey comprising 168 respondents—spanning students, researchers, and professional developers—reports that 83.4% experience increased coding efficiency. Average satisfaction ratings on a 0–5 scale across ease of use, reliability, features, visuals, and speed range from 4.0 to 4.3.
6. Open-Source Release and Ecosystem Contributions
CodeGeeX, released in September 2022 at https://github.com/THUDM/CodeGeeX, provides full access to model code, pre-trained weights (including INT8-quantized variants), inference APIs, and optimized FasterTransformer kernels, with both MindSpore (Ascend) and PyTorch (NVIDIA) backends. The HumanEval-X dataset and Docker images for benchmarking are also included, along with IDE extension code and example usage.
The project is notable as the first fully open, large-scale (13B) multilingual code generation model with public end-to-end pre-training recipes. The HumanEval-X benchmark enables robust, functionally grounded multilingual evaluation for both generation and translation, addressing the limitations of string similarity-based evaluation methods. A plausible implication is that this openness facilitates further research and the development of new multilingual code intelligence models.
7. Significance and Impact
CodeGeeX establishes a high-water mark for open-access, large-scale, multilingual code generation models. The combination of model scale, broad language coverage, and balanced mixed-language pre-training demonstrably enhances functional code correctness and translation fidelity. Its integration into widely-used development tools and positive empirical feedback suggest practical benefits in developer productivity. The availability of the HumanEval-X benchmark is a key resource, supporting comparative assessment and further methodological advances in multilingual code modeling (Zheng et al., 2023).