CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model (2310.06266v2)
Abstract: Code large language models (Code LLMs) have gained significant attention in industry due to their wide applications across the full lifecycle of software engineering. However, how well existing models understand non-English inputs for multi-lingual code-related tasks remains far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-X, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from Ant Group's software development process, where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs of similar parameter size. In practical scenarios such as code generation, code translation, code commenting, and test case generation, CodeFuse outperforms other models when given Chinese prompts.
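The reported pass@1 figure presumably follows the standard unbiased pass@k estimator used by the HumanEval evaluation protocol (Chen et al., 2021). The sketch below shows that estimator in Python; the sample counts in the usage example are illustrative assumptions, not the paper's raw data.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem, c of them correct.

    pass@k = 1 - C(n-c, k) / C(n, k), computed as a running product for
    numerical stability.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative example (assumed numbers): 200 samples per problem,
# 74 passing the unit tests, gives a per-problem pass@1 estimate of 0.37.
print(round(pass_at_k(n=200, c=74, k=1), 4))
```

The benchmark-level score is then the mean of this per-problem estimate over all HumanEval tasks.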