
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages (2212.06742v2)

Published 13 Dec 2022 in cs.CL, cs.LG, cs.PL, and cs.SE

Abstract: Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained LLM for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual NL or PL; and pivot-based translation language modeling, which relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of code-intelligence end tasks, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage in zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
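The abstract names two pre-training objectives but does not spell them out. As a rough illustration of the first, a T5-style span-corruption objective replaces random contiguous spans with sentinel tokens and asks the model to reconstruct them. The sketch below is not the authors' implementation; the corruption rate, mean span length, and sentinel format are illustrative assumptions only.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """T5-style span corruption sketch: drop random contiguous spans from
    the input, replace each span with a sentinel token, and emit a target
    that lists the sentinels followed by the dropped tokens.
    corruption_rate and mean_span_len are placeholder values, not the
    paper's reported hyperparameters."""
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))

    # Pick token positions to corrupt, grown in short contiguous spans.
    masked = set()
    while len(masked) < num_to_mask:
        start = rng.randrange(n)
        span = rng.randint(1, 2 * mean_span_len - 1)
        masked.update(range(start, min(n, start + span)))

    source, target, sentinel = [], [], 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:  # open a new corrupted span
                source.append(f"<extra_id_{sentinel}>")
                target.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            target.append(tok)
            prev_masked = True
        else:
            source.append(tok)
            prev_masked = False
    target.append(f"<extra_id_{sentinel}>")  # closing sentinel, T5 convention
    return source, target

# Example on a mixed NL/PL-style token sequence:
src, tgt = span_corrupt("def add ( a , b ) : return a + b".split())
print(src)  # input with sentinels in place of dropped spans
print(tgt)  # sentinel-delimited spans the model must reconstruct
```

The second objective, pivot-based translation language modeling, instead trains on concatenated parallel NL-PL (or NL-NL) pairs so that English and code can serve as pivots between languages; its exact formulation is given in the paper rather than here.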

Authors (6)
  1. Yekun Chai (18 papers)
  2. Shuohuan Wang (30 papers)
  3. Chao Pang (23 papers)
  4. Yu Sun (226 papers)
  5. Hao Tian (146 papers)
  6. Hua Wu (191 papers)
Citations (32)