PaLM-Coder 540B: Scalable Code Generation
- PaLM-Coder 540B is a large-scale, decoder-only Transformer model optimized for code generation, translation, and repair across 24 programming languages.
- It achieves state-of-the-art results on benchmarks such as HumanEval and MBPP, with significant pass@k improvements from code-specific fine-tuning; total code exposure across pre-training and fine-tuning is 46.8B tokens.
- The model benefits from near-log-linear scaling and high sample efficiency, yet requires careful human oversight to mitigate risks of incorrect or insecure code outputs.
PaLM-Coder 540B is a large-scale code generation LLM derived from the Pathways Language Model (PaLM) 540B checkpoint via extensive code-specific fine-tuning. A densely activated, decoder-only Transformer optimized for source code synthesis, translation, and repair across multiple programming languages, PaLM-Coder 540B demonstrates state-of-the-art performance on a range of code understanding and generation benchmarks, achieving substantial improvements over prior models and strong scaling properties (Chowdhery et al., 2022).
1. Model Architecture
PaLM-Coder 540B adopts a decoder-only Transformer design with a “parallel” residual structure, SwiGLU activations, and rotary positional embeddings (RoPE). LayerNorm is applied prior to both the multi-head (multi-query) attention and the SwiGLU MLP submodules. Each layer computes attention and MLP in parallel, with their outputs summed into the residual stream:

$$y = x + \mathrm{MLP}(\mathrm{LayerNorm}(x)) + \mathrm{Attention}(\mathrm{LayerNorm}(x))$$
The SwiGLU block operates as

$$\mathrm{SwiGLU}(x) = \mathrm{Swish}(xW) \otimes (xV)$$

where $\otimes$ denotes element-wise multiplication and $W$, $V$ are the two input projections of the MLP.
Key architectural hyperparameters are shown below.
| Model | Layers | Hidden dim. | Heads | FFN dim. | Parameters |
|---|---|---|---|---|---|
| PaLM/PaLM-Coder | 118 | 18432 | 48 | 73728 | 540B |
The embedding matrix is shared between the input embeddings and the output softmax layer. Dense kernels and LayerNorms are implemented without bias terms. RoPE is used for its better behavior on long sequences, and the attention head size is fixed at 256. The vocabulary and pre-training mixture are multilingual by design.
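The parallel residual layer and the bias-free SwiGLU MLP can be illustrated with a minimal NumPy sketch; shapes are toy-sized, and the single-head attention below merely stands in for PaLM's multi-query attention with RoPE, which is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Bias-free LayerNorm (scale fixed to 1 in this toy sketch).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swiglu_mlp(x, W, V, W_out):
    # SwiGLU MLP without biases: Swish(xW) * (xV), then project back to model dim.
    h = x @ W
    swish = h * (1.0 / (1.0 + np.exp(-h)))  # Swish / SiLU
    return (swish * (x @ V)) @ W_out

def toy_attention(x, W_qkv, W_o):
    # Single-head self-attention stand-in (no causal mask, no RoPE, no multi-query).
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs @ v) @ W_o

def parallel_block(x, params):
    # "Parallel" layer: attention and MLP both read LayerNorm(x) and are
    # summed into the residual stream, rather than being applied serially.
    h = layer_norm(x)
    return (x
            + toy_attention(h, params["W_qkv"], params["W_o"])
            + swiglu_mlp(h, params["W"], params["V"], params["W_out"]))

d, ffn = 8, 32
rng = np.random.default_rng(0)
params = {
    "W_qkv": rng.normal(size=(d, 3 * d)), "W_o": rng.normal(size=(d, d)),
    "W":     rng.normal(size=(d, ffn)),   "V":   rng.normal(size=(d, ffn)),
    "W_out": rng.normal(size=(ffn, d)),
}
x = rng.normal(size=(4, d))              # 4 toy tokens
print(parallel_block(x, params).shape)   # (4, 8)
```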
2. Training Objective and Optimization
Training uses the standard next-token cross-entropy loss with an auxiliary penalty on the softmax normalizer $Z$ that encourages $\log Z$ to stay close to 0:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + 10^{-4} \cdot \log^2 Z$$
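A small sketch of the auxiliary normalizer penalty, computed per token from raw logits (an illustration of the $10^{-4}\cdot\log^2 Z$ term, not PaLM's actual training code):

```python
import numpy as np

def loss_with_z_penalty(logits, target, z_coeff=1e-4):
    # logits: (vocab,) unnormalized scores for one position; target: gold token id.
    m = logits.max()
    log_z = np.log(np.sum(np.exp(logits - m))) + m   # log of softmax normalizer, stable
    ce = log_z - logits[target]                      # standard cross-entropy for this token
    return ce + z_coeff * log_z ** 2                 # penalty keeps log Z near 0

logits = np.array([2.0, -1.0, 0.5, 0.1])
print(loss_with_z_penalty(logits, target=0))
```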
Optimization employs Adafactor without factorization, which is effectively equivalent to Adam with parameter scaling, using the following hyperparameters (a minimal sketch of the schedule follows this list):
- $\beta_1 = 0.9$ and $\beta_2 = 1 - k^{-0.8}$ at training step $k$
- Learning rate: $10^{-2}$ for the first 10k steps, then decayed as $1/\sqrt{k}$
- Gradient clipping at global norm 1.0
- Dynamic weight decay proportional to $\mathrm{lr}^2$
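A minimal sketch of this schedule, under the assumption that the $1/\sqrt{k}$ decay continues smoothly from the $10^{-2}$ value held through step 10k:

```python
def lr_at_step(k, base_lr=1e-2, hold_steps=10_000):
    # Constant base LR for the first 10k steps, then 1/sqrt(k) decay.
    # At k = 10,000 this equals 1e-2, so the decay continues the held value.
    return base_lr if k <= hold_steps else base_lr * (hold_steps ** 0.5) / (k ** 0.5)

def hyperparams_at_step(k):
    lr = lr_at_step(k)
    return {
        "lr": lr,
        "beta1": 0.9,
        "beta2": 1.0 - k ** -0.8,   # second-moment decay grows with the step number
        "weight_decay": lr ** 2,    # dynamic weight decay proportional to lr^2
        "grad_clip_norm": 1.0,
    }

for k in (1_000, 10_000, 100_000, 255_000):
    print(k, hyperparams_at_step(k))
```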
Training uses sequences of 2048 tokens, packed end to end with a special [eod] token marking document boundaries. The batch size increases stepwise:
- Steps 0–50k: 512 sequences (1M tokens per batch)
- Steps 50k–115k: 1024 sequences (2M)
- Steps 115k–255k: 2048 sequences (4M)

This scheduling allows high-throughput distributed training across 6144 TPU v4 chips with Pathways.
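The tokens-per-batch figures follow directly from the 2048-token sequence length; a quick arithmetic check:

```python
SEQ_LEN = 2048
for start, end, batch in [(0, 50_000, 512), (50_000, 115_000, 1024), (115_000, 255_000, 2048)]:
    tokens = batch * SEQ_LEN
    print(f"steps {start:>7,}-{end:>7,}: {batch} seqs x {SEQ_LEN} = {tokens:,} tokens/batch "
          f"(~{tokens / 2**20:.0f}M)")
```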
3. Code Data Sources and Tokenization
PaLM-Coder is pre-trained on a multilingual language mixture containing 39B code tokens (5% of 780B total) extracted from GitHub, covering 24 programming languages including Java, HTML, JavaScript, Python, PHP, C#, C++, and C. Copyleft-licensed repositories were excluded, and near-duplicate files were filtered using Levenshtein distance.
Tokenization is performed by a 256k-vocabulary SentencePiece model optimized for code and natural language, with explicit preservation of whitespace and single-character digit tokens (e.g., "123.5" tokenized as "1 2 3 . 5"). Out-of-vocabulary Unicode is handled via reversible UTF-8 byte tokens.
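A toy illustration of the digit-splitting convention (this mimics the described behavior only; it is not the actual 256k SentencePiece tokenizer, and it drops the whitespace the real vocabulary preserves):

```python
import re

def split_digits(text):
    # Surround every digit with spaces so "123.5" becomes the tokens 1 2 3 . 5,
    # mirroring the vocabulary's single-character digit tokens.
    return re.sub(r"\d", lambda m: f" {m.group(0)} ", text).split()

print(split_digits("x = 123.5"))  # ['x', '=', '1', '2', '3', '.', '5']
```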
Fine-tuning ("PaLM-Coder" variant) is conducted in two stages (~7.75B tokens total, 5.9B Python):
- Stage 1 (6.5B tokens): 60% ExtraPythonData, 30% diverse code, 10% natural language
- Stage 2 (1.9B tokens): additional Python data

After pre-training and fine-tuning, the total code exposure is 46.8B tokens.
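The reported Python and total code token counts are consistent with the stage mixtures above; a rough check, assuming stage 2 is entirely Python:

```python
stage1_tokens, stage2_tokens = 6.5e9, 1.9e9
stage1_python = 0.60 * stage1_tokens          # ExtraPythonData share of stage 1
stage1_code = 0.90 * stage1_tokens            # 60% Python + 30% other code
python_total = stage1_python + stage2_tokens  # stage 2 assumed all Python
code_total = stage1_code + stage2_tokens

print(f"Python fine-tune tokens: ~{python_total / 1e9:.1f}B")            # ~5.8B (reported ~5.9B)
print(f"Code fine-tune tokens:   ~{code_total / 1e9:.2f}B")              # ~7.75B
print(f"Total code exposure:     ~{(39e9 + code_total) / 1e9:.1f}B")     # ~46.8B
```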
4. Benchmark Evaluation
Performance is primarily reported as pass@k—the fraction of code synthesis problems solved by at least one of k generated samples. Comparative results are summarized below for key code tasks.
| Task/Metric | PaLM 540B (pre-trained) | Codex 12B / Davinci Codex | PaLM-Coder 540B (fine-tuned) |
|---|---|---|---|
| HumanEval pass@100 | 76.2% | 72.3% / 81.7% | 88.4% |
| MBPP (3-shot) pass@80 | 75.0% | 84.4% | 80.8% |
| TransCoder pass@25 | 79.8% | 71.7% | 82.5% |
| HumanEval (0-shot) pass@1 | 26.2% | 36.0% | 36.0% |
| MBPP (3-shot) pass@1 | 36.8% | 50.4% | 47.0% |
| GSM8K-Python (8-shot) | 51.3% | 32.1% | 50.9% |
| DeepFix (compile rate) | 73.7% | 81.1% | 82.1% |
PaLM-Coder 540B achieves state-of-the-art pass@100 performance on HumanEval (88.4%) and outperforms Codex 12B and Davinci Codex across a range of code completion and translation tasks. Notably, on 0-shot HumanEval pass@1, it reaches parity with Davinci Codex at 36.0%.
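For reference, pass@k is usually computed with the unbiased estimator introduced with Codex (Chen et al., 2021): for each problem, n samples are generated, c of them pass the tests, and the estimator gives the probability that at least one of k randomly chosen samples passes.

```python
import numpy as np

def pass_at_k(n, c, k):
    # Unbiased estimator 1 - C(n - c, k) / C(n, k), computed stably as a product.
    # n: samples generated for the problem, c: samples that pass, k: evaluation budget.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem; the two problems have 10 and 0 passing samples.
per_problem = [pass_at_k(200, 10, 100), pass_at_k(200, 0, 100)]
print(np.mean(per_problem))  # dataset-level pass@100
```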
5. Qualitative Capabilities and Analysis
PaLM-Coder 540B demonstrates diverse, high-quality code generation:
- Python function synthesis from docstrings and signatures, producing code that passes reference tests (illustrated after this list)
- Idiomatic translation from C++ to Python (TransCoder)
- C code repair with automated bug-fixing and minimal stylistic edits (DeepFix)
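For illustration, a HumanEval-style synthesis task supplies a signature and docstring and scores the completion against hidden tests; the following is a hypothetical example in that style, not an actual benchmark problem or model output:

```python
# Prompt given to the model: signature + docstring only.
def running_maximum(xs):
    """Return a list where element i is the maximum of xs[0..i].

    >>> running_maximum([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # A correct completion that the reference tests would accept:
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out

assert running_maximum([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```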
Strengths include robust zero-/one-/few-shot generalization, consistent gains from scaling, and reliable handling of boilerplate and compilation errors. Fine-tuning on 7.8B code tokens boosts HumanEval pass@100 by roughly 12 percentage points (76.2% → 88.4%), indicating substantial benefit from task-specific adaptation.
However, generated code is not guaranteed to be correct or secure: code that compiles may still be logically wrong or introduce vulnerabilities. The model is also sensitive to prompting, with small prompt changes producing variations in both style and content. Risks include dataset poisoning and memorization attacks, so human review and testing remain necessary for safe deployment. Larger codebase modifications tend to be produced only when such edits are demonstrated in the prompt.
6. Scaling Effects and Sample Efficiency
Model scaling from 8B to 62B to 540B parameters yields near-log-linear pass@k improvement, with the largest gains observed between 62B and 540B:
- HumanEval (pre-train, pass@100): 14% → 26% → 76%
- MBPP (pass@80): 14.8% → 36.8% → 75.0%

Fine-tuning then brings the 540B checkpoint to 88.4% on HumanEval.
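The near-log-linear trend can be roughly quantified by regressing pass rate on log parameter count using the pre-training HumanEval numbers above (an illustrative fit, not an analysis from the source):

```python
import numpy as np

params_b = np.array([8, 62, 540])                  # model sizes in billions of parameters
humaneval_pass100 = np.array([14.0, 26.0, 76.0])   # pre-training-only pass@100 (%)

slope, intercept = np.polyfit(np.log10(params_b * 1e9), humaneval_pass100, 1)
print(f"~{slope:.1f} points of pass@100 per 10x increase in parameters")
```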
The model family also shows high code sample efficiency: the pre-training-only PaLM 540B model, trained on just 2.7B Python tokens, matches Codex 12B fine-tuned on 100B Python tokens. This suggests that increased parameter count effectively transfers knowledge across languages, a scaling advantage not solely attributable to the volume of in-language data.
7. Limitations and Ethical Considerations
The model exhibits several practical and security limitations. Compilation does not imply functional correctness or security; model outputs require thorough testing and manual inspection before use in production environments. Prompt sensitivity may lead to inconsistent output quality or style, and the potential for memorization and dataset poisoning underscores the importance of secure, representative training distributions. The release includes analyses of bias, toxicity, and memorization with proposed mitigation strategies, emphasizing the need for ongoing monitoring and responsible use of large-scale code generation systems (Chowdhery et al., 2022).
PaLM-Coder 540B thus represents a highly scalable, multilingual code modeling system that delivers strong performance and transfer properties, while requiring careful user oversight and ethical consideration for production deployment.