PaLM 540B: Scalable Language Model

Updated 24 February 2026

PaLM 540B is a dense, decoder-only Transformer with 540B parameters and 118 layers; it uses SwiGLU activations and multi-query attention to improve efficiency and performance.
The model is trained on 780B tokens across 124 languages and diverse data sources, achieving robust few-shot and zero-shot results in language understanding and code tasks.
Extensions like U-PaLM 540B and PaLM-E demonstrate state-of-the-art improvements in reasoning, code generation, and multimodal applications including robotic manipulation.

The PaLM 540B Parameter Model refers to the Pathways Language Model (PaLM) architecture at 540 billion parameters, one of the largest and most capable dense, decoder-only Transformer language models at the time of its publication in 2022. Developed at Google and trained with the Pathways system, the model achieved state-of-the-art results across language understanding, generation, reasoning, and code tasks. PaLM forms the foundation for further advances in scaling, adaptation, and multimodal integration.

1. Architectural Specification

PaLM 540B is structured as a dense, decoder-only Transformer that scales both depth and width. The model consists of 118 Transformer layers with a hidden (model) dimension of 18,432. Each layer uses 48 attention heads of size 256, giving a total attention width of 12,288 ($48 \times 256$), which is smaller than the hidden dimension. The intermediate feed-forward dimension per layer is 73,728 ($4\times d_{model}$). Key architectural modifications include:

SwiGLU activations in MLP blocks instead of standard GELU or ReLU, providing improved parameter and compute efficiency.
Parallel layer computation, in which the MLP and attention branches read the same LayerNorm output and are summed into the residual stream: $y = x + \text{MLP}(\text{LN}(x)) + \text{Attention}(\text{LN}(x))$.
Multi-Query Attention (MQA): sharing key and value projections across attention heads to accelerate autoregressive decoding.
Rotary Position Embeddings (RoPE): used in place of absolute or relative position embeddings for better performance on long sequences.
Shared input and output embeddings over a 256K-token SentencePiece vocabulary, with no bias terms.
Total parameter count is 540.35 billion; this includes all trainable weights such as attention, feed-forward, embeddings, projections, and auxiliary layers (Chowdhery et al., 2022).
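The dimensions listed above are enough for a back-of-the-envelope check of the parameter total. The sketch below is an illustration, not the authors' accounting: it assumes that SwiGLU uses three $d_{model} \times d_{ff}$ matrices, that multi-query attention keeps a single shared key head and a single shared value head, that each block has one LayerNorm scale vector, and that the vocabulary has exactly 256,000 entries.

```python
# Rough parameter count from the dimensions quoted in this section.
# All structural choices below are assumptions made for illustration.

D_MODEL = 18_432   # hidden (model) dimension
N_LAYERS = 118     # Transformer blocks
N_HEADS = 48       # query heads per block
D_HEAD = 256       # size of each attention head
D_FF = 73_728      # feed-forward width (4 * D_MODEL)
VOCAB = 256_000    # assumed exact vocabulary size


def attention_params() -> int:
    q_proj = D_MODEL * N_HEADS * D_HEAD    # one query projection per head
    kv_proj = 2 * D_MODEL * D_HEAD         # one shared K head and one shared V head (MQA)
    out_proj = N_HEADS * D_HEAD * D_MODEL  # projection back to the residual width
    return q_proj + kv_proj + out_proj


def mlp_params() -> int:
    # SwiGLU: gate, up, and down projections, all without biases.
    return 3 * D_MODEL * D_FF


def total_params() -> int:
    per_layer = attention_params() + mlp_params() + D_MODEL  # + LayerNorm scale
    embeddings = VOCAB * D_MODEL  # shared between input and output
    return N_LAYERS * per_layer + embeddings


if __name__ == "__main__":
    print(f"per-layer parameters: {attention_params() + mlp_params():,}")
    print(f"total parameters: {total_params() / 1e9:.2f}B")
```

Under these assumptions the estimate lands within a small fraction of a percent of the reported 540.35 billion, with the feed-forward matrices accounting for most of each layer.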

2. Training Data, Tokenization, and Infrastructure

PaLM 540B was trained on 780 billion tokens in a single epoch. The data distribution is: 50% multilingual social media conversations, 27% filtered web pages, 13% English books, 5% GitHub code (24 programming languages), 4% Wikipedia, and 1% English news. The corpus covers 124 natural languages, with English comprising approximately 78% of total tokens. Tokenization uses a lossless, whitespace-preserving SentencePiece model with a 256,000-entry vocabulary; numbers are split into individual digit tokens, and out-of-vocabulary Unicode characters fall back to UTF-8 bytes.
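As a quick sanity check on the mixture, the percentages can be turned into approximate per-source token budgets. This is a minimal sketch using only the figures quoted above; the dictionary keys and the function name are illustrative.

```python
# Approximate token budget per data source, from the 780B total and the
# percentages quoted in the text. Names are illustrative only.
TOTAL_TOKENS = 780e9

MIXTURE_PERCENT = {
    "social media conversations": 50,
    "filtered web pages": 27,
    "books": 13,
    "GitHub code": 5,
    "Wikipedia": 4,
    "news": 1,
}


def tokens_per_source(total: float, mixture: dict) -> dict:
    """Split a total token count according to integer percentages."""
    if sum(mixture.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total * pct / 100 for name, pct in mixture.items()}


if __name__ == "__main__":
    for name, count in tokens_per_source(TOTAL_TOKENS, MIXTURE_PERCENT).items():
        print(f"{name:28s} {count / 1e9:6.1f}B tokens")
```

The six shares sum to exactly 100%, so the code portion, for example, corresponds to about 39B tokens.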

Training utilized Google’s Pathways ML system, spanning two TPU v4 Pods with 3072 chips each (6144 chips total). Within each pod, 12-way model parallelism was combined with 256-way fully sharded data parallelism ($12 \times 256 = 3072$ chips) in a 2D GSPMD setup. The training run achieved 97% weak scaling; model FLOPs utilization (MFU) was 46.2% of peak FLOPs (Chowdhery et al., 2022).
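The chip layout and the utilization metric can be made concrete with a short sketch. It assumes MFU means model FLOPs utilization and uses the common approximation of $6N$ training FLOPs per token for an $N$-parameter dense model, which ignores attention FLOPs. The throughput and the per-chip peak of 275 TFLOP/s in the usage example are assumed inputs for illustration, not figures from the text.

```python
# Chip layout and a simplified model-FLOPs-utilization (MFU) estimate.
MODEL_PARALLEL = 12
DATA_PARALLEL = 256
N_PODS = 2

chips_per_pod = MODEL_PARALLEL * DATA_PARALLEL
total_chips = N_PODS * chips_per_pod


def model_flops_utilization(tokens_per_sec: float,
                            n_params: float,
                            n_chips: int,
                            peak_flops_per_chip: float) -> float:
    """MFU under the 6 * N FLOPs-per-token approximation (attention ignored)."""
    achieved_flops = 6.0 * n_params * tokens_per_sec
    peak_flops = n_chips * peak_flops_per_chip
    return achieved_flops / peak_flops


if __name__ == "__main__":
    # Hypothetical throughput and assumed per-chip peak, for illustration only.
    mfu = model_flops_utilization(
        tokens_per_sec=200_000,
        n_params=540e9,
        n_chips=total_chips,
        peak_flops_per_chip=275e12,
    )
    print(f"chips: {total_chips}, MFU: {mfu:.1%}")
```

Because the approximation omits attention FLOPs, it slightly understates MFU relative to a full accounting at the same throughput.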

3. Optimization and Scaling

No dropout was used during pretraining; finetuning adopted a dropout rate of 0.1. Optimization employed Adafactor (equivalent to Adam with parameter scaling), global-norm gradient clipping at 1.0, and a learning rate schedule with an initial constant phase followed by inverse square root decay:

Learning rate: $10^{-2}$ for the first 10,000 steps, then $1/\sqrt{k}$ at step $k$ (which equals $10^{-2}$ at $k = 10{,}000$, so the schedule is continuous).
Momentum: $β_1 = 0.9$ , $β_2(k) = 1 - k^{-0.8}$ .
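Weight decay: $\text{lr}^2$ per step, so it tracks the current learning rate.
Auxiliary loss: $z_{\text{loss}} = 10^{-4}\cdot(\log Z)^2$, where $Z$ is the softmax normalizer; it keeps $\log Z$ close to 0 for stability.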
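The schedule and auxiliary terms listed above can be written out as plain functions. This is a minimal sketch, assuming the decay phase follows $1/\sqrt{k}$ (as described by Chowdhery et al., 2022, and continuous with $10^{-2}$ at step 10,000) and that steps are counted from 1. The function names are illustrative and do not come from any particular optimizer library.

```python
import math

BASE_LR = 1e-2
CONSTANT_STEPS = 10_000


def learning_rate(step: int) -> float:
    """Constant for the first 10,000 steps, then inverse square root decay."""
    if step <= CONSTANT_STEPS:
        return BASE_LR
    return 1.0 / math.sqrt(step)


def beta2(step: int) -> float:
    """Second-moment decay: 1 - k^(-0.8), for step k >= 1."""
    return 1.0 - step ** -0.8


def weight_decay(step: int) -> float:
    """Dynamic weight decay equal to the squared learning rate."""
    return learning_rate(step) ** 2


def z_loss(logits, coeff: float = 1e-4) -> float:
    """coeff * (log Z)^2, with log Z computed by a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coeff * log_z ** 2


if __name__ == "__main__":
    for k in (1, 10_000, 40_000, 250_000):
        print(k, learning_rate(k), round(beta2(k), 6), weight_decay(k))
```

Note that `beta2` starts at 0 and approaches 1 as training proceeds, so early second-moment estimates are dominated by the most recent gradients.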

The batch size was increased in stages from roughly 1.05M to 4.2M tokens per step. The maximum context length is 2048 tokens; examples are concatenated, separated by end-of-document markers, and split into fixed-length sequences with no padding.
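The padding-free input format can be illustrated with a small packing routine. This is a sketch under stated assumptions rather than the actual PaLM data pipeline: `pack_documents` and `eod_id` are hypothetical names, and dropping the incomplete final chunk is a simplification chosen here.

```python
from typing import Iterable, List


def pack_documents(docs: Iterable[List[int]],
                   seq_len: int,
                   eod_id: int) -> List[List[int]]:
    """Concatenate tokenized documents, each followed by an end-of-document
    marker, then cut the stream into full sequences of seq_len tokens."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eod_id)
    n_full = len(stream) // seq_len  # the trailing partial chunk is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6]]
    for seq in pack_documents(toy_docs, seq_len=4, eod_id=0):
        print(seq)
```

With this scheme a single training sequence can span several documents, and a long document can span several sequences, which is why no padding tokens are needed.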

4. Empirical Performance and Scaling Law Analysis

PaLM 540B demonstrated significant advancement in few-shot and zero-shot performance over previous dense large LMs. Benchmark results include:

English NLP: Across 29 1-shot tasks, PaLM 540B exceeds prior SOTA on most of them (24 of 29, per Chowdhery et al., 2022), with scores including 81.4% (TriviaQA) and 88.7% (BoolQ); the finetuned SuperGLUE average is 92.6%.
MMLU (5-shot, 57 tasks): Average accuracy 69.3%, surpassing Chinchilla 70B at 67.5%.
BIG-bench: 5-shot average across 150 tasks is ~49, outperforming the human average of 47. Discontinuous (“emergent”) improvements of >+10% are observed on 25% of tasks when scaling from 62B to 540B.
Reasoning/CoT: 58% accuracy on GSM8K with chain-of-thought (CoT) prompting, and SOTA results across arithmetic and commonsense benchmarks.
Code generation: HumanEval pass@100 of 76.2% and MBPP pass@80 of 75.0%. Few-shot prompting also yields competitive BLEU scores on translation.
Multilingual QA: Finetuned TyDiQA-GoldP EM score of 80.0%; 1-shot and few-shot performance lags finetuned SOTA but remains competitive.
Average NLU (21 tasks) and NLG (8 tasks) 1-shot: 74.7% and 63.9%, respectively (Chowdhery et al., 2022).
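The code-generation figures above are pass-at-$k$ rates: the probability that at least one of $k$ sampled programs passes the unit tests. A minimal sketch of the unbiased estimator commonly used for HumanEval-style evaluation follows, where $n$ samples are drawn per problem and $c$ of them pass. Whether the PaLM numbers were computed with exactly this estimator is not stated here, so treat it as an illustration of the metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples, c correct.

    Uses 1 - C(n - c, k) / C(n, k): one minus the probability that a random
    size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Toy numbers: 200 samples for one problem, 5 of which pass.
    for k in (1, 10, 100):
        print(k, round(pass_at_k(200, 5, k), 4))
```

The benchmark score is the mean of this quantity over all problems.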

Scaling analysis shows approximately log-linear gains with model size and data. PaLM 540B also exhibits a moderate memorization rate (2.4% exact match on 50-token training spans) and toxicity that rises with model size while remaining below the human baseline.

5. Adaptation with UL2R (U-PaLM 540B)

The U-PaLM 540B variant adapts PaLM 540B using UL2R, a “mixture-of-denoisers” objective combining prefix-LM and span corruption, conferring true non-causal infilling capabilities and further improving scaling curves. Only 0.16% extra compute (~1.3B tokens, 20,000 steps on 512 TPUv4 chips) is used on the original data mix.
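The scale of the UL2R stage relative to pretraining can be checked with simple arithmetic. The sketch below uses the token ratio as a proxy for the compute ratio and assumes the same 2048-token context as PaLM pretraining; both are assumptions made for illustration.

```python
# Relative cost of the UL2R stage, using token counts as a proxy for compute.
PALM_TOKENS = 780e9   # pretraining tokens
UL2R_TOKENS = 1.3e9   # approximate additional tokens
UL2R_STEPS = 20_000
CONTEXT_LEN = 2048    # assumption: same context length as pretraining

extra_fraction = UL2R_TOKENS / PALM_TOKENS
tokens_per_step = UL2R_TOKENS / UL2R_STEPS
sequences_per_step = tokens_per_step / CONTEXT_LEN

if __name__ == "__main__":
    print(f"extra tokens: {extra_fraction:.2%} of pretraining")
    print(f"tokens per step: {tokens_per_step:,.0f}")
    print(f"sequences per step: {sequences_per_step:.1f}")
```

The ratio comes out just under 0.17%, consistent with the quoted figure given that 1.3B is itself rounded, and the implied batch is only a few dozen sequences per step, far smaller than the pretraining batch.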

Empirical results indicate:

Compute Savings: U-PaLM achieves the final PaLM 540B performance with roughly half the compute (2.53 × 10³ zFLOPs PaLM vs 1.08 × 10³ zFLOPs U-PaLM; 66.5 vs 69.4 average zero/few-shot task score).
Benchmarks: U-PaLM 540B surpasses baseline on MMLU (70.7% vs 69.3%), BIG-Bench emergent tasks (+3.4 absolute), reasoning (e.g., GSM8K, 58.5% vs 54.9%), and exhibits qualitative improvements in infilling and flexible prompting (Tay et al., 2022).

6. Multimodal and Embodied Extensions

PaLM 540B serves as the LLM backbone in PaLM-E, an embodied multimodal variant (PaLM-E-562B) integrating a Vision Transformer (ViT-22B, 22B parameters) and state-vector encoders. PaLM-E processes tokens from interleaved text, images, and continuous state vectors, embedded via learned projections into a unified token space. The system is trained end-to-end—either by freezing PaLM and updating encoders, or via full finetuning—on a data mixture from web, vision-language, and robotics domains.
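The interleaving idea can be sketched in a few lines: text tokens are looked up in the language model's embedding table, continuous observations are mapped by a learned linear projection into the same space, and the results form a single input sequence. Everything below is a toy illustration with hypothetical names (`project`, `build_input`) and tiny dimensions; it is not the PaLM-E implementation.

```python
from typing import Dict, List, Sequence, Union

Vector = List[float]
Segment = Union[int, Sequence[float]]  # a token id, or a continuous feature vector


def project(features: Sequence[float], weight: Sequence[Sequence[float]]) -> Vector:
    """Linear map from encoder feature space to the LM embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def build_input(segments: Sequence[Segment],
                token_table: Dict[int, Vector],
                weight: Sequence[Sequence[float]]) -> List[Vector]:
    """Turn an interleaved list of token ids and feature vectors into one
    sequence of embeddings, preserving the original order."""
    sequence: List[Vector] = []
    for seg in segments:
        if isinstance(seg, int):
            sequence.append(list(token_table[seg]))  # ordinary text token
        else:
            sequence.append(project(seg, weight))    # image or state features
    return sequence


if __name__ == "__main__":
    token_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}    # toy 2-d embeddings
    weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # toy 3-d to 2-d projection
    print(build_input([0, [2.0, 3.0, 4.0], 1], token_table, weight))
```

In the frozen-PaLM setting described above, only parameters such as `weight` (and the encoders producing the features) would be updated, while the embedding table and the language model stay fixed.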

Key results for PaLM-E-562B include:

Robotic Manipulation: TAMP VQA and planning (q₂=98.2%, plan feasibility=93.7%, pick-and-place=82.5%), tabletop block-pushing (Task 1: ~90% success), and affordance/failure prediction (F1 ≈ 0.91).
Vision-Language: Zero-shot VQA v2 (80.0%), OK-VQA (66.1%—SOTA for single model), COCO CIDEr=138.7.
Language Tasks: NLU retention 100.4%, NLG retention 96.2% relative to PaLM 540B (no catastrophic forgetting) (Driess et al., 2023).

PaLM-E exemplifies direct extension of large unimodal LMs into scalable, generalist multimodal and embodied reasoning agents by unified token interleaving and prefix encoding.

7. Limitations, Bias, and Ethical Considerations

PaLM 540B exhibits measurable bias and toxicity, e.g., higher anti-Muslim bias and increased toxicity in certain demographic prompts, with moderate mitigation via prompt filtering, output monitoring, and bloom-filter blocking during inference. Memorization, especially of repetitive or formulaic content, increases sublinearly with model scale. The model, as a highly general pretrained system, is not tailored for sensitive/safety-critical tasks without further domain-specific mitigation and caution.

Open challenges include optimal data/model scaling trade-offs, the impact of data quality and freshness, effective sparsification (Mixture-of-Experts), and incorporating retrieval or modularity within the Pathways framework. As a reference model, PaLM 540B catalyzes ongoing research into both the fundamental limits of scaling and practical methodologies for efficient adaptation, alignment, and multimodal grounding (Chowdhery et al., 2022; Tay et al., 2022; Driess et al., 2023).