Pathways Language Model (PaLM)
- PaLM is a large-scale, decoder-only Transformer model that uses dense activation and the Pathways system to achieve effective few-shot learning and transfer.
- It scales up to 540B parameters, employing advanced optimization and training techniques to excel in multilingual, reasoning, and code generation tasks.
- Extensions like PaLM-E and AudioPaLM demonstrate its adaptability to embodied and audio modalities while addressing ethical, bias, and memorization challenges.
The Pathways Language Model (PaLM) is a family of large-scale, densely activated, decoder-only Transformer models developed and trained with Google's Pathways system. The original PaLM (PaLM-1) was introduced to systematically explore the impact of scale on few-shot learning and transfer in language modeling, and was subsequently extended to multiple modalities and language settings. With up to 540 billion parameters trained on 780 billion tokens, PaLM demonstrated state-of-the-art results across a broad range of natural language processing, reasoning, multilingual, and code generation tasks, and paved the way for later variants incorporating embodied perception (PaLM-E) and audio modalities (AudioPaLM) (Chowdhery et al., 2022, Driess et al., 2023, Rubenstein et al., 2023, Anil et al., 2023).
1. Model Architecture and Design
PaLM adopts a standard decoder-only Transformer architecture built from a stack of identical blocks, with several empirically tuned modifications for training stability and computational efficiency at massive scale. Each layer consists of:
- Multi-head self-attention with rotary positional embeddings (RoPE) and multi-query attention, in which a single key/value projection is shared across all query heads. For each head:

  $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

  where the head size $d_k = d_{\text{model}} / n_{\text{heads}}$.
- A feed-forward network using the SwiGLU activation:

  $$\mathrm{SwiGLU}(x) = \mathrm{Swish}(xW) \otimes xV,$$

  followed by a linear projection back to $d_{\text{model}}$.
- A "parallel" residual formulation, $y = x + \mathrm{Attn}(\mathrm{LayerNorm}(x)) + \mathrm{FFN}(\mathrm{LayerNorm}(x))$, with no bias terms in any linear or normalization layer (see the sketch after this list).
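A minimal NumPy sketch of such a block, assuming the parallel formulation above; the attention function is stubbed out, and names such as `swiglu_ffn` and `parallel_block` are illustrative rather than from the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Bias-free layer normalization (PaLM omits bias terms throughout).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def swiglu_ffn(x, W, V, W_out):
    # SwiGLU(x) = Swish(xW) * xV, then project back to d_model.
    return (swish(x @ W) * (x @ V)) @ W_out

def parallel_block(x, attn_fn, W, V, W_out):
    # Parallel formulation: y = x + Attn(LN(x)) + FFN(LN(x)),
    # rather than the usual sequential residual arrangement.
    h = layer_norm(x)
    return x + attn_fn(h) + swiglu_ffn(h, W, V, W_out)

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W, V, W_out = (0.02 * rng.normal(size=s)
               for s in [(d_model, d_ff), (d_model, d_ff), (d_ff, d_model)])
x = rng.normal(size=(4, d_model))        # a 4-token sequence
y = parallel_block(x, attn_fn=lambda h: np.zeros_like(h), W=W, V=V, W_out=W_out)
```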
Key hyperparameters for principal PaLM variants include:
| Model | Layers | d_model | Heads | Params |
|---|---|---|---|---|
| PaLM-8B | 32 | 4,096 | 16 | ~8B |
| PaLM-62B | 64 | 8,192 | 32 | ~62B |
| PaLM-540B | 118 | 18,432 | 48 | 540B |
The SentencePiece vocabulary consists of 256k whitespace-preserving tokens with lossless reversibility (Chowdhery et al., 2022).
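As a rough sanity check on the table, the classic $\approx 12 \cdot n_{\text{layers}} \cdot d_{\text{model}}^2$ estimate for non-embedding Transformer parameters can be applied; this rule of thumb predates SwiGLU and multi-query attention, so it only approximates PaLM's exact counts:

```python
# Ballpark non-embedding parameter count: ~12 * layers * d_model^2.
# This classic dense-Transformer estimate ignores PaLM's three-matrix
# SwiGLU FFN and multi-query attention, so it only approximates the totals.
configs = {"PaLM-8B": (32, 4096), "PaLM-62B": (64, 8192), "PaLM-540B": (118, 18432)}
for name, (layers, d_model) in configs.items():
    approx = 12 * layers * d_model ** 2
    print(f"{name}: ~{approx / 1e9:.0f}B non-embedding params")
# PaLM-8B: ~6B, PaLM-62B: ~52B, PaLM-540B: ~481B; embeddings (256k vocab)
# and the SwiGLU FFN account for much of the gap to the official figures.
```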
2. Pathways Infrastructure and Scaling
PaLM was the first LLM to fully leverage Google's Pathways system for scalable and efficient distributed training. Pathways enables two-way pod-level data parallelism, distributing training over two TPU v4 pods of 3,072 chips each (6,144 total). Within each pod, 12-way model parallelism is combined with 256-way fully sharded data parallelism, avoiding the need for pipeline parallelism altogether.
- At each training step, the client dispatches one half-batch to each pod; forward and backward passes are computed independently, gradients are reduced locally within each pod, and then all-reduced across pods over the datacenter network.
- Optimizer updates are applied in lockstep so that the two parameter replicas remain bitwise identical (a toy simulation follows below).
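A toy NumPy simulation of the two-pod step described above; the linear model, learning rate, and batch split are illustrative stand-ins for the real Pathways machinery:

```python
import numpy as np

def grad(w, x, y):
    # Gradient of the toy loss 0.5 * ||x @ w - y||^2.
    return x.T @ (x @ w - y)

rng = np.random.default_rng(0)
w_pod_a = rng.normal(size=(4, 1))
w_pod_b = w_pod_a.copy()                    # two identical parameter replicas
x, y = rng.normal(size=(16, 4)), rng.normal(size=(16, 1))

for step in range(3):
    xa, ya = x[:8], y[:8]                   # client sends one half-batch
    xb, yb = x[8:], y[8:]                   # to each pod
    ga = grad(w_pod_a, xa, ya)              # independent forward/backward
    gb = grad(w_pod_b, xb, yb)
    g = (ga + gb) / 2                       # cross-pod gradient all-reduce
    w_pod_a -= 0.01 * g                     # lockstep updates keep the
    w_pod_b -= 0.01 * g                     # replicas bitwise identical
    assert np.array_equal(w_pod_a, w_pod_b)
```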
Model FLOPs Utilization (MFU) is introduced as a hardware-independent efficiency metric, relating observed throughput to the theoretical peak of the system:

$$\mathrm{MFU} = \frac{(\text{observed tokens/sec}) \times (\text{model FLOPs per token})}{\text{theoretical peak FLOPs/sec}}.$$
With rematerialization, PaLM-540B reaches 46.2% MFU (and 57.8% Hardware FLOPs Utilization), outpacing prior 500B+ models (Chowdhery et al., 2022).
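Plugging the reported utilization back through this definition gives a feel for absolute throughput. In the sketch below, the 6N FLOPs-per-token approximation and the 275 TFLOP/s bf16 peak per TPU v4 chip are outside assumptions, not figures from this section:

```python
# Invert the MFU definition to recover throughput from reported utilization.
n_params        = 540e9
chips           = 6144
peak_flops      = chips * 275e12      # assumed bf16 peak per TPU v4 chip
flops_per_token = 6 * n_params        # common forward+backward approximation

mfu = 0.462                           # reported model FLOPs utilization
tokens_per_sec = mfu * peak_flops / flops_per_token
print(f"~{tokens_per_sec / 1e3:.0f}k tokens/sec")   # roughly 241k tokens/sec
```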
3. Data, Training Regime, and Stability
PaLM was trained on 780B tokens drawn from more than 100 languages. The mixture comprises:
- 50% multilingual social-media conversations,
- 27% quality-filtered web pages,
- 13% English books,
- 5% deduplicated open-source GitHub code,
- 4% Wikipedia (multilingual),
- 1% English news.
Training sequence length is fixed at 2,048 tokens; documents are concatenated, separated by a dedicated [eod] token, and sliced into full-length examples with no padding (see the packing sketch below).
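A minimal sketch of that packing scheme; the token ids and the `eod_id` value are illustrative:

```python
def pack_examples(docs, seq_len=2048, eod_id=1):
    # Concatenate tokenized documents separated by [eod], then slice the
    # stream into fixed-length training sequences with no padding.
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eod_id)
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

packed = pack_examples([[5, 9, 3], [7, 7], [2, 4, 6, 8]], seq_len=4)
# -> [[5, 9, 3, 1], [7, 7, 1, 2], [4, 6, 8, 1]]; any trailing remainder
#    shorter than seq_len would simply be dropped.
```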
Optimizer: Adafactor with parameter-scaled learning rate scheduling, dynamic weight decay proportional to the squared learning rate, and global-norm gradient clipping. Batch size grows from 1M to 4M tokens over the course of training.
Training was made fully bitwise deterministic under the JAX+XLA+T5X stack. Occasional loss spikes, a known pathology of large-scale LLM optimization, were mitigated by rolling back to a checkpoint roughly 100 steps before the spike and skipping the batches surrounding it (Chowdhery et al., 2022).
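A schematic of that rollback-and-skip recipe; the spike detector, thresholds, and bookkeeping below are illustrative, not the production logic:

```python
import collections

def train_with_spike_recovery(step_fn, n_steps, checkpoint_every=100,
                              spike_factor=3.0, skip=200):
    # step_fn(batch_index) runs one optimizer step and returns its loss.
    recent = collections.deque(maxlen=20)   # trailing window of losses
    ckpt_step, step = 0, 0
    while step < n_steps:
        loss = step_fn(step)
        if recent and loss > spike_factor * (sum(recent) / len(recent)):
            # Spike: restore parameters from the trailing checkpoint
            # (elided here) and resume the data stream `skip` batches
            # ahead, so the offending batches are never replayed.
            step = ckpt_step + skip
            recent.clear()
            continue
        recent.append(loss)
        if step % checkpoint_every == 0:
            ckpt_step = step                # would also save model state
        step += 1
```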
4. Empirical Results and Scaling Trends
PaLM-540B consistently advances the few-shot state of the art across benchmark suites:
- English NLP: Outperforms prior SOTA on 28/29 widely-evaluated tasks, e.g., TriviaQA (EM +5.6 over best), HellaSwag, SuperGLUE; 74.7% 1-shot average NLU versus GPT-3 (65.4%) and GLaM-64B (68.7%).
- Reasoning: Chain-of-thought (CoT) prompting on GSM8K matches or exceeds prior supervised results (58% with CoT plus a calculator). Breakthrough performance is also documented on metaphor, proverb, and analogy tasks.
- Code generation (PaLM-Coder): HumanEval pass@100 of 88.4% (vs. Codex-12B at 72.3%), with strong performance on MBPP, TransCoder, and DeepFix.
- Machine Translation: Zero-shot BLEU on WMT14 En→Fr peaks at 38.5 (supervised SOTA ~45). Mid/low-resource performance is within 4–5 BLEU of specialized models (Chowdhery et al., 2022, Vilar et al., 2022).
- BIG-bench (broad generalization): Outperforms human average and prior API LLMs on 44/58 common tasks (normalized scores >100 on select tasks).
Scaling phenomena: Most tasks follow log-linear performance improvements with increased model FLOPs (“power law”), but ~25% of BIG-bench tasks show discontinuous, “emergent” jumps when moving from 62B to 540B parameters (Chowdhery et al., 2022).
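To make the two regimes concrete, the toy fit below contrasts a smoothly scaling task with an emergent one; all scores and FLOP counts here are synthetic illustrations, not benchmark data:

```python
import numpy as np

# Synthetic illustration only: a smoothly scaling task vs. an "emergent" one.
flops    = np.array([1e22, 4e22, 2.5e23])        # ~8B, ~62B, ~540B scale
smooth   = np.array([38.0, 47.0, 61.0])          # ~linear in log-FLOPs
emergent = np.array([5.0, 6.0, 58.0])            # flat, flat, sudden jump

slope, intercept = np.polyfit(np.log10(flops[:2]), smooth[:2], 1)
pred = slope * np.log10(flops[2]) + intercept
print(f"log-linear extrapolation for the smooth task: {pred:.1f} (actual 61.0)")

trend = emergent[1] + (emergent[1] - emergent[0])  # naive linear extrapolation
print(f"emergent task beats its own trend by {emergent[2] - trend:.1f} points")
```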
5. Multilingual, Embodied, and Multimodal Extensions
PaLM's core model exhibits strong transfer to multilingual and code domains, and the same design principles have enabled new model families (both input schemes are sketched after this list):
- PaLM-E: Extends PaLM to "embodied" multimodal reasoning by injecting images, continuous state estimates, or object-centric slots directly into the token sequence as embeddings, with no architectural modification beyond input encoding. PaLM-E-562B (540B language backbone + 22B ViT) demonstrates robust embodied reasoning, robot planning, visual question answering, and vision-and-language generalization; notably, catastrophic forgetting of NLU/NLG capability is nearly eliminated at scale (Driess et al., 2023).
- AudioPaLM: Fuses PaLM-2 with AudioLM to natively process and generate both speech and text via a unified vocabulary of SentencePiece and discretized audio tokens. Leveraging cross-modal pretraining, AudioPaLM delivers state-of-the-art end-to-end speech recognition and speech-to-speech translation, and can transfer paralinguistic voice cues cross-lingually (Rubenstein et al., 2023).
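Two minimal sketches of the input-side ideas in these extensions. First, PaLM-E-style multimodal sequence construction, where continuous features are projected to the embedding width and spliced in at placeholder positions (all names and dimensions are illustrative):

```python
import numpy as np

def build_multimodal_sequence(token_ids, embed_table, image_features,
                              W_proj, img_slot_id=-1):
    # Project continuous observations (e.g. ViT outputs) to d_model and
    # splice them into the token-embedding sequence at placeholder slots;
    # the Transformer that consumes `seq` is unchanged.
    img_embeds = iter(image_features @ W_proj)
    seq = [next(img_embeds) if t == img_slot_id else embed_table[t]
           for t in token_ids]
    return np.stack(seq)

d_model, d_vit, vocab = 16, 8, 100
rng = np.random.default_rng(0)
seq = build_multimodal_sequence(
    token_ids=[12, -1, 34, 56],                  # placeholder at position 1
    embed_table=rng.normal(size=(vocab, d_model)),
    image_features=rng.normal(size=(1, d_vit)),  # one visual feature vector
    W_proj=rng.normal(size=(d_vit, d_model)),
)
assert seq.shape == (4, d_model)
```

Second, an AudioPaLM-style unified vocabulary, where discretized audio units are offset past the text token range so a single embedding table and softmax cover both modalities (the codebook size is an assumed example):

```python
TEXT_VOCAB_SIZE = 256_000   # PaLM's SentencePiece ids occupy [0, 256k)
N_AUDIO_TOKENS  = 1024      # assumed codebook size for discretized audio

def to_unified_id(token, is_audio):
    # Offset audio units past the text range so a single embedding table
    # and output softmax cover both modalities.
    return TEXT_VOCAB_SIZE + token if is_audio else token

def from_unified_id(uid):
    if uid >= TEXT_VOCAB_SIZE:
        return uid - TEXT_VOCAB_SIZE, True   # (audio unit, is_audio)
    return uid, False

assert from_unified_id(to_unified_id(7, is_audio=True)) == (7, True)
```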
6. Bias, Toxicity, and Memorization Analysis
Extensive safety and memorization studies reveal several findings:
- Bias: Coreference (Winogender) bias lessens at larger scale (PaLM-540B generative accuracy: 84.7% vs. 62B at 71.7%). Gap to human performance persists on “gotcha” anti-stereotype examples.
- Toxicity: Continuations tend to amplify prompt toxicity, and this correlation strengthens with model size. The probability of toxic continuations for deliberately toxic prompts remains below the human baseline but rises with scale.
- Memorization: For held-out 50-token training sequences, PaLM-540B greedily reproduces 0.75% of sequences seen exactly once, rising above 40% for content seen more than 500 times. Code and templated text are overrepresented among memorized spans; overall memorization rises from 1.6% (8B) to 2.4% (540B) (Chowdhery et al., 2022).
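A sketch of the greedy-decoding memorization probe described above; `model_greedy_continue` is a hypothetical helper standing in for the real decoder:

```python
def memorized(model_greedy_continue, train_seq, prompt_len=50, target_len=50):
    # Prompt with the first 50 tokens of a training sequence, greedily
    # decode 50 more, and count it as memorized only on an exact match.
    # `model_greedy_continue(prompt, n)` is a hypothetical decoding helper.
    prompt = train_seq[:prompt_len]
    target = train_seq[prompt_len:prompt_len + target_len]
    return model_greedy_continue(prompt, target_len) == target

fake_model = lambda prompt, n: list(range(n))    # stand-in "model"
print(memorized(fake_model, list(range(120))))   # -> False for this stub
```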
7. Ethical Risks, Mitigation, and Responsible Deployment
PaLM’s scale unlocks transformative capabilities but surfaces distinct risks:
- Toxic or biased language generation,
- Misinformation propagation,
- Privacy compromise through memorization,
- Overreliance on spurious reasoning patterns.
Recommended mitigations include per-task fairness and safety metrics, prompt-level toxicity filtering, inference-time blocking of verbatim training-data reproduction (e.g., via a Bloom filter over training n-grams, sketched below), human-in-the-loop validation for critical applications (e.g., medical), increased robustness via targeted instruction fine-tuning, and sustained dataset monitoring for bias and staleness.
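As one concrete instance of the inference-time blocking idea, a small Bloom filter over training n-grams can flag candidate continuations that would reproduce training data verbatim; the sizes and hashing scheme below are illustrative:

```python
import hashlib

class NgramBloom:
    # Tiny Bloom filter over training n-grams: membership tests may yield
    # false positives, but never false negatives.
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, ngram):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{ngram}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, ngram):
        for p in self._positions(ngram):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, ngram):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(ngram))

bloom = NgramBloom()
bloom.add((5, 9, 3, 7))              # index 4-grams from the training data
# At decode time, suppress a candidate token if it would complete a
# 4-gram that appears in the training set:
block = (5, 9, 3, 7) in bloom        # -> True
```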
PaLM establishes the empirical and engineering foundations for scalable, generalist LLMs and anticipates further research on emergent phenomena, scaling law boundaries, and the structure of responsible, multimodal language intelligence (Chowdhery et al., 2022, Anil et al., 2023, Driess et al., 2023, Rubenstein et al., 2023, Vilar et al., 2022).