TPU v4 Pods: Architecture & Performance

Updated 6 May 2026
  • TPU v4 Pods are advanced ML systems comprising 4096 custom ASIC chips with SparseCore acceleration and optical interconnects, designed for high-performance deep learning.
  • The system architecture features dynamic optical switching and flexible slicing that reconfigures connections within milliseconds to optimize diverse workloads.
  • Empirical benchmarks demonstrate a peak of roughly 1.13 EFLOPS (bfloat16) per pod with significant energy and CO₂e reductions, achieving up to 3.5× speedup over the previous generation on some workloads.

Tensor Processing Unit (TPU) v4 pods are optically reconfigurable machine learning supercomputers designed and deployed by Google since 2020. Each pod comprises 4096 custom ASIC chips optimized for large-scale ML workloads, interconnected via 48 optical circuit switches (OCSes) that enable dynamic, user-defined network topologies. Each TPU v4 chip integrates SparseCores for efficient embedding acceleration, supporting training and inference of massive deep learning models with high bandwidth and low latency. The system architecture emphasizes energy efficiency, scalability, flexible slicing, and reduced environmental impact compared to conventional data center accelerators (Jouppi et al., 2023).

1. Physical System Architecture and Pod Layout

A TPU v4 pod is engineered as a hierarchical assembly of chips, boards, CPU hosts, and racks:

  • Each TPU v4 ASIC (7 nm, <600 mm² die) is packaged with four stacks of HBM2 memory (32 GiB per chip); 64 chips populate each rack, organized as 16 boards of 4 chips each.
  • For I/O, each board features four PCI-Express connectors (host interface) and sixteen OSFP optical connectors (for Inter-Chip Interconnect, ICI).
  • The internal mesh layout per board is a 2×2 arrangement, with four embedded ICI links forming the mesh.
  • Four chips connect to one CPU host via PCIe (4:1 ratio).
  • One pod aggregates 64 racks, totaling 4096 chips and 1024 CPU hosts.
  • Power consumption at the chip level (ASIC+HBM) under production workloads spans 90 W (idle) to 192 W (max), with the mean at 170 W. Per-rack draw including CPUs and supporting infrastructure is 12–15 kW. A full pod runs at 0.8–1.0 MW (including cooling, facility PUE ≈ 1.10 yields ~1.1 MW total).
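
The hierarchy above implies the pod-level totals directly. A minimal plain-Python sketch of the arithmetic, using only the per-chip and per-rack figures quoted in the bullets (nothing else is assumed):

```python
# Worked totals for a TPU v4 pod from the per-chip/per-rack figures above.

chips_per_board = 4
boards_per_rack = 16
racks_per_pod   = 64
chips_per_host  = 4            # PCIe 4:1 chip-to-host ratio

chips_per_rack = chips_per_board * boards_per_rack       # 64
chips_per_pod  = chips_per_rack * racks_per_pod          # 4096
hosts_per_pod  = chips_per_pod // chips_per_host         # 1024

hbm_per_chip_gib = 32
hbm_per_pod_tib  = chips_per_pod * hbm_per_chip_gib / 1024   # 128 TiB

mean_chip_power_w = 170                                   # ASIC + HBM, production mean
it_power_mw = chips_per_pod * mean_chip_power_w / 1e6     # ~0.70 MW for the chips alone
pue = 1.10                                                # facility overhead
# Adding CPU hosts, networking, and cooling pushes the pod toward the quoted
# 0.8-1.0 MW, and ~1.1 MW once the PUE of 1.10 is applied.

print(chips_per_pod, hosts_per_pod, hbm_per_pod_tib, round(it_power_mw, 2))
```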

2. Optical Interconnects, Topologies, and Dynamic Reconfiguration

The TPU v4 fabric is built on Google's Palomar OCS, a MEMS-mirror optical circuit switch (switch time ≤ 5 ms) with a 136×136 port matrix (128 active ports, 8 spares) whose data path is purely optical, with no electrical conversion or packet processing:

  • The main building block is a 4×4×4 cube of 64 chips; each of its six faces exposes 16 fiber links, for 96 optical links per block. Opposite faces are paired through the same switches, so 48 in–out link pairs per block provide the torus wrap-around connections.
  • A full pod is composed of 64 such blocks interconnected by 48 OCSes, controlling the high-radix optical mesh.
  • Users may request logical slices of shape $4a \times 4b \times 4c$ with $0 < a \le b \le c$ (each dimension a multiple of 4, i.e., whole 4×4×4 cubes), with the OCS reconfiguring connections at millisecond latency once per job.
  • Supported topologies include 2D, 3D, and twisted 3D tori, with twisted variants (e.g., $k \times k \times 2k$) reducing the maximum hop count for improved throughput (e.g., in a 4×4×8 slice, the twisted torus achieves 1.63× the all-to-all throughput of the regular 3D torus).
  • Compared to InfiniBand (IB-400), TPU v4’s ICI link (50 GB/s per direction) is twice as fast per fiber. Replacing the OCS with IB would require approximately twelve times the switches and much higher power.
  • OCS-based fabrics deliver low per-hop latency (50 ns chip router delay) without packet buffering; overall latency for $h$ electrical hops plus one OCS traversal is $h \cdot t_{hop} + t_{oc}$, where $t_{oc}$ is negligible except during reconfiguration ($\mathcal{O}(1\ \text{ms})$).
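
Two of the quantitative points above lend themselves to a short sketch: which slice shapes the OCS can instantiate (dimensions in multiples of 4, i.e., whole 4×4×4 cubes, per the bullet above) and the per-message latency model $h \cdot t_{hop} + t_{oc}$. The helper functions below are illustrative only and not part of any TPU API:

```python
from itertools import combinations_with_replacement

# Slice shapes are built from 4x4x4 cubes, so each dimension is a multiple of 4
# and the product cannot exceed the 4096 chips of a single pod (max 16 per axis).
def valid_slice_shapes(max_chips=4096):
    dims = range(4, 17, 4)                       # 4, 8, 12, 16
    for x, y, z in combinations_with_replacement(dims, 3):
        if x * y * z <= max_chips:
            yield (x, y, z)

# Latency model from the bullet above: h electrical hops plus one OCS traversal.
def message_latency_ns(h, t_hop_ns=50.0, t_oc_ns=0.0):
    # t_oc is effectively zero in steady state; it only matters (O(1 ms))
    # while the MEMS mirrors are being repositioned at job start.
    return h * t_hop_ns + t_oc_ns

print(sorted(valid_slice_shapes())[:5])          # (4, 4, 4), (4, 4, 8), ...
print(message_latency_ns(h=6))                   # 300.0 ns across six ICI hops
```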

3. Compute and Memory Performance

Key performance metrics for TPU v4 pods are as follows:

  • Per-chip peak throughput is 275 TFLOPS (bfloat16/INT8); per-pod peak (bfloat16) is 4096 × 275 TFLOPS ≈ 1.13 EFLOPS.
  • Compared with TPU v3 (123 TFLOPS per chip; 0.126 EFLOPS per 1024-chip pod), TPU v4 is 2.1× faster and improves performance/Watt by 2.7× (mean chip power ~170 W vs. ~220 W).
  • System-wide HBM2 memory totals 128 TiB (4096 × 32 GiB), with aggregate bandwidth of 4.9 PB/s (4096 × 1200 GB/s).
  • On-chip SRAM (CMEM + VMEM) per chip is 160 MiB (128 MiB + 32 MiB), improving performance for RNNs and small-matrix computations.
  • Global embedding tables are distributed over all HBM2 memory, with access facilitated uniformly via the ICI, providing a logically flat address space.
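
To make these figures concrete, a small sketch that reproduces the pod-level arithmetic and derives the resulting compute-to-HBM-bandwidth ratio per chip; the ratio is computed here as a rough roofline-style balance point and is not a number quoted in the paper:

```python
# Pod-level arithmetic from the per-chip figures above.
chips = 4096

peak_tflops_per_chip = 275                      # bfloat16
pod_peak_eflops = chips * peak_tflops_per_chip / 1e6
print(f"pod peak: {pod_peak_eflops:.2f} EFLOPS")            # ~1.13 EFLOPS

hbm_gib_per_chip = 32
hbm_bw_gbs_per_chip = 1200
print(f"pod HBM:  {chips * hbm_gib_per_chip / 1024:.0f} TiB, "
      f"{chips * hbm_bw_gbs_per_chip / 1e6:.1f} PB/s")       # 128 TiB, ~4.9 PB/s

# Rough compute/bandwidth balance per chip (bf16 FLOPs per HBM byte):
flops_per_byte = peak_tflops_per_chip * 1e12 / (hbm_bw_gbs_per_chip * 1e9)
print(f"~{flops_per_byte:.0f} bf16 FLOPs per HBM byte")       # ~229
```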

4. SparseCore Architecture and Embedding Acceleration

Each TPU v4 features the SparseCore (SC) accelerator, dedicated to embedding-related operations:

  • The architecture comprises 16 tiles per SC, each integrating a Fetch Unit (HBM reads into a 2.5 MiB scratchpad), an 8-wide SIMD vector unit (scVPU, distinct from the TensorCore's VPU), and a Flush Unit (writing embedding updates back to HBM).
  • Specialized cross-channel units (Sort, Sparse Reduce, Concat, Fork, DMA) execute variable-length gather/scatter and reductions across tile memory.
  • SC occupies 5% of chip area and consumes 5% of chip power.
  • End-to-end embedding lookup—globally distributed and all-to-all across chips—achieves performance that scales with fabric bisection bandwidth. SC delivers 5×–7× higher performance for embedding table operations relative to CPU DRAM or external server-based embeddings.
  • Offloads include embedding lookups, dynamic deduplication, and sparse reduction, which overlap with main TensorCore operation for higher throughput.
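
A toy NumPy illustration of the access pattern SparseCore accelerates: an embedding table sharded row-wise across chips, looked up shard by shard (the all-to-all step) and sum-reduced into a pooled vector. This is plain CPU code sketching the data movement, not TPU or SparseCore code, and all sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

num_chips, vocab, dim = 8, 1_000, 16             # toy sizes; a real table spans a whole slice
table = rng.normal(size=(vocab, dim)).astype(np.float32)

# Shard the embedding table row-wise across chips (logically flat address space over all HBM).
shards = np.array_split(table, num_chips)
row_to_chip = np.concatenate([np.full(len(s), c) for c, s in enumerate(shards)])
row_offset  = np.concatenate([np.arange(len(s)) for s in shards])

def lookup(ids):
    """Gather embedding rows for a batch of ids, shard by shard (the all-to-all step),
    then sum-reduce them into one pooled vector, as a DLRM-style bag embedding would."""
    out = np.zeros(dim, dtype=np.float32)
    for chip in range(num_chips):
        local = [row_offset[i] for i in ids if row_to_chip[i] == chip]
        if local:                                # only chips owning requested rows contribute
            out += shards[chip][local].sum(axis=0)
    return out

print(lookup(ids=[3, 517, 998]).shape)           # (16,)
```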

5. Scalability, Availability, and Scheduling

TPU v4 pods enable flexible, robust scaling for ML workloads:

  • System slices from 64 up to 4096 chips are possible, with block-shaped partitions (dimensions in multiples of 4, assembled from 4×4×4 cubes) instantiated via the OCS without physical re-cabling.
  • OCS reconfiguration (∼1 ms) is incurred per job and is amortized across multi-hour training runs, resulting in negligible runtime overhead.
  • Goodput (effective system throughput accounting for failures) remains ≥99.5% even with host availability as low as 99.0%: OCS re-routes around failed hosts/trays on demand.
  • Job scheduling is simplified: contiguous idle chip blocks are unnecessary; the system can assemble a new slice from any available 4×4×4 blocks.
  • Empirically, LLM training sustains ~60% of peak (0.68 EFLOPS of 1.13 EFLOPS) over long periods (e.g., 50 days for PaLM).
  • Some DNN architectures (e.g., CNN0, RNN0, RNN1, BERT1) scale near-linearly up to 3000 chips.
  • Speedup over TPU v3 at matched slice sizes: dense workloads are 1.5–2.0× faster; for DLRM0 and RNN1 (512 chips), speedups reach 3.0–3.5×.
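
The scheduling and utilization claims above can be illustrated in a few lines of Python. The simulation below is a simplification introduced for illustration (each 4×4×4 cube is treated as independently healthy with the quoted 99.0% availability); it is not the paper's goodput model:

```python
import random

random.seed(1)

CUBES_PER_POD = 64                               # 64 blocks of 4x4x4 = 4096 chips
host_availability = 0.99                         # a cube is usable only if its hosts are up

# Because the OCS can wire together *any* healthy cubes, a job needing k cubes
# fails to schedule only if fewer than k of the 64 are currently healthy.
def schedulable(k_cubes, trials=20_000):
    ok = 0
    for _ in range(trials):
        healthy = sum(random.random() < host_availability for _ in range(CUBES_PER_POD))
        ok += healthy >= k_cubes
    return ok / trials

print(schedulable(k_cubes=48))                   # ~1.0: spare cubes let the OCS route around failures
print(schedulable(k_cubes=64))                   # needs every cube healthy: ~0.99**64 ≈ 0.53

# Sustained utilization quoted above for large LLM runs:
print(f"{0.68 / 1.13:.0%} of pod peak")          # ~60%
```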

6. Comparative Evaluation and Environmental Impact

TPU v4 pods are benchmarked against other ML training DSAs:

  • In MLPerf Training 2.0, a 4096-chip TPU v4 pod is 1.15× faster on BERT and 1.67× faster on ResNet than comparably sized NVIDIA A100 systems; Graphcore IPU Bow throughput on these workloads is roughly 0.22–0.23× that of TPU v4.
  • Per chip, TPU v4 (TDP ~192 W) achieves similar peak throughput to A100 (312 TFLOPS, 400 W TDP), but pods use 1.3–1.9× less power at scale.
  • For matched MLPerf throughput (ResNet/BERT), A100 consumes 1.93× more power (380 W vs. 197 W) in 64-chip slices.
  • Energy and CO₂e analysis: TPU v4 pods in Google’s warehouse-scale cloud (PUE 1.10, 90% carbon-free energy in Oklahoma) use 2–6× less energy and emit ~20× less CO₂e per training job than on-premises ML DSA systems (PUE 1.57, grid carbon intensity 0.475 kgCO₂e/kWh).

| Feature | TPU v4 | NVIDIA A100 | Graphcore MK2 IPU |
|---|---|---|---|
| Peak bf16 throughput | 275 TFLOPS/chip (~1.13 EFLOPS per 4096-chip pod) | 312 TFLOPS/chip | — |
| On-chip SRAM | 160 MiB/chip | 40 MiB/chip | 900 MiB/chip |
| HBM2 bandwidth | 1200 GB/s/chip | 2039 GB/s/chip | — |
| MLPerf ResNet speedup (relative to A100) | 1.67× | 1.00× (baseline) | 0.22× |
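
A back-of-the-envelope sketch of how the energy and CO₂e figures above compose, assuming the usual accounting (facility energy = IT energy × PUE; emissions = facility energy × grid carbon intensity). The job size and the decomposition are illustrative assumptions, not numbers from the paper:

```python
# How the quoted energy/CO2e advantages compose (illustrative decomposition).
def co2e_kg(it_energy_kwh, pue, kg_per_kwh):
    """Facility energy = IT energy x PUE; emissions = facility energy x grid carbon intensity."""
    return it_energy_kwh * pue * kg_per_kwh

onprem_pue, cloud_pue = 1.57, 1.10          # PUE figures quoted above
onprem_intensity = 0.475                    # kgCO2e/kWh, average grid (on-prem)
job_it_kwh = 100_000                        # placeholder job size, not a figure from the paper

print(f"on-prem baseline: {co2e_kg(job_it_kwh, onprem_pue, onprem_intensity)/1000:.0f} tCO2e")
print(f"PUE gap alone contributes ~{onprem_pue / cloud_pue:.2f}x of the energy advantage")

# Working backwards from the headline figures (2-6x less energy, ~20x less CO2e),
# the cloud site's effective carbon intensity must be roughly 3-10x lower than the
# average grid, consistent with ~90% carbon-free energy:
for energy_factor in (2, 6):
    intensity_ratio = 20 / energy_factor
    print(f"{energy_factor}x energy advantage -> grid ~{intensity_ratio:.0f}x cleaner "
          f"(~{onprem_intensity / intensity_ratio:.2f} kgCO2e/kWh)")
```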

A plausible implication is that the optically reconfigurable fabric substantially reduces the number of network elements and associated energy/power overhead, while SparseCore delivers multiplicative speedup for embedding-centric models. The environmental data further suggest a significant advantage for large-scale, energy-efficient ML deployment in sustainable, centralized data centers.

(Jouppi et al., 2023)

References (1)

Jouppi, N. P., et al. (2023). TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA '23).
