Google's Training Supercomputers from TPU v2 to Ironwood: Architectural Stability, Scale, Resilience, Power Efficiency, and Sustainability Across Five Generations

Published 14 Jun 2026 in cs.AR | (2606.15870v1)

Abstract: This paper (to appear in the July/August 2026 issue of IEEE Micro magazine) summarizes five generations of Google's TPUs, from TPU v2 to Ironwood, highlighting their evolution as scalable, resilient, power-efficient, sustainable supercomputers for AI training. It details the TPU's stable architecture, which has surprisingly easily accommodated the rapidly changing deep neural network workloads, such as the rise of Transformers. Key advancements over eight years include 10x increase in HBM capacity and bandwidth per node, a 100x increase in peak node performance, and a 3600x increase in supercomputer performance. The paper also discusses the role of optical circuit switches, built-in self test, and hardware replay in enhancing resilience and how TPU's environmental impact is reduced with substantial improvements in performance per Watt and in carbon emissions per floating point operation. It concludes by identifying six features that may well characterize successful training accelerators of this decade.


Authors (4)

Summary

The paper demonstrates that the persistent two-core TPU architecture enables adaptation to shifting deep learning demands across five generations.
The paper details systematic performance gains, including a 10x increase in HBM capacity and bandwidth per node and a 100x increase in peak node performance, which together with larger systems yield a 3600x increase in supercomputer performance.
The paper highlights advances in resilience, power efficiency, and sustainability, including operational and embodied compute carbon intensity that improved 3.7x and 3.8x, respectively, relative to TPU v5p.

Evolution and Architectural Stability of Google's TPU Training Supercomputers

Overview

The paper "Google's Training Supercomputers from TPU v2 to Ironwood: Architectural Stability, Scale, Resilience, Power Efficiency, and Sustainability Across Five Generations" (2606.15870) documents the technical trajectory of Google's TPUs, spanning from TPU v2 to Ironwood. It emphasizes the architectural decisions, system-level scaling, resilience mechanisms, efficiency improvements, and environmental considerations that have collectively defined this lineage of AI training accelerators. Through detailed empirical analysis and system design exposition, the authors demonstrate the architectural stability and adaptive capacity that have allowed TPUs to surmount the challenges inherent in domain-specific ASICs amid rapid shifts in DNN workloads.

System Scaling and Memory Advances

Across five generations, Google’s TPU systems scaled by one to three orders of magnitude, depending on the metric:

HBM capacity and bandwidth per node were each expanded by roughly 10x, with Ironwood providing 192 GiB at 7300 GB/sec per TPU, compared to TPU v2’s 16 GiB at 700 GB/sec.
Peak node performance increased 100x, culminating in Ironwood’s 4614 FP8 TFLOPS per TPU.
Aggregate supercomputer-level performance grew by 3600x, with supercomputer size increasing 36x (256 nodes in TPU v2 to 9216 nodes in Ironwood).
Directly-addressable shared HBM memory reached 1.77 petabytes, enabling unprecedented support for extremely large models.

These gains were achieved despite the slowdown of Dennard scaling and Moore's Law, underscoring the contribution of architectural scaling and component-level enhancements.
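The headline ratios above can be cross-checked from the per-node figures quoted in this summary. The Python sketch below does only that arithmetic. The variable names are illustrative, and the 100x per-node factor is taken as reported rather than recomputed, since TPU v2's peak TFLOPS is not stated here.

```python
# Figures quoted in this summary (per TPU node and per supercomputer).
V2 = {"hbm": 16, "hbm_gb_per_s": 700, "nodes": 256}
IRONWOOD = {"hbm": 192, "hbm_gb_per_s": 7300, "nodes": 9216, "fp8_tflops": 4614}

# Reported per-node peak performance gain; not recomputed in this sketch.
REPORTED_NODE_SPEEDUP = 100

hbm_capacity_ratio = IRONWOOD["hbm"] / V2["hbm"]
hbm_bandwidth_ratio = IRONWOOD["hbm_gb_per_s"] / V2["hbm_gb_per_s"]
node_count_ratio = IRONWOOD["nodes"] / V2["nodes"]

# Supercomputer-level gain = (more nodes) x (faster nodes).
pod_speedup = node_count_ratio * REPORTED_NODE_SPEEDUP

# Peak FP8 throughput of a full Ironwood system, converted TFLOPS -> EFLOPS.
pod_peak_exaflops = IRONWOOD["nodes"] * IRONWOOD["fp8_tflops"] / 1e6

# Aggregate HBM, in the same unit as the per-node figure of 192.
pod_hbm_total = IRONWOOD["nodes"] * IRONWOOD["hbm"]

print(f"HBM capacity ratio:  {hbm_capacity_ratio:.1f}x")
print(f"HBM bandwidth ratio: {hbm_bandwidth_ratio:.1f}x")
print(f"Node count ratio:    {node_count_ratio:.0f}x")
print(f"Pod speedup:         {pod_speedup:.0f}x")
print(f"Pod peak (FP8):      {pod_peak_exaflops:.1f} EFLOPS")
print(f"Pod HBM total:       {pod_hbm_total:,} per-node units")
```

The per-node ratios come out at 12x for capacity and about 10.4x for bandwidth, consistent with the rounded "10x" claim. The aggregate of 1,769,472 units matches the quoted 1.77 petabytes if the per-node figure of 192 is read as decimal gigabytes.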

Architectural Stability Amid Evolving DNN Workloads

A central claim of the paper is the sustained viability of the underlying microarchitecture:

The two-core-per-TPU design, with large matrix multiply units (MXUs), vector processing units (VPUs), and a compiler-controlled memory hierarchy, has remained intact throughout all training TPU generations.
Early skepticism about the lifespan of domain-specific ASICs was disproven, as the original TPU v2 architecture easily adapted to shifting DNN workloads, notably accommodating the emergence and eventual dominance of Transformer models.
The VLIW instruction model facilitates deterministic, statically scheduled computation and is widened each generation to accommodate expanded parallelism.
Core features—systolic arrays for matrix multiplies, range-oriented narrow floating point formats (BF16/FP8), HBM, custom interconnects (ICI), DMA-controlled scratchpad memory, and robust VPUs—are identified as foundational and now widely adopted by other accelerators.

SparseCores, initially targeting embedding-heavy recommender workloads, evolved to offload a broader set of collective and sparse operations, paralleling TensorCore execution and improving overall throughput for Transformer and Diffusion models.
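To make the systolic-array idea behind the MXU concrete, the following is a minimal cycle-by-cycle simulation in plain Python. It is a sketch under stated assumptions: it models an output-stationary array (each processing element accumulates one output value), which is chosen for readability and is not claimed to be the dataflow the TPU's MXU actually uses. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic grid.

    Rows of A enter from the left edge and move right one cell per cycle;
    columns of B enter from the top edge and move down one cell per cycle.
    Inputs are skewed so matching operands meet in the right cell.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]  # A operand currently held in each cell
    b_reg = [[0] * m for _ in range(n)]  # B operand currently held in each cell

    for t in range(n + m + k - 2):       # cycles until the last operands meet
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    s = t - i            # row i is delayed by i cycles
                    new_a[i][j] = A[i][s] if 0 <= s < k else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # pass operand rightward
                if i == 0:
                    s = t - j            # column j is delayed by j cycles
                    new_b[i][j] = B[s][j] if 0 <= s < k else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # pass operand downward
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # multiply-accumulate
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each cell only ever talks to its left and upper neighbors, which is the property that lets such arrays scale to large sizes without a shared register file or global wiring.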

Resilience and Availability: Optical Circuit Switch Integration

Enhanced system resilience was achieved through:

Modularization and network agility via optical circuit switches (OCSes), beginning with TPU v4; these OCSes allow racks to be independently commissioned and replaced, manage topology changes dynamically, and materially reduce deployment times.
Synchronous data-parallel training is possible at supercomputer scale with >90% goodput, facilitating multi-regional, multi-slice job orchestration.
Ironwood introduced hardware mitigations against silent data corruption (SDC): Functional Built-In Self-Test (FBIST) for MXUs and VPU hardware replay units. These enable continuous health monitoring and rapid isolation of defective units via OCS, ensuring system-level reliability at scale.

The OCS-driven architecture provides a single-tier network, which minimizes programming complexity and avoids the inefficiencies and design burdens of rack-based failover strategies.
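The scheduling benefit of optical reconfiguration can be illustrated with a toy model. The sketch below is not Google's scheduler: the function names, the 64-chip cube size, and the one-dimensional notion of "adjacent" cubes are all assumptions made for illustration. It contrasts a slice built from any healthy cubes (what an OCS permits) with one that needs a run of physically adjacent cubes.

```python
import math

CHIPS_PER_CUBE = 64  # assumption: one cube is a 4x4x4 block of chips


def compose_slice(healthy_cubes, chips_needed):
    """With OCS: pick any healthy cubes, adjacent or not."""
    cubes_needed = math.ceil(chips_needed / CHIPS_PER_CUBE)
    available = sorted(healthy_cubes)
    if len(available) < cubes_needed:
        return None
    return available[:cubes_needed]


def compose_contiguous_slice(healthy_cubes, chips_needed, total_cubes):
    """Baseline without OCS: require a run of adjacent cube IDs (1-D toy model)."""
    cubes_needed = math.ceil(chips_needed / CHIPS_PER_CUBE)
    healthy = set(healthy_cubes)
    for start in range(total_cubes - cubes_needed + 1):
        run = list(range(start, start + cubes_needed))
        if all(cube in healthy for cube in run):
            return run
    return None


# Hypothetical 8-cube system in which cubes 2 and 5 are out of service.
healthy = {0, 1, 3, 4, 6, 7}
print(compose_slice(healthy, 256))                # [0, 1, 3, 4]
print(compose_contiguous_slice(healthy, 256, 8))  # None
```

In this toy case six of eight cubes are healthy, yet a 256-chip job cannot be placed when contiguity is required, while the reconfigurable variant places it immediately.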

Power Efficiency and Sustainability: Performance per Watt and Compute Carbon Intensity

TPU generations have prioritized power efficiency:

Performance per Watt improved roughly 30x, with Ironwood achieving a 6x jump over TPU v5p.
Emphasis has shifted from performance per TCO to performance per Watt, in response to physical constraints on data center power availability.
Life-cycle assessment (LCA) reveals substantial carbon emissions reductions per floating point operation. Compute Carbon Intensity (CCI)—measured as gCO₂e/ExaFLOP—encompasses both embodied and operational emissions, offering a holistic metric for environmental impact.
Ironwood's operational and embodied CCI improved 3.7x and 3.8x, respectively, relative to TPU v5p.

CCI enables direct computation of emissions for model training, supporting practical carbon budgeting strategies. These sustainability improvements are directly linked to system-level performance increases and design efficiency.
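As a worked example of that computation, the sketch below multiplies a training run's total floating point operations by a CCI value and converts the result to metric tons. The inputs (3.14e23 FLOPs and 265 g/ExaFLOP) are the ones quoted in the GPT-3 example discussed in the Knowledge Gaps section of this page and are used here purely for illustration. The function name is hypothetical, and "ExaFLOP" is read as 10^18 operations, not operations per second.

```python
OPERATIONS_PER_EXAFLOP = 1e18  # 1 ExaFLOP = 10^18 floating point operations
GRAMS_PER_METRIC_TON = 1e6


def training_emissions_tonnes(total_flops, cci_g_per_exaflop):
    """Estimate CO2e emissions (metric tons) for a training run.

    total_flops: total floating point operations performed by the run
    cci_g_per_exaflop: compute carbon intensity in gCO2e per ExaFLOP
    """
    exaflops = total_flops / OPERATIONS_PER_EXAFLOP
    grams = exaflops * cci_g_per_exaflop
    return grams / GRAMS_PER_METRIC_TON


tonnes = training_emissions_tonnes(3.14e23, 265)
print(f"{tonnes:.1f} metric tons CO2e")  # 83.2 metric tons CO2e
```

Keeping the unit conversions as named constants makes it harder to slip by a factor of a million between grams and metric tons.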

Practical and Theoretical Implications

The architectural stability demonstrated by Google’s TPUs suggests that domain-specific ASICs can maintain adaptability over extended generational cycles, provided sufficient generality in the underlying design. The migration to a range-oriented floating point representation (BF16, FP8) and the adoption of systolic arrays for matrix operations have become industry standards, further validating these early decisions. SparseCores and OCSes—still unique to TPUs—have provided differentiated acceleration and network resilience, likely to influence future training hardware.
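What "range-oriented" means for BF16 can be shown with a few lines of standard-library Python. BF16 keeps the sign bit and all 8 exponent bits of an IEEE float32 but only the top 7 mantissa bits, so it preserves float32's range while giving up precision. The sketch below converts by simple truncation, which is an assumption made for brevity (hardware typically rounds to nearest). The helper names are hypothetical.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision (FP16) value


def to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x by truncating its float32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def from_bf16_bits(bits16):
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def bf16_round_trip(x):
    """Value of x after being stored in (truncated) BF16."""
    return from_bf16_bits(to_bf16_bits(x))


# Range is preserved: 1e38 is far beyond FP16's maximum but fits in BF16.
print(bf16_round_trip(1.0e38) > FP16_MAX)  # True

# Precision is reduced: 1.001 is not distinguishable from 1.0 in BF16.
print(bf16_round_trip(1.001))  # 1.0
```

The trade favors training workloads, where gradients span many orders of magnitude but individual values tolerate coarse rounding.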

Practically, these advances enable:

Rapid onboarding of new models, as stable architectures allow seamless reuse of compiler and software stack optimizations.
Aggressive scaling of supercomputer size, memory capacity, and bandwidth, supporting training workloads for SOTA foundation models and cross-model experiments.
Enhanced sustainability metrics, supporting environmentally conscious expansion of AI infrastructure.

Theoretically, the persistence of this microarchitecture, despite rapid shifts in DNN paradigms, suggests that future accelerators may be similarly defined by a handful of robust system-level features, analogous to the historic trajectories of the IBM System/360 and x86.

Future Prospects in AI Hardware

Anticipated future developments include:

Continued scaling of memory and interconnect bandwidth, further supporting large context and multimodal models.
Wider adoption of compute carbon intensity as a standard metric for AI accelerator evaluation.
Evolution of hardware-based resilience features to counteract silent data corruption and ensure model reliability at scale.
Deeper integration of compiler-controlled memory hierarchies and deterministic execution models, motivated by both performance and reproducibility requirements.

Given the established architectural stability, the TPU lineage provides a template for future ASIC design, highlighting the importance of system balance, modularization, and environmentally sustainable scaling.

Conclusion

Across five generations, Google’s TPUs, from v2 to Ironwood, have achieved architectural stability, scaling by orders of magnitude, and enhanced resilience, while setting industry benchmarks in power efficiency and sustainability. The paper argues persuasively for the persistence of key architectural decisions: large systolic MXUs, range-oriented narrow floating-point formats, HBM memory, custom interconnects, DMA-controlled scratchpad memory, and robust vector units. The integration of optical circuit switches and SparseCores remains distinctly advantageous for supercomputer-scale training. Power efficiency advances and holistic carbon impact assessments position TPUs as pivotal enablers of sustainable AI infrastructure. This trajectory supports the hypothesis that training accelerator architectures of the 2020s will be defined by a core set of features, maintaining adaptability and efficiency in the face of ever-evolving model workloads.

Explain it Like I'm 14

Google’s Training Supercomputers from TPU v2 to Ironwood — A Simple Explanation

1) What is this paper about?

This paper tells the story of how Google’s special AI chips—called TPUs (Tensor Processing Units)—grew from their second version (TPU v2) to a much newer one nicknamed Ironwood. It explains how these chips and the huge “pods” (supercomputers made of thousands of TPUs working together) became faster, bigger, more reliable, and more energy‑efficient over about eight years. It also shares what design choices helped TPUs keep up with fast‑changing AI models like Transformers.

2) What questions were the authors trying to answer?

In plain terms, the paper asks:

Can a chip designed years ago still work well for today’s fast‑changing AI models?
How did Google scale TPUs from small systems to giant supercomputers?
How did they make these systems more reliable, so long training jobs don’t fail?
How much more energy‑efficient and climate‑friendly did TPUs get over time?
What design ideas seem to matter most for successful AI training chips this decade?

3) How did they study it?

This is a “big picture” review of five TPU generations (v2, v3, v4, v5p, Ironwood). The authors:

Compared the hardware across generations: speed, memory, networking, size, and cooling.
Described the TPU architecture (the way the chip is organized) and showed how it barely changed, even as it scaled up.
Explained reliability features, like how TPUs detect errors and recover without wasting lots of time.
Used simple, consistent measurements to track progress:
- Performance per Watt (how much work you get for each unit of power).
- Goodput (the “useful” training progress after subtracting time lost to hiccups).
- Compute Carbon Intensity (CCI), which is the grams of CO2 emitted per unit of computation, counting both electricity use and the carbon “cost” to manufacture the hardware.
Gave real engineering examples: special networks that use light to rewire around failures, self‑tests built into the chips, and new number formats (like BF16 and FP8) that make math faster without hurting AI training.
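Goodput is the easiest of these measurements to compute by hand. The small Python sketch below treats it as the share of wall-clock time that produced useful training progress. The function name and every number in the example are made up for illustration and do not come from the paper.

```python
def goodput(total_hours, lost_hours_by_cause):
    """Fraction of wall-clock time that produced useful training progress."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    lost = sum(lost_hours_by_cause.values())
    if lost > total_hours:
        raise ValueError("lost time cannot exceed total time")
    return (total_hours - lost) / total_hours


# Hypothetical two-week training job (336 hours of wall-clock time).
lost = {
    "failure_recovery": 10.0,   # waiting for restarts after hardware faults
    "checkpoint_saves": 6.0,    # pauses while saving progress
    "recomputed_steps": 8.0,    # work redone since the last checkpoint
}
job_goodput = goodput(24 * 14, lost)
print(f"Goodput: {job_goodput:.1%}")  # Goodput: 92.9%
```

In this made-up case, 24 of 336 hours are lost, so the job still clears the 90% mark that the paper reports for large synchronous training runs.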

Here are a few technical terms in everyday language:

Matrix Multiply Unit (MXU): Think of this like an assembly line that multiplies big grids of numbers. AI training needs a lot of this. The MXU is built as a “systolic array,” which is like a grid of workers passing numbers along in rhythm to finish big math jobs quickly.
HBM (High Bandwidth Memory): Super‑fast memory placed right next to the chip—like keeping ingredients on the counter instead of running to the pantry every time.
Interconnect (ICI) and 3D Torus network: The high‑speed “roads” that let TPUs talk to each other in a 3D loop, so they can share results fast.
Optical Circuit Switches (OCS): A light‑based “patch panel” that can rewire which TPUs connect to which, in milliseconds. This helps route around broken parts and makes scheduling easier.
SparseCores: Helper units that handle “sparse” data (lots of zeros, like looking up a few items in a huge catalog) and speed up communication jobs.
DMA to scratchpad memory: Like a helper who quietly brings data from the stockroom (HBM) to the workbench (on‑chip memory) so the main workers never have to stop.
VLIW instructions: A “wide” set of instructions that control many units at once, like a conductor’s score guiding the whole orchestra in one go.
BF16 / FP8: Number formats that trade a little precision for a lot of speed and range, which turns out to be great for AI training.
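The "3D loop" idea is simple enough to show in code. In a 3D torus, every chip has six neighbors (one step forward and back along each of three axes), and the edges wrap around so that no chip sits on a boundary. The Python sketch below computes those neighbors; the function name and the 4×4×4 example size are chosen for illustration only.

```python
def torus_neighbors(coord, dims):
    """Return the six neighbors of a chip at coord in a 3D torus of size dims."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),  # along the x axis
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),  # along the y axis
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),  # along the z axis
    ]


# A corner chip in a 4x4x4 block still has six neighbors thanks to wrap-around.
print(torus_neighbors((0, 0, 3), (4, 4, 4)))
# [(1, 0, 3), (3, 0, 3), (0, 1, 3), (0, 3, 3), (0, 0, 0), (0, 0, 2)]
```

The modulo operation is what turns a plain grid into a loop: stepping past the last chip in a row lands back on the first one.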

4) What did they find, and why is it important?

Over eight years, TPUs got dramatically better while keeping the same overall design. The main wins:

Much bigger and faster
- About 100× more performance per TPU chip.
- About 3600× more peak performance for a full supercomputer (“pod”).
- Memory (HBM) per chip grew ~10× in size and speed.
- Pods grew from hundreds to over 9000 chips, with far more network bandwidth.
- A huge shared memory space across the pod (up to ~1.77 petabytes), a record for AI supercomputers.
Architecture stayed stable (which is surprising in AI)
- The basic TPU v2 layout still works for Ironwood.
- Two big “TensorCores” per chip remained the standard. Software can make them act like one giant core (“Megacore”), which makes programming simpler.
- Systolic arrays for matrix math (the MXU) kept scaling up.
- Vector units got stronger for the non‑matrix parts of AI models (like activations and normalization).
- HBM stayed the right memory choice.
- The software stack (XLA, now often used via JAX) kept improving on top of the same foundation.
Reliability improved a lot
- Optical switches (OCS) let the system route around problems and make scheduling big jobs easier. You don’t need a perfect system before starting work; you can bring racks online as they’re ready.
- Functional Built‑In Self‑Test (FBIST) in hardware finds chips with hidden issues.
- Hardware replay in the vector unit double‑checks some operations “for free” to catch rare errors.
- Together, these keep “goodput” high (the amount of useful training progress), even when a few parts fail.
Energy efficiency jumped
- Performance per Watt increased by about 30× across generations, with Ironwood delivering a big extra boost. This matters because data centers are limited by how much power they can get and cool.
Sustainability improved
- The paper uses a fairer climate metric called Compute Carbon Intensity (CCI), which counts both:
  - Operational carbon (electricity used while running).
  - Embodied carbon (emissions from manufacturing the hardware).
- Newer TPUs have much lower emissions per unit of compute, because they finish the same job much faster and more efficiently.
Big design lessons
- The authors highlight six design choices that seem to define successful AI training chips in the 2020s:
  - 1) Systolic arrays for fast matrix multiplies.
  - 2) Narrow, range‑friendly number formats (like BF16, FP8) instead of wide IEEE floats for everything.
  - 3) HBM as main memory.
  - 4) Custom, high‑speed links to build large AI supercomputers.
  - 5) Software‑controlled memory (scratchpads + DMA) instead of automatic caches.
  - 6) Strong vector units for non‑matrix math.

Why this matters: AI models are bigger than ever, especially Transformers. Training them quickly, reliably, and with fewer emissions needs both smart design and scale. TPUs show it’s possible to keep a stable design and still ride the wave of change.

5) So what does this mean for the future?

Faster progress in AI: Bigger and more efficient TPU pods mean researchers and engineers can train huge models faster and more reliably, pushing AI forward.
Practical reliability at massive scale: Features like optical switches and built‑in self‑tests make it possible to run long jobs across thousands of chips without constant breakdowns.
Lower energy and carbon per task: Since data center power is limited, performance per Watt—and the fuller CCI metric—will drive future designs. TPUs show strong gains on both.
Clear design playbook: The six features above give a roadmap for what works in training accelerators. Many other chips are adopting similar ideas.
Stable foundations help everyone: Because TPU architecture stayed steady, software and models improved steadily too. That reduces the time from “new chip” to “real results,” which is great for users and the environment.

In short, Google’s TPUs grew from powerful chips into massive, efficient, and reliable AI supercomputers—without needing to reinvent their core design each time. That combination of stability, scale, and sustainability is likely to guide how the best AI training machines are built in the years ahead.


Knowledge Gaps

Unresolved Knowledge Gaps, Limitations, and Open Questions

Below is a single, consolidated list of concrete gaps and open questions that future work could address:

Architecture/microarchitecture scale limits: What are the practical limits to scaling the current two–TensorCore per chip architecture and the compiler-level Megacore abstraction as model sizes, context lengths, and activation memory footprints continue to grow?
VMEM bottlenecks: Given SRAM density’s slow growth (VMEM only 4× over 8 years), what workloads are now VMEM-limited, and what architectural or packaging alternatives (e.g., more SRAM chiplets, stacked SRAM, NVRAM tiers) best alleviate this?
Remote memory semantics: The paper claims “directly-addressable shared HBM” across the pod; what are the precise programming, coherence/consistency, and protection semantics, and how are contention, partitioning, and security enforced at petabyte scale?
DMA push-only constraint: How often does push-only inter-TPU DMA constrain algorithms or lead to software complexity/performance loss versus full RDMA (with reads)? What would be the costs/benefits of adding safe remote reads in hardware?
SparseCore generality and ROI: Beyond embeddings and a handful of collectives/top-k ops, which additional sparse or irregular kernels benefit materially from SparseCores, and what is their measured end-to-end training speedup across modern LLM, diffusion, and MoE workloads?
Structured sparsity in MXU: Is there hardware support (or planned support) for structured sparsity (e.g., 2:4), block sparsity, or semi-structured sparsity in MXUs, and what are the accuracy/perf trade-offs relative to dense FP8/BF16?
Attention-specific acceleration: Are there dedicated mechanisms for attention/k-v cache management (e.g., blockwise attention, sliding-window kernels, in-MXU softmax/normalization), and how do these compare to vector-only implementations under long-context LLMs?
FP8/FP4 training stability: What are the convergence/robustness envelopes for FP8 (and potential FP4) across diverse architectures (MoE, diffusion, RNN-Transducers), tasks (ASR, vision, RL), and scales, including required loss-scaling and calibration heuristics?
Accumulation precision choices: How often do FP32 accumulations become a bottleneck or a source of nondeterminism, and are mixed-precision accumulation strategies needed for very deep models or very large batch collectives?
Redundant MXU rows efficacy: What is the empirical yield, performance, and field reliability impact of redundant MXU rows, and is one redundant row sufficient at Ironwood scale and process nodes?
HBM reliability/ECC coverage: What are end-to-end ECC/parity protections across HBM stacks, on-die SRAMs (VMEM, registers, instruction memory), ICI links, and OCS optics, and what are observed soft error and wear-out rates at fleet scale?
ICI latency and tail behavior: The paper reports bandwidth but not latency. What are single-hop and multi-hop latencies, queueing/tail latency under large collectives, and the efficacy of congestion control/credit mechanisms at 9K nodes?
OCS reconfiguration impacts: How do OCS reconfiguration delays, control-plane reliability, optical transceiver failure rates, and drift affect long-running jobs’ goodput, and what are the operational playbooks (e.g., live reroute vs. job pause) under faults?
OCS energy/TCO trade-offs: What is the incremental embodied and operational energy/carbon and TCO overhead of OCS infrastructure versus alternative topologies (e.g., electrical fat-tree or dragonfly with per-rack spares)?
Multi-pod/regional scaling: The paper cites >90% goodput across multiple regions; what are the bandwidth/latency, checkpoint cadence, failure domains, and optimizer hyperparameter adjustments needed when stretching slices across pods/regions?
Scheduler policies and fragmentation: How does the OCS-enabled scheduler handle multi-tenant fairness, preemption, defragmentation, and heterogeneity (mixed job sizes), and what is the goodput/utilization frontier at production scale?
Checkpoint/restore overheads: What are checkpoint interval policies, storage bandwidth requirements, and recovery times under typical failure rates, and how do these interact with data parallelism-only training of giant models?
Determinism guarantees: With VLIW, collectives, FP8/BF16 reductions, and compiler fusions, how is strict determinism achieved across thousands of nodes, and what is the remaining nondeterminism budget (if any) acceptable for convergence regression detection?
SDC detection coverage/limits: What quantitative coverage, false positive/negative rates, and time-to-detect do FBIST and VPU hardware replay achieve under real workloads? Is there an MXU-path replay or online redundancy, and what is the runtime overhead envelope?
Environmental/thermal intermittents: How do voltage droops, temperature transients, and power capping interact with SDC rates and replay efficacy, and what adaptive controls (e.g., DVFS, throttling) minimize error bursts without sacrificing goodput?
Liquid cooling sustainability: What are the water use, coolant leakage rates, maintenance burdens, and embodied/operational carbon of the liquid-cooling plant, and how do these factor into LCA beyond chip manufacturing and electricity?
Performance-per-Watt methodology: Reported gains use peak TFLOPS per TDP. What are measured performance-per-Watt and per-goodput on production training workloads, including idle/partial-load periods, power capping, and cooling overhead?
CCI assumptions and regional variability: How sensitive are CCI results to grid carbon intensity (location-based), curtailment/temporal matching of carbon-free energy, and hardware lifetime assumptions (6 years) across different deployment geographies?
CCI calculation inconsistency: The GPT-3 example appears unit-inconsistent (3.14e23 FLOPs × 265 g/EFLOP yields ~83e6 g = ~83 metric tons, not “~83 million metric tons”); audited examples and standardized unit handling are needed.
Embodied carbon granularity: How do differences in packaging (HBM stacks count, chiplet count), OCS gear, racks, and cooling infrastructure contribute to embodied CCI by component, and what design choices most reduce embodied emissions per FLOP?
Software stack portability: With JAX/XLA now primary, what is the maturity, overhead, and feature parity for PyTorch and other ecosystems (dynamo/Inductor, XLA:GPU parity), especially for dynamic shapes, control flow, and custom kernels?
Compile time and dev ergonomics: What are XLA compile times, caching strategies, Pallas kernel development costs, and debugging/profiling tooling gaps for large-model inner loops and fused ops?
Model-parallel strategies: Beyond tensor parallel “Megacore,” what are the limits and best practices for pipeline and sequence parallelism on TPUs, including optimizer sharding and KV cache sharding for very long context inference/training?
MoE and all-to-all stress: How do SparseCores and ICI/OCS handle MoE all-to-all at scale (switch oversubscription, tail latency, imbalance), and what kernel/network co-designs mitigate hotspotting and dropped experts?
Long-context memory pressure: With 1M+ token contexts, how is KV cache managed across HBM and interconnect, what are spill policies, and what perf/accuracy trade-offs arise from compression or blockwise eviction?
Security and multi-tenancy: What are isolation guarantees (VMEM/HBM protection, PCIe/host DMA isolation), memory encryption availability, and side-channel mitigations under multi-tenant training/serving?
Comparative evaluation: The paper lacks head-to-head results against contemporary accelerators (GPUs/other ASICs) on standardized end-to-end training tasks, total cost to train, goodput under failures, and energy/CCI—critical for external validation.
Cost and availability constraints: What are the capital and operational cost profiles (including OCS), supply-chain risks (HBM availability, optics), and their impact on deployment scale and scheduling policy choices?
Future scaling path: What are the identified blockers (reticle limit, package power density, HBM pinout, optical IO) for a next 10× increase in pod scale/performance, and which roadmap options (more chiplets, photonic IO, 3D stacking) are prioritized?


Practical Applications

Immediate Applications

Below are concrete applications that can be deployed now, drawing directly from the paper’s results on architecture, resilience, efficiency, and sustainability.

Sector: AI/Software (industry, academia). High-goodput training of large Transformer and Diffusion models on TPU v5p/Ironwood-class pods using synchronous data-parallel slices (e.g., 2K-node jobs) to reach >90% goodput. Tools/workflows: JAX + XLA with tensor-parallel “Megacore,” deterministic VLIW execution, topology-aware schedulers; frequent checkpoint/restore. Assumptions/dependencies: Access to TPU v5p/Ironwood capacity (e.g., via Google Cloud or colocated pods), model parallelizability, robust input pipelines and storage throughput.
Sector: Advertising, Retail, Search (industry). Acceleration of embedding-heavy recommendation systems by offloading to SparseCores for all-to-all scatters/gathers, Top-K, and small sparse tensor operations. Tools/workflows: XLA fusions, SparseCore kernels, dataflow partitioning of embeddings across supercomputer-addressable HBM. Assumptions/dependencies: Embedding tables engineered for SparseCore granularity; software support for collective offload; dataset features engineered for scatter/gather locality.
Sector: Finance, Healthcare, Safety-critical ML (industry, academia). Improved reliability of long-running training via FBIST and compiler-transparent hardware replay to detect silent data corruption with near-zero overhead. Tools/workflows: Periodic FBIST during burn-in and fleet operation; automatic vector bundle replay monitoring; immediate removal of suspect nodes via reconfigurable interconnect. Assumptions/dependencies: Availability of Ironwood-class hardware; operational processes to quarantine/replace nodes; alignment with compliance/audit requirements.
Sector: Cloud/Datacenter Operations (industry). Higher utilization and faster time-to-production through optically reconfigurable “cube” deployment: pods remain usable during staged bring-up and can schedule slices without requiring contiguous racks. Tools/workflows: OCS-driven reconfigurable 3D torus; scheduler that composes slices from noncontiguous cubes; modular installation/repair playbooks. Assumptions/dependencies: OCS-capable clusters (e.g., TPU v4+); trained ops teams; spare cube capacity policy.
Sector: MLOps/DevEx (industry, academia). Performance-portable kernels and model development with JAX + Pallas to precisely control VMEM/DMAs and vector/MXU usage across TPU generations. Tools/workflows: Kernel autotuning, fusion-friendly graph transformations in XLA, deterministic timing to reproduce perf/accuracy across releases. Assumptions/dependencies: Dev teams able to adopt JAX/Pallas; profiling and compiler literacy; test infrastructure for regression and determinism.
Sector: Sustainability/ESG, Procurement (industry, policy). Immediate adoption of compute carbon intensity (CCI; gCO2e/FLOP) as a planning, procurement, and reporting metric for AI workloads. Tools/workflows: Add FLOP accounting to training pipelines; LCA-informed vendor comparisons; quarterly CO2e/FLOP reporting; carbon-aware cost models. Assumptions/dependencies: Access to embodied emissions data from vendors; reliable electricity emissions factors (location- and market-based); FLOP estimation in MLOps.
Sector: Model Efficiency (industry, academia). Transition to range-oriented numerics (BF16 now, FP8 where validated) to increase throughput and lower energy/emissions while preserving convergence. Tools/workflows: Mixed-precision training recipes; calibration and loss-scaling; automated numeric validation in CI. Assumptions/dependencies: Model families validated under BF16/FP8; tolerance to quantization; monitoring for rare numeric instabilities.
Sector: HPC/AI Infrastructure (industry, academia). Topology-aware job scheduling on torus networks for better goodput and fairness under failures (with OCS rerouting around bad hosts). Tools/workflows: Scheduler plugins aware of 3D torus coordinates and cube availability; preemption and reslicing policies; spare cube pools. Assumptions/dependencies: Access to topology metadata; policy and SRE readiness for dynamic rerouting; multi-tenant fairness governance.
Sector: Data Engineering (industry). Balanced data ingestion with PCIe-connected hosts and storage tailored to TPU throughput to eliminate input bottlenecks during large-scale training. Tools/workflows: End-to-end profiling (host, network, storage); RDMA or high-throughput dataloaders; prefetch with DMAs and sync flags. Assumptions/dependencies: Sufficient host NIC bandwidth; storage parallelism; dataset sharding aligned with training slices.
Sector: Education/Research (academia). Reproducible training experiments leveraging deterministic VLIW execution, stable architecture, and compiler-controlled memory to improve scientific rigor. Tools/workflows: Version-pinned XLA graphs and kernels; deterministic seeds/exec order; open-sourcing of FLOP and CCI measurements with papers. Assumptions/dependencies: Access to TPU time; institutional support for reproducibility artifacts; storage for checkpoints and metadata.
Sector: Governance/Policy (policy). Near-term updates to RFPs and grants to evaluate AI infrastructure on CO2e/FLOP (CCI) and performance/Watt, not just $/performance. Tools/workflows: Procurement templates with embodied and operational emissions; carbon-adjusted TCO models; minimal reporting requirements for vendors. Assumptions/dependencies: Agreement on measurement boundaries; vendor willingness to disclose LCA; alignment with broader ESG frameworks.
Sector: Product (daily life, industry). Carbon labeling and “green mode” for AI features that surface estimated emissions per task (based on FLOPs × CCI) to end users and product teams. Tools/workflows: In-product telemetry tied to FLOP meters; UI affordances for carbon-aware choices; A/B tests for user adoption. Assumptions/dependencies: Product willingness to expose emissions; accurate FLOP estimates per feature; privacy-preserving telemetry.
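Several of the workflows above depend on "FLOP accounting" in training pipelines. One widely used rule of thumb for dense Transformer training, which is not taken from the paper and is only an approximation, estimates total training compute as about 6 floating point operations per parameter per training token. The sketch below applies it; the function name is hypothetical and the model size and token count are illustrative.

```python
def estimate_training_flops(num_parameters, num_tokens):
    """Approximate training compute for a dense Transformer.

    Uses the common rule of thumb of ~6 FLOPs per parameter per token
    (roughly 2 for the forward pass and 4 for the backward pass). This is
    an approximation and ignores attention cost and rematerialization.
    """
    return 6.0 * num_parameters * num_tokens


# Illustrative inputs: a 175-billion-parameter model trained on 300 billion tokens.
flops = estimate_training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")  # 3.15e+23 FLOPs
```

Multiplying such an estimate by a CCI value in gCO2e per unit of compute gives the per-run emissions figure that the procurement, reporting, and carbon-labeling applications call for.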

Long-Term Applications

The following applications are feasible but require further R&D, scaling, ecosystem changes, or standardization.

Sector: Semiconductor/Accelerators (industry). Broad adoption of the paper’s “six key features” (systolic MXUs, BF16/FP8, HBM, custom interconnects, DMA+scratchpad memory, vector units) as a de facto blueprint for 2020s training accelerators. Potential products: Multi-vendor training ASICs with standardized DMA/scratchpad programming models and systolic MXUs; interop toolchains. Assumptions/dependencies: EDA/IP availability; industry convergence on numeric formats and memory hierarchies; software ecosystem maturity.
Sector: Networking/Datacenter Fabrics (industry). Mainstream, multi-vendor optical circuit switching for AI clusters to increase availability, utilization, and modular deployment across racks/rows. Potential products: High-port-count MEMS OCS integrated with fabric managers; topology-agnostic schedulers; optical health monitoring. Assumptions/dependencies: Cost and reliability of OCS at scale; standard APIs to program topology; vendor-neutral management software.
Sector: Reliability/Assurance (industry, policy). Standardized in-silicon reliability primitives (FBIST-like screening, compiler-transparent hardware replay) for SDC detection across accelerators, with certification paths for regulated industries. Potential products: Fleet-wide health telemetry standards; third-party certs for ML reliability SLAs; SDC-rate reporting. Assumptions/dependencies: Hardware vendor support; agreed coverage metrics; regulatory acceptance.
Sector: MLOps/Scheduling (industry, academia). Carbon-aware and grid-aware training schedulers that time-shift and place jobs to minimize CO2e/FLOP while respecting SLAs, leveraging the CCI framework. Potential products: Schedulers integrating live grid carbon intensity and cluster topology; carbon budgets per project; emissions-based preemption policies. Assumptions/dependencies: Elastic workloads; accurate, real-time emissions factors; org-level carbon targets; business acceptance of time-shifting.
Sector: Compiler/Programming Models (industry, academia). DSLs and compilers that natively target DMA-driven scratchpads and deterministic VLIW timing across vendors, with automated fusion and parallelization hints (e.g., “Megacore” abstractions). Potential products: Cross-vendor Pallas-like kernels; portable collective offload API (including SparseCore/NIC offloads); determinism-first build systems. Assumptions/dependencies: Vendor cooperation; open IR specs; long-term investment in compiler toolchains.
Sector: Storage/Memory (industry). Expanding globally addressable HBM pools and memory-centric collectives to support trillion-parameter training and IO-efficient retrieval-augmented models. Potential products: Memory disaggregation for accelerators; HBM-aware key-value layers; near-memory compute extensions to SparseCores. Assumptions/dependencies: Interconnect advances; software for consistency and placement; cost/yield of high-stack HBM.
Sector: Sustainability/ESG (policy, industry). Embedding CCI into regulation, reporting standards, and market instruments (e.g., carbon-fee adjustments based on CO2e/FLOP; grant eligibility tied to emissions transparency). Potential products: National/international standards for AI emissions reporting; audit frameworks; carbon-aware cloud pricing. Assumptions/dependencies: Alignment among standards bodies and regulators; reliable LCA data; avoidance of perverse incentives.
Sector: Datacenter Design (industry). Densification through ubiquitous liquid cooling, vertical power delivery, and reconfigurable optical fabric designs to push performance/Watt and lower operational CCI. Potential products: Reference architectures for AI halls; modular manifold kits; liquid-cooled, OCS-native racks. Assumptions/dependencies: Facility retrofits; safety and maintenance protocols; capex/opex trade-off analyses.
Sector: Federated/Regulated AI (industry, academia). Multi-region, synchronous data-parallel training with high goodput for sensitive domains (healthcare/finance) that require regional isolation yet global convergence. Potential products: Region-aware slicing; confidential compute integration; policy-compliant checkpoint replication. Assumptions/dependencies: Legal frameworks for cross-region training; secure networking; robust checkpoint encryption.
Sector: Product/UX (daily life). User-facing carbon budgets and controls for AI features (e.g., “low-carbon training windows,” per-feature emissions caps), backed by FLOP estimation and CCI accounting. Potential products: Consumer dashboards for AI emissions; organizational “carbon SLAs” for product teams. Assumptions/dependencies: Cultural acceptance; accuracy of per-feature FLOP models; alignment with business metrics.
Sector: Security (industry). Leveraging deterministic execution and topology control to harden training pipelines against fault-injection and data-poisoning via anomaly detection in hardware replay/FBIST telemetry. Potential products: Security analytics for accelerator fleets; automatic isolation/reslicing on anomaly detection. Assumptions/dependencies: Telemetry integrity; red-team validation; integration with SOC workflows.
Sector: Robotics/Autonomy (industry, academia). Faster retraining/finetuning cycles for foundation models used in perception and control, enabled by high memory bandwidth and scalable collectives. Potential products: Continual training pipelines for fleets; rapid policy iteration with synthetic data (Diffusion + Transformers). Assumptions/dependencies: Data curation; validation/verification for safety; real-to-sim alignment.
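The carbon-aware scheduling idea in the list above (time-shifting jobs to minimize CO2e while respecting deadlines) can be illustrated with a small sketch. It assumes an hourly forecast of grid carbon intensity, a job that draws roughly constant power so that emissions scale with the sum of intensities over its run, and a hard deadline. The function name and the forecast values are hypothetical. A real scheduler would also account for cluster topology and preemption, which this sketch ignores.

```python
from typing import Sequence


def best_start_hour(intensity: Sequence[float], duration_h: int, deadline_h: int) -> int:
    """Return the start hour whose window has the lowest summed carbon intensity.

    intensity[h] is the forecast grid intensity for hour h (e.g., gCO2e/kWh).
    The job runs for duration_h consecutive hours and must finish by deadline_h.
    Ties are resolved in favor of the earliest start.
    """
    last_start = min(deadline_h, len(intensity)) - duration_h
    if duration_h <= 0 or last_start < 0:
        raise ValueError("job does not fit before the deadline")

    # Sliding-window sum over the forecast.
    window = sum(intensity[:duration_h])
    best, best_cost = 0, window
    for start in range(1, last_start + 1):
        window += intensity[start + duration_h - 1] - intensity[start - 1]
        if window < best_cost:
            best, best_cost = start, window
    return best


# Hypothetical hourly forecast, for illustration only.
forecast = [500, 450, 300, 200, 250, 400, 520, 480]
print(best_start_hour(forecast, duration_h=3, deadline_h=8))
```

Tightening the deadline shrinks the set of candidate windows, which is the trade-off between SLAs and emissions that the scheduling item describes.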

Notes on Cross-Cutting Assumptions and Dependencies

Hardware access: Many immediate benefits assume access to TPU v5p/Ironwood pods and OCS-enabled fabrics (currently unique to Google TPUs).
Model suitability: BF16/FP8 adoption requires model-specific validation to prevent accuracy regressions; some workloads may resist aggressive low-precision numerics.
Software readiness: Realizing DMA/scratchpad and vector/MXU advantages requires teams fluent in JAX/XLA/Pallas and topology-aware scheduling.
Supply chain and facilities: HBM availability, liquid cooling infrastructure, and optical hardware maturity impact timelines and cost.
Measurement: Accurate FLOP accounting and trustworthy embodied/operational emissions data are prerequisites for CCI-based decision-making.
Organizational processes: Reliability (FBIST/replay), carbon-aware scheduling, and modular deployment all depend on mature SRE/MLOps practices and leadership buy-in.
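To make the "Model suitability" note concrete, the following sketch shows how much precision BF16 keeps. It converts a value to float32 and then rounds it to BF16 (1 sign bit, 8 exponent bits, 7 fraction bits) using round-to-nearest-even on the 16 dropped bits. This is a plain-Python illustration with hypothetical helper names, handles finite inputs only, and is not how any TPU or framework implements the conversion.

```python
import math
import struct


def to_bf16_bits(x: float) -> int:
    """Round a finite value to BF16 and return its 16-bit pattern."""
    if not math.isfinite(x):
        raise ValueError("this sketch handles finite values only")
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the low 16 bits being dropped.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def from_bf16_bits(bits16: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def bf16_round(x: float) -> float:
    """Return x as it would be represented after rounding to BF16."""
    return from_bf16_bits(to_bf16_bits(x))


for value in (1.0, 1.0 + 2**-8, 3.14159, 1e38):
    print(value, "->", bf16_round(value))
```

BF16 keeps the float32 exponent range but only about two to three significant decimal digits, which is why accumulations and small updates need model-specific validation.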


Glossary

Accelerator Wall: A proposed limit where gains from specialized accelerators taper off after early generations. "TPUs have hurdled the Accelerator Wall [Fuchs19] that claims Moore's Law accounts for the majority of benefit after the first couple of generations."
AllGather: A collective communication primitive that gathers data from all nodes and distributes the combined result back to all nodes. "collective operations like AllReduce, AllGather, ReduceScatter, and Broadcast;"
AllReduce: A collective operation that reduces values (e.g., sums) across all nodes and shares the result with all nodes. "supporting common ML communication patterns, like AllReduce."
ASIC: Application-Specific Integrated Circuit; a chip designed for a particular workload rather than general-purpose use. "Skeptics initially warned that an ASIC might be too tailored to existing DNN models, quickly becoming outdated given AI's rapid pace."
bisection bandwidth: The aggregate bandwidth across the narrowest cut that divides the network into two equal halves, indicating interconnect capacity at scale. "interconnect bisection bandwidth both grew ~40X;"
Brain Float (BF16): A 16-bit floating-point format with an 8-bit exponent and 7-bit fraction, prioritizing range over precision. "In 16-bit Brain Float format (BF16), for the first time the exponent (8 bits) is larger than its fraction (7 bits)."
Broadcast: A collective operation that sends the same data from one node to all nodes. "collective operations like AllReduce, AllGather, ReduceScatter, and Broadcast;"
carbon dioxide equivalent (CO2e): A metric that expresses the warming impact of various greenhouse gases in terms of an equivalent amount of CO2. "Carbon dioxide equivalent (CO2e) measures the climate impact of GHGes like methane and nitrous oxide by converting them to the amount of CO2 that would cause the same amount of warming over, say, 100 years, using their global warming potential."
chiplet: A modular piece of a larger chip design that can be combined with others in a package. "Less visible are the advances in power delivery and regulation (including vertical power delivery) or the increase in die size, chiplet count, and packaging sophistication."
CISC-like instructions: Complex Instruction Set Computing style instructions that perform multi-step operations with single instructions. "Similar to TPU v1, the units execute CISC-like instructions and operate on variable-length inputs, where instruction execution time is data-dependent."
compute carbon intensity (CCI): Emissions per unit of computation performed (e.g., CO2e per FLOP), including both operational and embodied emissions. "The answer was a new metric: compute carbon intensity (CCI)."
dataflow architecture: A design where computation is organized around the flow of data through specialized units rather than a centralized control. "We consider SparseCore as a 'dataflow' architecture because data flows from memory to various specialized compute units."
Direct Memory Access (DMA): Hardware that transfers data between memory and devices without CPU intervention. "An asynchronous DMA (Direct Memory Access) unit transfers data between HBM and local vector memory."
Diffusion models: Generative models that iteratively refine noise into data samples, now prominent in AI workloads. "Diffusion models are now larger than CNNs."
distributed router: A routing function embedded in each chip that collectively forms the interconnect, avoiding separate router chips. "ICI, the TPU supercomputer interconnect, relies on a 3D Torus topology with a distributed router as part of every TPU chip; it needs no extra chips for TPU-to-TPU communication."
Embodied emissions: Greenhouse gas emissions from manufacturing and the supply chain of hardware, amortized over its lifetime. "Embodied emissions are amortized over six year lifetimes for all TPUs."
ExaFLOP: A unit representing 10¹⁸ floating-point operations (not per second), used to measure fixed computation amounts. "ExaFLOP (10¹⁸ FLOPs) was picked so that CO2e is in grams versus a smaller unit."
Fetch Unit: A microarchitectural block that reads data (e.g., activations, parameters) into a local memory for processing. "Each tile also includes a Fetch Unit, a programmable 8-wide SIMD Vector Processing Unit, and a Flush Unit."
FP8: An 8-bit floating-point format used to increase throughput and reduce memory bandwidth for AI workloads. "Ironwood also added support for FP8 arithmetic, which means it can also compute four 512x512 FP8 multiplies."
Flush Unit: A unit responsible for writing back updated parameters or data to memory after computation. "Each tile also includes a Fetch Unit, a programmable 8-wide SIMD Vector Processing Unit, and a Flush Unit."
Functional Built-In Self-Test (FBIST): On-chip logic that runs functional tests to detect latent or emerging hardware faults. "The Functional Built-In Self-Test (FBIST) engine, integrated within the MXU, executes high-coverage functional test patterns during manufacturing and data center burn-in..."
goodput: The effective rate of productive work (e.g., training progress) excluding overhead from retries, failures, or recovery. "Goodput is short for 'good throughput', which in training systems is the rate of good or effective training progress."
HBM (High Bandwidth Memory): Stacked DRAM providing very high memory bandwidth to accelerators via wide interfaces. "HBM (High Bandwidth Memory) capacity and bandwidth per TPU increased ~10X;"
High Level Optimizer (HLO): XLA’s intermediate representation for optimizing and compiling ML computations. "with a 'bridge' that translated from TensorFlow graphs into XLA's High Level Optimizer (HLO) format."
Hyperscalers: Very large cloud and internet companies operating massive data centers and custom infrastructure. "Hyperscalers Alibaba and Amazon (and eventually Microsoft) started their own DNN inference chips."
Inter-Chip Interconnect (ICI): TPU-to-TPU high-speed links used to build the training supercomputer network. "TPU v2 featured four off-chip links (Inter-Chip Interconnect or ICI) and two on-chip links to an on-chip router."
JAX: A high-performance machine learning framework with composable transformations that targets XLA/TPUs. "Today JAX (Just-in-time Auto-differentiated XLA) has become the language and system of choice for programming TPUs, with the Pallas kernel language adding fine-grained control for model developers."
Life-cycle assessment (LCA): A method to quantify environmental impacts across a product’s full life from materials to use. "Google recently completed a life-cycle assessment (LCA) of several TPUs [Schneider25]."
Matrix Multiply Unit (MXU): A specialized compute unit (often a systolic array) dedicated to matrix multiplications in TPUs. "The matrix multiply unit (MXU) is the computational heart of TPUs."
Megacore: A compiler-exposed abstraction that makes multiple physical cores appear as one large logical core for easier programming and resource unification. "has supported tensor parallelization directives that give the illusion of a single large core, called Megacore, unifying HBM capacity and ICI bandwidth in a single effective thread"
Micro-Electro-Mechanical Systems (MEMS): Tiny mechanical devices integrated with electronics, used here for fast optical switching. "3D Micro-Electro-Mechanical Systems (MEMS) mirrors that switch in milliseconds."
Optical circuit switches (OCSes): Reconfigurable optical switching systems that interconnect racks/cubes, improving availability and scheduling. "the first supercomputer to use optical circuit switches (OCSes) [Jouppi23]."
optical transceivers: Components that convert between electrical and optical signals for high-speed fiber interconnects. "Google advanced the state-of-the-art in reliability and cost of optical transceivers based on 3D Micro-Electro-Mechanical Systems (MEMS) mirrors that switch in milliseconds."
Pallas kernel language: A kernel-level language in JAX offering fine-grained control for custom performance-critical code on TPUs. "with the Pallas kernel language adding fine-grained control for model developers."
PCIe: Peripheral Component Interconnect Express; a high-speed interface connecting accelerators to host CPUs. "via a PCIe-connected CPU host."
pod: A TPU training supercomputer configuration composed of many interconnected TPU nodes. "This paper reviews five generations of TPU training supercomputers, also called pods."
ReduceScatter: A collective primitive that reduces data across nodes and distributes different reduced parts to different nodes. "collective operations like AllReduce, AllGather, ReduceScatter, and Broadcast;"
scatter/gather: Memory access patterns where elements are read from or written to non-contiguous addresses across nodes. "finer-grained access patterns for scatter/gather."
sea-of-cores: An architecture featuring many relatively simple cores operating in parallel. "They operate in a sea-of-cores configuration, integrating supercomputer-scale HBM and ICI to create a flat, globally addressable memory space."
Silent Data Corruption (SDC): Undetected data errors that can silently degrade correctness or convergence. "Silent Data Corruption (SDC) in compute logic presents a critical challenge to large-scale AI reliability [George26]."
SIMD (Single Instruction, Multiple Data): A parallel processing model where one instruction operates on multiple data elements simultaneously. "a programmable 8-wide SIMD Vector Processing Unit"
slices: Subsets of the supercomputer (by number of chips) allocated to a particular job. "Similar to HPC supercomputers, the workload comprises various scale sizes, termed 'slices,' i.e., 64, 128, ..., 2048 chips."
SparseCore: TPU’s specialized core for sparse operations and embeddings, also used to offload collectives and summarization. "The SparseCore is a domain-specific architecture initially for embedding training [Jouppi23]."
sublane: An additional data-parallel dimension within a vector lane that increases per-cycle parallelism. "known as a sublane, enabling operations on 8 sets of 128-wide vectors per clock cycle."
synchronous data-parallel training: A training approach where replicas synchronize (e.g., via collectives) each step across nodes. "Google employed synchronous data-parallel training to parallelise over multiple 8960-chip TPU v5p pods in multiple data centers for Gemini 2.5 with a goodput of 93% [Gemini25]."
systolic array: A grid of processing elements that rhythmically compute and pass data for efficient matrix operations. "In TPU v2 it was a 128x128 systolic array of multipliers and adders, delivering 32,768 operations per cycle."
tensor parallelization: Splitting tensors across devices/cores to scale model execution and memory. "has supported tensor parallelization directives that give the illusion of a single large core, called Megacore,"
TensorCore: TPU’s main compute core specialized for tensor operations with scalar, vector, and matrix units. "TPU v2 has two TensorCores."
Thermal Dissipation Power (TDP): The maximum heat a system is designed to dissipate under load, used as a proxy for power. "although it uses peak performance per TDP (Thermal Dissipation Power) Watt rather than measured performance and power running production workloads as Vahdat et al. recommend."
Top-K: An operation that selects the K highest (or lowest) values, often for summarization or pruning. "data summarization operations like Top-K;"
Torus topology: A network layout where nodes form a ring in each dimension, providing wraparound connections for uniform bandwidth/latency. "ICI, the TPU supercomputer interconnect, relies on a 3D Torus topology with a distributed router as part of every TPU chip;"
Vector Memory (VMEM): Scratchpad memory local to vector units/lane slices, explicitly managed by software/DMAs. "Each lane's register files perform loads and stores against its local slice of vector memory (VMEM)."
Vector Processing Unit (VPU): A unit executing vector instructions across many lanes for non-matrix operations. "Ironwood introduces a hardware replay unit for the VPU."
Very Long Instruction Word (VLIW): An architecture that encodes multiple operations per long instruction, exposing instruction-level parallelism to the compiler. "TensorCore's scalar unit fetches complete VLIW (Very Long Instruction Word) bundles of 322 bits from a local instruction memory,"
vertical power delivery: Supplying power through vertical interconnects in the package to improve power integrity/density. "including vertical power delivery"
wraparound links: Connections that link opposite faces of a torus network to complete the ring in each dimension. "To create the wraparound links of a 3D torus, links on opposing sides must connect to the same OCS."
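The collective primitives defined in this glossary (AllGather, AllReduce, Broadcast, ReduceScatter) can be made concrete with a single-process simulation, where each inner list stands for the data held by one node. This is a sketch with hypothetical function names. It shows only the semantics of each primitive, not how ICI or any TPU library implements them. It also uses the standard identity that an AllReduce can be composed from a ReduceScatter followed by an AllGather.

```python
from typing import List

Vector = List[float]


def reduce_scatter(per_node: List[Vector]) -> List[Vector]:
    """Sum element-wise across nodes; node i keeps only the i-th chunk of the sum."""
    n = len(per_node)
    length = len(per_node[0])
    if any(len(v) != length for v in per_node) or length % n != 0:
        raise ValueError("vectors must have equal length divisible by node count")
    chunk = length // n
    total = [sum(column) for column in zip(*per_node)]
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]


def all_gather(per_node: List[Vector]) -> List[Vector]:
    """Concatenate every node's data and give each node the full result."""
    combined = [x for part in per_node for x in part]
    return [list(combined) for _ in per_node]


def all_reduce(per_node: List[Vector]) -> List[Vector]:
    """Every node ends with the element-wise sum (ReduceScatter, then AllGather)."""
    return all_gather(reduce_scatter(per_node))


def broadcast(per_node: List[Vector], root: int = 0) -> List[Vector]:
    """Every node ends with a copy of the root node's data."""
    return [list(per_node[root]) for _ in per_node]


data = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]  # two simulated nodes
print(reduce_scatter(data))
print(all_reduce(data))
```

In this toy setup the ReduceScatter step leaves each node with half of the summed vector, and the AllGather step reassembles the full sum on every node.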
