Google's Training Supercomputers from TPU v2 to Ironwood: Architectural Stability, Scale, Resilience, Power Efficiency, and Sustainability Across Five Generations
Abstract: This paper (to appear in the July/August 2026 issue of IEEE Micro magazine) summarizes five generations of Google's TPUs, from TPU v2 to Ironwood, highlighting their evolution as scalable, resilient, power-efficient, sustainable supercomputers for AI training. It details the TPU's stable architecture, which has surprisingly easily accommodated the rapidly changing deep neural network workloads, such as the rise of Transformers. Key advancements over eight years include a 10x increase in HBM capacity and bandwidth per node, a 100x increase in peak node performance, and a 3600x increase in supercomputer performance. The paper also discusses the role of optical circuit switches, built-in self test, and hardware replay in enhancing resilience and how the TPU's environmental impact is reduced with substantial improvements in performance per Watt and in carbon emissions per floating point operation. It concludes by identifying six features that may well characterize successful training accelerators of this decade.
Explain it Like I'm 14
Google’s Training Supercomputers from TPU v2 to Ironwood — A Simple Explanation
1) What is this paper about?
This paper tells the story of how Google’s special AI chips—called TPUs (Tensor Processing Units)—grew from their second version (TPU v2) to a much newer one nicknamed Ironwood. It explains how these chips and the huge “pods” (supercomputers made of thousands of TPUs working together) became faster, bigger, more reliable, and more energy‑efficient over about eight years. It also shares what design choices helped TPUs keep up with fast‑changing AI models like Transformers.
2) What questions were the authors trying to answer?
In plain terms, the paper asks:
- Can a chip designed years ago still work well for today’s fast‑changing AI models?
- How did Google scale TPUs from small systems to giant supercomputers?
- How did they make these systems more reliable, so long training jobs don’t fail?
- How much more energy‑efficient and climate‑friendly did TPUs get over time?
- What design ideas seem to matter most for successful AI training chips this decade?
3) How did they study it?
This is a “big picture” review of five TPU generations (v2, v3, v4, v5p, Ironwood). The authors:
- Compared the hardware across generations: speed, memory, networking, size, and cooling.
- Described the TPU architecture (the way the chip is organized) and showed how it barely changed, even as it scaled up.
- Explained reliability features, like how TPUs detect errors and recover without wasting lots of time.
- Used simple, consistent measurements to track progress:
  - Performance per Watt (how much work you get for each unit of power).
  - Goodput (the “useful” training progress after subtracting time lost to hiccups).
  - Compute Carbon Intensity (CCI), which is the grams of CO2 emitted per unit of computation, counting both electricity use and the carbon “cost” to manufacture the hardware.
- Gave real engineering examples: special networks that use light to rewire around failures, self‑tests built into the chips, and new number formats (like BF16 and FP8) that make math faster without hurting AI training.
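The goodput metric in the list above can be made concrete with a small calculation. The sketch below is a simplified illustration rather than the paper's own formula: it treats goodput as the fraction of wall-clock time that produced training progress that was kept. The function name `goodput` and every hour figure are hypothetical.

```python
def goodput(wall_clock_hours: float, lost_hours: float) -> float:
    """Fraction of wall-clock time that produced training progress that was kept."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    if not 0 <= lost_hours <= wall_clock_hours:
        raise ValueError("lost_hours must lie between 0 and wall_clock_hours")
    return (wall_clock_hours - lost_hours) / wall_clock_hours


# Hypothetical 100-hour job with 3 interruptions. Each one costs 0.5 h to
# restart plus 1.0 h of work redone since the last checkpoint (made-up values).
interruptions = 3
lost = interruptions * (0.5 + 1.0)
print(f"goodput = {goodput(100.0, lost):.1%}")
```

Under this simple model, anything that shortens restarts or reduces redone work (for example, rerouting around a failed part instead of waiting for a repair) raises goodput directly.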
Here are a few technical terms in everyday language:
- Matrix Multiply Unit (MXU): Think of this like an assembly line that multiplies big grids of numbers. AI training needs a lot of this. The MXU is built as a “systolic array,” which is like a grid of workers passing numbers along in rhythm to finish big math jobs quickly.
- HBM (High Bandwidth Memory): Super‑fast memory placed right next to the chip—like keeping ingredients on the counter instead of running to the pantry every time.
- Interconnect (ICI) and 3D Torus network: The high‑speed “roads” that let TPUs talk to each other in a 3D loop, so they can share results fast.
- Optical Circuit Switches (OCS): A light‑based “patch panel” that can rewire which TPUs connect to which, in milliseconds. This helps route around broken parts and makes scheduling easier.
- SparseCores: Helper units that handle “sparse” data (lots of zeros, like looking up a few items in a huge catalog) and speed up communication jobs.
- DMA to scratchpad memory: Like a helper who quietly brings data from the stockroom (HBM) to the workbench (on‑chip memory) so the main workers never have to stop.
- VLIW instructions: A “wide” set of instructions that control many units at once, like a conductor’s score guiding the whole orchestra in one go.
- BF16 / FP8: Number formats that trade a little precision for a lot of speed and range, which turns out to be great for AI training.
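The MXU entry above describes a systolic array as a grid of workers passing numbers along in rhythm. The following is a minimal Python sketch of that idea. It assumes an output-stationary dataflow, chosen only because it is easy to simulate; the text here does not say which dataflow the MXU uses, and `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells.

    Each cell (i, j) keeps one running sum for C[i][j]. Values of A move one
    cell to the right per step, values of B move one cell down per step, and
    inputs are injected with a skew so matching operands meet in each cell.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]    # running sums, one per cell
    a_reg = [[0] * m for _ in range(n)]  # A values currently held in each cell
    b_reg = [[0] * m for _ in range(n)]  # B values currently held in each cell

    for t in range(n + m + k - 2):
        # Shift A values one cell to the right, then inject at the left edge.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i  # row i starts i steps late (input skew)
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        # Shift B values one cell down, then inject at the top edge.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j  # column j starts j steps late (input skew)
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every cell multiplies what it holds and adds it to its running sum.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The point of the structure is that each cell only talks to its immediate neighbors, so operands are fetched from memory once and then reused as they travel across the grid.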
4) What did they find, and why is it important?
Over eight years, TPUs got dramatically better while keeping the same overall design. The main wins:
- Much bigger and faster
  - About 100× more performance per TPU chip.
  - About 3600× more peak performance for a full supercomputer (“pod”).
  - Memory (HBM) per chip grew ~10× in size and speed.
  - Pods grew from hundreds to over 9000 chips, with far more network bandwidth.
  - A huge shared memory space across the pod (up to ~1.77 petabytes), a record for AI supercomputers.
- Architecture stayed stable (which is surprising in AI)
  - The basic TPU v2 layout still works for Ironwood.
  - Two big “TensorCores” per chip remained the standard. Software can make them act like one giant core (“Megacore”), which makes programming simpler.
  - Systolic arrays for matrix math (the MXU) kept scaling up.
  - Vector units got stronger for the non‑matrix parts of AI models (like activations and normalization).
  - HBM stayed the right memory choice.
  - The software stack (XLA, now often used via JAX) kept improving on top of the same foundation.
- Reliability improved a lot
  - Optical switches (OCS) let the system route around problems and make scheduling big jobs easier. You don’t need a perfect system before starting work; you can bring racks online as they’re ready.
  - Built‑In Self‑Test (FBIST) in hardware finds chips with hidden issues.
  - Hardware replay in the vector unit double‑checks some operations “for free” to catch rare errors.
  - Together, these keep “goodput” high (the amount of useful training progress), even when a few parts fail.
- Energy efficiency jumped
  - Performance per Watt increased by about 30× across generations, with Ironwood delivering a big extra boost. This matters because data centers are limited by how much power they can get and cool.
- Sustainability improved
  - The paper uses a fairer climate metric called Compute Carbon Intensity (CCI), which counts both:
    - Operational carbon (electricity used while running).
    - Embodied carbon (emissions from manufacturing the hardware).
  - Newer TPUs have much lower emissions per unit of compute, because they finish the same job much faster and more efficiently.
- Big design lessons
  - The authors highlight six design choices that seem to define successful AI training chips in the 2020s:
    1) Systolic arrays for fast matrix multiplies.
    2) Narrow, range‑friendly number formats (like BF16, FP8) instead of wide IEEE floats for everything.
    3) HBM as main memory.
    4) Custom, high‑speed links to build large AI supercomputers.
    5) Software‑controlled memory (scratchpads + DMA) instead of automatic caches.
    6) Strong vector units for non‑matrix math.
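The second design choice, range-friendly number formats, is easiest to see in bits. BF16 keeps the sign bit and the 8 exponent bits of a 32-bit IEEE float and only 7 fraction bits, so it is the top half of a float32. The sketch below is an illustration under stated assumptions: it handles finite values within float32 range only (no NaN handling), rounds to nearest-even, and the helper names are hypothetical.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16_bits(x: float) -> int:
    """Round x to BF16 (round-to-nearest-even) and return the 16-bit pattern.

    Assumption: x is finite and fits in float32; NaN is not handled here.
    """
    bits = float32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit rounds ties to the even pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


for x in (1.0, 3.14159, 1e38):
    b = to_bf16_bits(x)
    print(f"{x!r:>10} -> bits {b:#06x} -> {bf16_bits_to_float(b)!r}")
```

The loop shows the trade: 3.14159 loses digits (only 7 fraction bits survive), while a value as large as 1e38 stays finite because the exponent field is as wide as float32's.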
Why this matters: AI models are bigger than ever, especially Transformers. Training them quickly, reliably, and with fewer emissions needs both smart design and scale. TPUs show it’s possible to keep a stable design and still ride the wave of change.
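The headline multipliers in this section can be turned into implied yearly growth rates. This is a derived illustration, not a figure from the paper: it assumes each multiplier applies over the same eight-year span and that growth compounds evenly, and `annual_growth` is a hypothetical helper.

```python
def annual_growth(total_factor: float, years: float) -> float:
    """Per-year multiplier that compounds to total_factor over the given years."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years)


# Multipliers quoted in this summary; the 8-year span for each is an assumption.
headline = {
    "HBM capacity and bandwidth per chip": 10,
    "performance per Watt": 30,
    "peak performance per chip": 100,
    "peak performance per pod": 3600,
}
for name, factor in headline.items():
    rate = annual_growth(factor, 8)
    print(f"{name}: {factor}x over 8 years is about {rate:.2f}x per year")
```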
5) So what does this mean for the future?
- Faster progress in AI: Bigger and more efficient TPU pods mean researchers and engineers can train huge models faster and more reliably, pushing AI forward.
- Practical reliability at massive scale: Features like optical switches and built‑in self‑tests make it possible to run long jobs across thousands of chips without constant breakdowns.
- Lower energy and carbon per task: Since data center power is limited, performance per Watt—and the fuller CCI metric—will drive future designs. TPUs show strong gains on both.
- Clear design playbook: The six features above give a roadmap for what works in training accelerators. Many other chips are adopting similar ideas.
- Stable foundations help everyone: Because TPU architecture stayed steady, software and models improved steadily too. That reduces the time from “new chip” to “real results,” which is great for users and the environment.
In short, Google’s TPUs grew from powerful chips into massive, efficient, and reliable AI supercomputers—without needing to reinvent their core design each time. That combination of stability, scale, and sustainability is likely to guide how the best AI training machines are built in the years ahead.
Knowledge Gaps
Unresolved Knowledge Gaps, Limitations, and Open Questions
Below is a single, consolidated list of concrete gaps and open questions that future work could address:
- Architecture/microarchitecture scale limits: What are the practical limits to scaling the current two–TensorCore per chip architecture and the compiler-level Megacore abstraction as model sizes, context lengths, and activation memory footprints continue to grow?
- VMEM bottlenecks: Given SRAM density’s slow growth (VMEM only 4× over 8 years), what workloads are now VMEM-limited, and what architectural or packaging alternatives (e.g., more SRAM chiplets, stacked SRAM, NVRAM tiers) best alleviate this?
- Remote memory semantics: The paper claims “directly-addressable shared HBM” across the pod; what are the precise programming, coherence/consistency, and protection semantics, and how are contention, partitioning, and security enforced at petabyte scale?
- DMA push-only constraint: How often does push-only inter-TPU DMA constrain algorithms or lead to software complexity/performance loss versus full RDMA (with reads)? What would be the costs/benefits of adding safe remote reads in hardware?
- SparseCore generality and ROI: Beyond embeddings and a handful of collectives/top-k ops, which additional sparse or irregular kernels benefit materially from SparseCores, and what is their measured end-to-end training speedup across modern LLM, diffusion, and MoE workloads?
- Structured sparsity in MXU: Is there hardware support (or planned support) for structured sparsity (e.g., 2:4), block sparsity, or semi-structured sparsity in MXUs, and what are the accuracy/perf trade-offs relative to dense FP8/BF16?
- Attention-specific acceleration: Are there dedicated mechanisms for attention/k-v cache management (e.g., blockwise attention, sliding-window kernels, in-MXU softmax/normalization), and how do these compare to vector-only implementations under long-context LLMs?
- FP8/FP4 training stability: What are the convergence/robustness envelopes for FP8 (and potential FP4) across diverse architectures (MoE, diffusion, RNN-Transducers), tasks (ASR, vision, RL), and scales, including required loss-scaling and calibration heuristics?
- Accumulation precision choices: How often do FP32 accumulations become a bottleneck or a source of nondeterminism, and are mixed-precision accumulation strategies needed for very deep models or very large batch collectives?
- Redundant MXU rows efficacy: What is the empirical yield, performance, and field reliability impact of redundant MXU rows, and is one redundant row sufficient at Ironwood scale and process nodes?
- HBM reliability/ECC coverage: What are end-to-end ECC/parity protections across HBM stacks, on-die SRAMs (VMEM, registers, instruction memory), ICI links, and OCS optics, and what are observed soft error and wear-out rates at fleet scale?
- ICI latency and tail behavior: The paper reports bandwidth but not latency. What are single-hop and multi-hop latencies, queueing/tail latency under large collectives, and the efficacy of congestion control/credit mechanisms at 9K nodes?
- OCS reconfiguration impacts: How do OCS reconfiguration delays, control-plane reliability, optical transceiver failure rates, and drift affect long-running jobs’ goodput, and what are the operational playbooks (e.g., live reroute vs. job pause) under faults?
- OCS energy/TCO trade-offs: What is the incremental embodied and operational energy/carbon and TCO overhead of OCS infrastructure versus alternative topologies (e.g., electrical fat-tree or dragonfly with per-rack spares)?
- Multi-pod/regional scaling: The paper cites >90% goodput across multiple regions; what are the bandwidth/latency, checkpoint cadence, failure domains, and optimizer hyperparameter adjustments needed when stretching slices across pods/regions?
- Scheduler policies and fragmentation: How does the OCS-enabled scheduler handle multi-tenant fairness, preemption, defragmentation, and heterogeneity (mixed job sizes), and what is the goodput/utilization frontier at production scale?
- Checkpoint/restore overheads: What are checkpoint interval policies, storage bandwidth requirements, and recovery times under typical failure rates, and how do these interact with data parallelism-only training of giant models?
- Determinism guarantees: With VLIW, collectives, FP8/BF16 reductions, and compiler fusions, how is strict determinism achieved across thousands of nodes, and what is the remaining nondeterminism budget (if any) acceptable for convergence regression detection?
- SDC detection coverage/limits: What quantitative coverage, false positive/negative rates, and time-to-detect do FBIST and VPU hardware replay achieve under real workloads? Is there an MXU-path replay or online redundancy, and what is the runtime overhead envelope?
- Environmental/thermal intermittents: How do voltage droops, temperature transients, and power capping interact with SDC rates and replay efficacy, and what adaptive controls (e.g., DVFS, throttling) minimize error bursts without sacrificing goodput?
- Liquid cooling sustainability: What are the water use, coolant leakage rates, maintenance burdens, and embodied/operational carbon of the liquid-cooling plant, and how do these factor into LCA beyond chip manufacturing and electricity?
- Performance-per-Watt methodology: Reported gains use peak TFLOPS per TDP. What are measured performance-per-Watt and per-goodput on production training workloads, including idle/partial-load periods, power capping, and cooling overhead?
- CCI assumptions and regional variability: How sensitive are CCI results to grid carbon intensity (location-based), curtailment/temporal matching of carbon-free energy, and hardware lifetime assumptions (6 years) across different deployment geographies?
- CCI calculation inconsistency: The GPT-3 example appears unit-inconsistent (3.14e23 FLOPs × 265 g/EFLOP yields ~83e6 g = ~83 metric tons, not “~83 million metric tons”); audited examples and standardized unit handling are needed.
- Embodied carbon granularity: How do differences in packaging (HBM stacks count, chiplet count), OCS gear, racks, and cooling infrastructure contribute to embodied CCI by component, and what design choices most reduce embodied emissions per FLOP?
- Software stack portability: With JAX/XLA now primary, what is the maturity, overhead, and feature parity for PyTorch and other ecosystems (dynamo/Inductor, XLA:GPU parity), especially for dynamic shapes, control flow, and custom kernels?
- Compile time and dev ergonomics: What are XLA compile times, caching strategies, Pallas kernel development costs, and debugging/profiling tooling gaps for large-model inner loops and fused ops?
- Model-parallel strategies: Beyond tensor parallel “Megacore,” what are the limits and best practices for pipeline and sequence parallelism on TPUs, including optimizer sharding and KV cache sharding for very long context inference/training?
- MoE and all-to-all stress: How do SparseCores and ICI/OCS handle MoE all-to-all at scale (switch oversubscription, tail latency, imbalance), and what kernel/network co-designs mitigate hotspotting and dropped experts?
- Long-context memory pressure: With 1M+ token contexts, how is KV cache managed across HBM and interconnect, what are spill policies, and what perf/accuracy trade-offs arise from compression or blockwise eviction?
- Security and multi-tenancy: What are isolation guarantees (VMEM/HBM protection, PCIe/host DMA isolation), memory encryption availability, and side-channel mitigations under multi-tenant training/serving?
- Comparative evaluation: The paper lacks head-to-head results against contemporary accelerators (GPUs/other ASICs) on standardized end-to-end training tasks, total cost to train, goodput under failures, and energy/CCI—critical for external validation.
- Cost and availability constraints: What are the capital and operational cost profiles (including OCS), supply-chain risks (HBM availability, optics), and their impact on deployment scale and scheduling policy choices?
- Future scaling path: What are the identified blockers (reticle limit, package power density, HBM pinout, optical IO) for a next 10× increase in pod scale/performance, and which roadmap options (more chiplets, photonic IO, 3D stacking) are prioritized?
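The unit check in the "CCI calculation inconsistency" item above can be reproduced in a few lines. The sketch uses only the two figures quoted in that item (3.14e23 FLOPs and 265 g per ExaFLOP); the function and constant names are hypothetical.

```python
FLOPS_PER_EXAFLOP = 1e18      # 1 ExaFLOP = 10^18 floating-point operations
GRAMS_PER_METRIC_TON = 1e6    # 1 metric ton = 1,000 kg = 1,000,000 g


def emissions_metric_tons(total_flops: float, cci_g_per_exaflop: float) -> float:
    """Convert a FLOP count and a CCI (grams CO2e per ExaFLOP) to metric tons."""
    exaflops = total_flops / FLOPS_PER_EXAFLOP
    grams = exaflops * cci_g_per_exaflop
    return grams / GRAMS_PER_METRIC_TON


tons = emissions_metric_tons(3.14e23, 265)
print(f"{tons:.1f} metric tons CO2e")
```

Keeping the conversion factors as named constants makes the six-orders-of-magnitude gap between grams and metric tons hard to overlook.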
Practical Applications
Immediate Applications
Below are concrete applications that can be deployed now, drawing directly from the paper’s results on architecture, resilience, efficiency, and sustainability.
- Sector: AI/Software (industry, academia). High-goodput training of large Transformer and Diffusion models on TPU v5p/Ironwood-class pods using synchronous data-parallel slices (e.g., 2K-node jobs) to reach >90% goodput. Tools/workflows: JAX + XLA with tensor-parallel “Megacore,” deterministic VLIW execution, topology-aware schedulers; frequent checkpoint/restore. Assumptions/dependencies: Access to TPU v5p/Ironwood capacity (e.g., via Google Cloud or colocated pods), model parallelizability, robust input pipelines and storage throughput.
- Sector: Advertising, Retail, Search (industry). Acceleration of embedding-heavy recommendation systems by offloading to SparseCores for all-to-all scatters/gathers, Top-K, and small sparse tensor operations. Tools/workflows: XLA fusions, SparseCore kernels, dataflow partitioning of embeddings across supercomputer-addressable HBM. Assumptions/dependencies: Embedding tables engineered for SparseCore granularity; software support for collective offload; dataset features engineered for scatter/gather locality.
- Sector: Finance, Healthcare, Safety-critical ML (industry, academia). Improved reliability of long-running training via FBIST and compiler-transparent hardware replay to detect silent data corruption with near-zero overhead. Tools/workflows: Periodic FBIST during burn-in and fleet operation; automatic vector bundle replay monitoring; immediate removal of suspect nodes via reconfigurable interconnect. Assumptions/dependencies: Availability of Ironwood-class hardware; operational processes to quarantine/replace nodes; alignment with compliance/audit requirements.
- Sector: Cloud/Datacenter Operations (industry). Higher utilization and faster time-to-production through optically reconfigurable “cube” deployment: pods remain usable during staged bring-up and can schedule slices without requiring contiguous racks. Tools/workflows: OCS-driven reconfigurable 3D torus; scheduler that composes slices from noncontiguous cubes; modular installation/repair playbooks. Assumptions/dependencies: OCS-capable clusters (e.g., TPU v4+); trained ops teams; spare cube capacity policy.
- Sector: MLOps/DevEx (industry, academia). Performance-portable kernels and model development with JAX + Pallas to precisely control VMEM/DMAs and vector/MXU usage across TPU generations. Tools/workflows: Kernel autotuning, fusion-friendly graph transformations in XLA, deterministic timing to reproduce perf/accuracy across releases. Assumptions/dependencies: Dev teams able to adopt JAX/Pallas; profiling and compiler literacy; test infrastructure for regression and determinism.
- Sector: Sustainability/ESG, Procurement (industry, policy). Immediate adoption of compute carbon intensity (CCI; gCO2e/FLOP) as a planning, procurement, and reporting metric for AI workloads. Tools/workflows: Add FLOP accounting to training pipelines; LCA-informed vendor comparisons; quarterly CO2e/FLOP reporting; carbon-aware cost models. Assumptions/dependencies: Access to embodied emissions data from vendors; reliable electricity emissions factors (location- and market-based); FLOP estimation in MLOps.
- Sector: Model Efficiency (industry, academia). Transition to range-oriented numerics (BF16 now, FP8 where validated) to increase throughput and lower energy/emissions while preserving convergence. Tools/workflows: Mixed-precision training recipes; calibration and loss-scaling; automated numeric validation in CI. Assumptions/dependencies: Model families validated under BF16/FP8; tolerance to quantization; monitoring for rare numeric instabilities.
- Sector: HPC/AI Infrastructure (industry, academia). Topology-aware job scheduling on torus networks for better goodput and fairness under failures (with OCS rerouting around bad hosts). Tools/workflows: Scheduler plugins aware of 3D torus coordinates and cube availability; preemption and reslicing policies; spare cube pools. Assumptions/dependencies: Access to topology metadata; policy and SRE readiness for dynamic rerouting; multi-tenant fairness governance.
- Sector: Data Engineering (industry). Balanced data ingestion with PCIe-connected hosts and storage tailored to TPU throughput to eliminate input bottlenecks during large-scale training. Tools/workflows: End-to-end profiling (host, network, storage); RDMA or high-throughput dataloaders; prefetch with DMAs and sync flags. Assumptions/dependencies: Sufficient host NIC bandwidth; storage parallelism; dataset sharding aligned with training slices.
- Sector: Education/Research (academia). Reproducible training experiments leveraging deterministic VLIW execution, stable architecture, and compiler-controlled memory to improve scientific rigor. Tools/workflows: Version-pinned XLA graphs and kernels; deterministic seeds/exec order; open-sourcing of FLOP and CCI measurements with papers. Assumptions/dependencies: Access to TPU time; institutional support for reproducibility artifacts; storage for checkpoints and metadata.
- Sector: Governance/Policy (policy). Near-term updates to RFPs and grants to evaluate AI infrastructure on CO2e/FLOP (CCI) and performance/Watt, not just $/performance. Tools/workflows: Procurement templates with embodied and operational emissions; carbon-adjusted TCO models; minimal reporting requirements for vendors. Assumptions/dependencies: Agreement on measurement boundaries; vendor willingness to disclose LCA; alignment with broader ESG frameworks.
- Sector: Product (daily life, industry). Carbon labeling and “green mode” for AI features that surface estimated emissions per task (based on FLOPs × CCI) to end users and product teams. Tools/workflows: In-product telemetry tied to FLOP meters; UI affordances for carbon-aware choices; A/B tests for user adoption. Assumptions/dependencies: Product willingness to expose emissions; accurate FLOP estimates per feature; privacy-preserving telemetry.
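Topology-aware scheduling on a torus, mentioned in the list above, starts from knowing which nodes are adjacent. The sketch below shows the wraparound arithmetic for a 3D torus. It is a generic illustration: the 4×4×4 shape, the coordinate convention, and the function names are assumptions, not details of the TPU scheduler.

```python
def torus_neighbors(coord, dims):
    """Return the neighbors of coord in a 3D torus of shape dims.

    Each axis wraps around, so a node on an edge links back to the
    opposite edge. With every dimension >= 3 there are six distinct neighbors.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hop count between a and b, going either way around each axis."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


dims = (4, 4, 4)  # hypothetical torus shape
print(torus_neighbors((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (3, 3, 3), dims))
```

A scheduler could use a distance function like `torus_hops` to prefer placements that keep communicating nodes close, although real placement policies involve many more constraints.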
Long-Term Applications
The following applications are feasible but require further R&D, scaling, ecosystem changes, or standardization.
- Sector: Semiconductor/Accelerators (industry). Broad adoption of the paper’s “six key features” (systolic MXUs, BF16/FP8, HBM, custom interconnects, DMA+scratchpad memory, vector units) as a de facto blueprint for 2020s training accelerators. Potential products: Multi-vendor training ASICs with standardized DMA/scratchpad programming models and systolic MXUs; interop toolchains. Assumptions/dependencies: EDA/IP availability; industry convergence on numeric formats and memory hierarchies; software ecosystem maturity.
- Sector: Networking/Datacenter Fabrics (industry). Mainstream, multi-vendor optical circuit switching for AI clusters to increase availability, utilization, and modular deployment across racks/rows. Potential products: High-port-count MEMS OCS integrated with fabric managers; topology-agnostic schedulers; optical health monitoring. Assumptions/dependencies: Cost and reliability of OCS at scale; standard APIs to program topology; vendor-neutral management software.
- Sector: Reliability/Assurance (industry, policy). Standardized in-silicon reliability primitives (FBIST-like screening, compiler-transparent hardware replay) for SDC detection across accelerators, with certification paths for regulated industries. Potential products: Fleet-wide health telemetry standards; third-party certs for ML reliability SLAs; SDC-rate reporting. Assumptions/dependencies: Hardware vendor support; agreed coverage metrics; regulatory acceptance.
- Sector: MLOps/Scheduling (industry, academia). Carbon-aware and grid-aware training schedulers that time-shift and place jobs to minimize CO2e/FLOP while respecting SLAs, leveraging the CCI framework. Potential products: Schedulers integrating live grid carbon intensity and cluster topology; carbon budgets per project; emissions-based preemption policies. Assumptions/dependencies: Elastic workloads; accurate, real-time emissions factors; org-level carbon targets; business acceptance of time-shifting.
- Sector: Compiler/Programming Models (industry, academia). DSLs and compilers that natively target DMA-driven scratchpads and deterministic VLIW timing across vendors, with automated fusion and parallelization hints (e.g., “Megacore” abstractions). Potential products: Cross-vendor Pallas-like kernels; portable collective offload API (including SparseCore/NIC offloads); determinism-first build systems. Assumptions/dependencies: Vendor cooperation; open IR specs; long-term investment in compiler toolchains.
- Sector: Storage/Memory (industry). Expanding globally addressable HBM pools and memory-centric collectives to support trillion-parameter training and IO-efficient retrieval-augmented models. Potential products: Memory disaggregation for accelerators; HBM-aware key-value layers; near-memory compute extensions to SparseCores. Assumptions/dependencies: Interconnect advances; software for consistency and placement; cost/yield of high-stack HBM.
- Sector: Sustainability/ESG (policy, industry). Embedding CCI into regulation, reporting standards, and market instruments (e.g., carbon-fee adjustments based on CO2e/FLOP; grant eligibility tied to emissions transparency). Potential products: National/international standards for AI emissions reporting; audit frameworks; carbon-aware cloud pricing. Assumptions/dependencies: Standards bodies and regulators alignment; reliable LCA data; avoidance of perverse incentives.
- Sector: Datacenter Design (industry). Densification through ubiquitous liquid cooling, vertical power delivery, and reconfigurable optical fabric designs to push performance/Watt and lower operational CCI. Potential products: Reference architectures for AI halls; modular manifold kits; liquid-cooled, OCS-native racks. Assumptions/dependencies: Facility retrofits; safety and maintenance protocols; capex/opex trade-off analyses.
- Sector: Federated/Regulated AI (industry, academia). Multi-region, synchronous data-parallel training with high goodput for sensitive domains (healthcare/finance) that require regional isolation yet global convergence. Potential products: Region-aware slicing; confidential compute integration; policy-compliant checkpoint replication. Assumptions/dependencies: Legal frameworks for cross-region training; secure networking; robust checkpoint encryption.
- Sector: Product/UX (daily life). User-facing carbon budgets and controls for AI features (e.g., “low-carbon training windows,” per-feature emissions caps), backed by FLOP estimation and CCI accounting. Potential products: Consumer dashboards for AI emissions; organizational “carbon SLAs” for product teams. Assumptions/dependencies: Cultural acceptance; accuracy of per-feature FLOP models; alignment with business metrics.
- Sector: Security (industry). Leveraging deterministic execution and topology control to harden training pipelines against fault-injection and data-poisoning via anomaly detection in hardware replay/FBIST telemetry. Potential products: Security analytics for accelerator fleets; automatic isolation/reslicing on anomaly detection. Assumptions/dependencies: Telemetry integrity; red-team validation; integration with SOC workflows.
- Sector: Robotics/Autonomy (industry, academia). Faster retraining/finetuning cycles for foundation models used in perception and control, enabled by high memory bandwidth and scalable collectives. Potential products: Continual training pipelines for fleets; rapid policy iteration with synthetic data (Diffusion + Transformers). Assumptions/dependencies: Data curation; validation/verification for safety; real-to-sim alignment.
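The carbon-aware scheduling idea in the list above can be sketched as a sliding-window search. This is a toy illustration, not a described product: it assumes the job draws constant power, so the lowest average grid carbon intensity also gives the lowest operational emissions. The forecast values (gCO2e/kWh per hour) and the name `best_start_hour` are made up.

```python
def best_start_hour(carbon_intensity, job_hours):
    """Return (start_index, mean_intensity) of the lowest-carbon contiguous window."""
    if not 0 < job_hours <= len(carbon_intensity):
        raise ValueError("job_hours must be between 1 and the forecast length")
    window = sum(carbon_intensity[:job_hours])
    best_start, best_sum = 0, window
    for start in range(1, len(carbon_intensity) - job_hours + 1):
        # Slide the window: add the hour entering, drop the hour leaving.
        window += carbon_intensity[start + job_hours - 1] - carbon_intensity[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum / job_hours


# Hypothetical hourly forecast of grid carbon intensity, in gCO2e/kWh.
forecast = [420, 410, 390, 300, 220, 180, 190, 260, 350, 400]
start, mean = best_start_hour(forecast, job_hours=3)
print(f"start at hour {start}, mean intensity {mean:.1f} gCO2e/kWh")
```

A production scheduler would also have to weigh deadlines, cluster availability, and embodied emissions, which this sketch ignores.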
Notes on Cross-Cutting Assumptions and Dependencies
- Hardware access: Many immediate benefits assume access to TPU v5p/Ironwood pods and OCS-enabled fabrics (currently unique to Google TPUs).
- Model suitability: BF16/FP8 adoption requires model-specific validation to prevent accuracy regressions; some workloads may resist aggressive low-precision numerics.
- Software readiness: Realizing DMA/scratchpad and vector/MXU advantages requires teams fluent in JAX/XLA/Pallas and topology-aware scheduling.
- Supply chain and facilities: HBM availability, liquid cooling infrastructure, and optical hardware maturity impact timelines and cost.
- Measurement: Accurate FLOP accounting and trustworthy embodied/operational emissions data are prerequisites for CCI-based decision-making.
- Organizational processes: Reliability (FBIST/replay), carbon-aware scheduling, and modular deployment all depend on mature SRE/MLOps practices and leadership buy-in.
Glossary
- Accelerator Wall: A proposed limit where gains from specialized accelerators taper off after early generations. "TPUs have hurdled the Accelerator Wall [Fuchs19] that claims Moore's Law accounts for the majority of benefit after the first couple of generations."
- AllGather: A collective communication primitive that gathers data from all nodes and distributes the combined result back to all nodes. "collective operations like AllReduce, AllGather, ReduceScatter, and Broadcast;"
- AllReduce: A collective operation that reduces values (e.g., sums) across all nodes and shares the result with all nodes. "supporting common ML communication patterns, like AllReduce."
- ASIC: Application-Specific Integrated Circuit; a chip designed for a particular workload rather than general-purpose use. "Skeptics initially warned that an ASIC might be too tailored to existing DNN models, quickly becoming outdated given AI's rapid pace."
- bisection bandwidth: The aggregate bandwidth across the narrowest cut that divides the network into two equal halves, indicating interconnect capacity at scale. "interconnect bisection bandwidth both grew ~40X;"
- Brain Float (BF16): A 16-bit floating-point format with an 8-bit exponent and 7-bit fraction, prioritizing range over precision. "In 16-bit Brain Float format (BF16), for the first time the exponent (8 bits) is larger than its fraction (7 bits)."
- Broadcast: A collective operation that sends the same data from one node to all nodes. "collective operations like AllReduce, AllGather, ReduceScatter, and Broadcast;"
- carbon dioxide equivalent (CO2e): A metric that expresses the warming impact of various greenhouse gases in terms of an equivalent amount of CO2. "Carbon dioxide equivalent (CO2e) measures the climate impact of GHGes like methane and nitrous oxide by converting them to the amount of CO2 that would cause the same amount of warming over, say, 100 years, using their global warming potential."
- chiplet: A modular piece of a larger chip design that can be combined with others in a package. "Less visible are the advances in power delivery and regulation (including vertical power delivery) or the increase in die size, chiplet count, and packaging sophistication."
- CISC-like instructions: Complex Instruction Set Computing style instructions that perform multi-step operations with single instructions. "Similar to TPU v1, the units execute CISC-like instructions and operate on variable-length inputs, where instruction execution time is data-dependent."
- compute carbon intensity (CCI): Emissions per unit of computation performed (e.g., CO2e per FLOP), including both operational and embodied emissions. "The answer was a new metric: compute carbon intensity (CCI)."
- dataflow architecture: A design where computation is organized around the flow of data through specialized units rather than a centralized control. "We consider SparseCore as a 'dataflow' architecture because data flows from memory to various specialized compute units."
- Direct Memory Access (DMA): Hardware that transfers data between memory and devices without CPU intervention. "An asynchronous DMA (Direct Memory Access) unit transfers data between HBM and local vector memory."
- Diffusion models: Generative models that iteratively refine noise into data samples, now prominent in AI workloads. "Diffusion models are now larger than CNNs."
- distributed router: A routing function embedded in each chip that collectively forms the interconnect, avoiding separate router chips. "ICI, the TPU supercomputer interconnect, relies on a 3D Torus topology with a distributed router as part of every TPU chip; it needs no extra chips for TPU-to-TPU communication."
- Embodied emissions: Greenhouse gas emissions from manufacturing and the supply chain of hardware, amortized over its lifetime. "Embodied emissions are amortized over six year lifetimes for all TPUs."
- ExaFLOP: A unit representing 10^18 floating-point operations (not per second), used to measure fixed computation amounts. "ExaFLOP (10^18 FLOPs) was picked so that CO2e is in grams versus a smaller unit."
- Fetch Unit: A microarchitectural block that reads data (e.g., activations, parameters) into a local memory for processing. "Each tile also includes a Fetch Unit, a programmable 8-wide SIMD Vector Processing Unit, and a Flush Unit."
- FP8: An 8-bit floating-point format used to increase throughput and reduce memory bandwidth for AI workloads. "Ironwood also added support for FP8 arithmetic, which means it can also compute four 512x512 FP8 multiplies."
- Flush Unit: A unit responsible for writing back updated parameters or data to memory after computation. "Each tile also includes a Fetch Unit, a programmable 8-wide SIMD Vector Processing Unit, and a Flush Unit."
- Functional Built-In Self-Test (FBIST): On-chip logic that runs functional tests to detect latent or emerging hardware faults. "The Functional Built-In Self-Test (FBIST) engine, integrated within the MXU, executes high-coverage functional test patterns during manufacturing and data center burn-in..."
- goodput: The effective rate of productive work (e.g., training progress) excluding overhead from retries, failures, or recovery. "Goodput is short for 'good throughput', which in training systems is the rate of good or effective training progress."
- HBM (High Bandwidth Memory): Stacked DRAM providing very high memory bandwidth to accelerators via wide interfaces. "HBM (High Bandwidth Memory) capacity and bandwidth per TPU increased ~10X;"
- High Level Optimizer (HLO): XLA’s intermediate representation for optimizing and compiling ML computations. "with a 'bridge' that translated from TensorFlow graphs into XLA's High Level Optimizer (HLO) format."
- Hyperscalers: Very large cloud and internet companies operating massive data centers and custom infrastructure. "Hyperscalers Alibaba and Amazon (and eventually Microsoft) started their own DNN inference chips."
- Inter-Chip Interconnect (ICI): TPU-to-TPU high-speed links used to build the training supercomputer network. "TPU v2 featured four off-chip links (Inter-Chip Interconnect or ICI) and two on-chip links to an on-chip router."
- JAX: A high-performance machine learning framework with composable transformations that targets XLA/TPUs. "Today JAX (Just-in-time Auto-differentiated XLA) has become the language and system of choice for programming TPUs, with the Pallas kernel language adding fine-grained control for model developers."
- Life-cycle assessment (LCA): A method to quantify environmental impacts across a product’s full life from materials to use. "Google recently completed a life-cycle assessment (LCA) of several TPUs [Schneider25]."
- Matrix Multiply Unit (MXU): A specialized compute unit (often a systolic array) dedicated to matrix multiplications in TPUs. "The matrix multiply unit (MXU) is the computational heart of TPUs."
- Megacore: A compiler-exposed abstraction that makes multiple physical cores appear as one large logical core for easier programming and resource unification. "has supported tensor parallelization directives that give the illusion of a single large core - called Megacore - unifying HBM capacity and ICI bandwidth in a single effective thread"
- Micro-Electro-Mechanical Systems (MEMS): Tiny mechanical devices integrated with electronics, used here for fast optical switching. "3D Micro-Electro-Mechanical Systems (MEMS) mirrors that switch in milliseconds."
- Optical circuit switches (OCSes): Reconfigurable optical switching systems that interconnect racks/cubes, improving availability and scheduling. "the first supercomputer to use optical circuit switches (OCSes) [Jouppi23]."
- optical transceivers: Components that convert between electrical and optical signals for high-speed fiber interconnects. "Google advanced the state-of-the-art in reliability and cost of optical transceivers based on 3D Micro-Electro-Mechanical Systems (MEMS) mirrors that switch in milliseconds."
- Pallas kernel language: A kernel-level language in JAX offering fine-grained control for custom performance-critical code on TPUs. "with the Pallas kernel language adding fine-grained control for model developers."
- PCIe: Peripheral Component Interconnect Express; a high-speed interface connecting accelerators to host CPUs. "via a PCIe-connected CPU host."
- pod: A TPU training supercomputer configuration composed of many interconnected TPU nodes. "This paper reviews five generations of TPU training supercomputers, also called pods."
- ReduceScatter: A collective primitive that reduces data across nodes and distributes different reduced parts to different nodes. "collective operations like AllReduce, AllGather, ReduceScatter, and Broadcast;"
- scatter/gather: Memory access patterns where elements are read from or written to non-contiguous addresses across nodes. "finer-grained access patterns for scatter/gather."
- sea-of-cores: An architecture featuring many relatively simple cores operating in parallel. "They operate in a sea-of-cores configuration, integrating supercomputer-scale HBM and ICI to create a flat, globally addressable memory space."
- Silent Data Corruption (SDC): Undetected data errors that can silently degrade correctness or convergence. "Silent Data Corruption (SDC) in compute logic presents a critical challenge to large-scale AI reliability [George26]."
- SIMD (Single Instruction, Multiple Data): A parallel processing model where one instruction operates on multiple data elements simultaneously. "a programmable 8-wide SIMD Vector Processing Unit"
- slices: Subsets of the supercomputer (by number of chips) allocated to a particular job. "Similar to HPC supercomputers, the workload comprises various scale sizes, termed 'slices,' i.e., 64, 128, ..., 2048 chips."
- SparseCore: TPU’s specialized core for sparse operations and embeddings, also used to offload collectives and summarization. "The SparseCore is a domain-specific architecture initially for embedding training [Jouppi23]."
- sublane: An additional data-parallel dimension within a vector lane that increases per-cycle parallelism. "known as a sublane, enabling operations on 8 sets of 128-wide vectors per clock cycle."
- synchronous data-parallel training: A training approach where replicas synchronize (e.g., via collectives) each step across nodes. "Google employed synchronous data-parallel training to parallelise over multiple 8960-chip TPU v5p pods in multiple data centers for Gemini 2.5 with a goodput of 93% [Gemini25]."
- systolic array: A grid of processing elements that rhythmically compute and pass data for efficient matrix operations. "In TPU v2 it was a 128x128 systolic array of multipliers and adders, delivering 32,768 operations per cycle."
- tensor parallelization: Splitting tensors across devices/cores to scale model execution and memory. "has supported tensor parallelization directives that give the illusion of a single large core - called Megacore -"
- TensorCore: TPU’s main compute core specialized for tensor operations with scalar, vector, and matrix units. "TPU v2 has two TensorCores."
- Thermal Dissipation Power (TDP): The maximum amount of heat generated that a system is designed to dissipate under workload, used as a proxy for power. "although it uses peak performance per TDP (Thermal Dissipation Power) Watt rather than measured performance and power running production workloads as Vahdat et al. recommend."
- Top-K: An operation that selects the K highest (or lowest) values, often for summarization or pruning. "data summarization operations like Top-K;"
- Torus topology: A network layout where nodes form a ring in each dimension, providing wraparound connections for uniform bandwidth/latency. "ICI, the TPU supercomputer interconnect, relies on a 3D Torus topology with a distributed router as part of every TPU chip;"
- Vector Memory (VMEM): Scratchpad memory local to vector units/lane slices, explicitly managed by software/DMAs. "Each lane's register files perform loads and stores against its local slice of vector memory (VMEM)."
- Vector Processing Unit (VPU): A unit executing vector instructions across many lanes for non-matrix operations. "Ironwood introduces a hardware replay unit for the VPU."
- Very Long Instruction Word (VLIW): An architecture that encodes multiple operations per long instruction, exposing instruction-level parallelism to the compiler. "TensorCore's scalar unit fetches complete VLIW (Very Long Instruction Word) bundles of 322 bits from a local instruction memory,"
- vertical power delivery: Supplying power through vertical interconnects in the package to improve power integrity/density. "including vertical power delivery"
- wraparound links: Connections that link opposite faces of a torus network to complete the ring in each dimension. "To create the wraparound links of a 3D torus, links on opposing sides must connect to the same OCS."
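
The four collectives defined in the glossary (AllReduce, AllGather, ReduceScatter, Broadcast) can be compared side by side with a toy model. The sketch below is only an illustration of their semantics: each "node" is a plain Python list in one process, the reduction is assumed to be a sum, and the function names are hypothetical. It says nothing about how TPUs implement these operations over ICI.

```python
# Toy, single-process model of four collectives.
# Assumption: each node holds a vector of equal length and the reduction is a sum.

def all_reduce(shards):
    """Every node ends up with the element-wise sum of all nodes' vectors."""
    total = [sum(vals) for vals in zip(*shards)]
    return [list(total) for _ in shards]

def all_gather(shards):
    """Every node ends up with the concatenation of all nodes' vectors."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]

def reduce_scatter(shards):
    """Sum element-wise, then give node i the i-th equal chunk of the sum."""
    n = len(shards)
    total = [sum(vals) for vals in zip(*shards)]
    if len(total) % n != 0:
        raise ValueError("vector length must be divisible by node count")
    chunk = len(total) // n
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]

def broadcast(shards, root=0):
    """Every node ends up with a copy of the root node's vector."""
    return [list(shards[root]) for _ in shards]

nodes = [[1, 2], [10, 20]]      # two nodes, two values each
print(all_reduce(nodes))        # [[11, 22], [11, 22]]
print(all_gather(nodes))        # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(reduce_scatter(nodes))    # [[11], [22]]
print(broadcast(nodes))         # [[1, 2], [1, 2]]
```

In this model, applying all_gather to the output of reduce_scatter gives the same result as all_reduce, which is one common way to relate the three primitives.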
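
The Brain Float (BF16) entry describes a layout of 1 sign bit, 8 exponent bits, and 7 fraction bits, which is the upper half of an IEEE float32. A minimal sketch of that layout follows, using only the Python standard library. The helper names are hypothetical, and the conversion simply truncates the low 16 bits (an assumption made for simplicity; hardware typically rounds instead).

```python
import struct

def float32_bits(x):
    """Return the 32-bit IEEE-754 pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def to_bf16_bits(x):
    """Keep the top 16 bits of the float32 pattern (truncation, not rounding)."""
    return float32_bits(x) >> 16

def from_bf16_bits(b):
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))[0]

def bf16_fields(b):
    """Split a BF16 pattern into (sign, 8-bit exponent, 7-bit fraction)."""
    return b >> 15, (b >> 7) & 0xFF, b & 0x7F

bits = to_bf16_bits(3.14159)
print(hex(bits))                  # 0x4049
print(bf16_fields(bits))          # (0, 128, 73)
print(from_bf16_bits(bits))       # 3.140625

# Range is preserved even though precision is lost: 1e38 would overflow
# a 5-bit-exponent FP16 but survives in BF16.
print(from_bf16_bits(to_bf16_bits(1e38)) > 9e37)   # True
```

The example shows the trade-off named in the definition: 3.14159 keeps only about two to three significant decimal digits, while very large magnitudes remain representable.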