
Grace ARM Neoverse V2 CPU Overview

Updated 2 June 2026
  • Grace ARM Neoverse V2 CPU is a high-performance, out-of-order core that balances integer and floating-point operations with a deep, wide-issue pipeline.
  • It leverages automatic write-allocate evasion to minimize redundant memory transfers, enhancing efficiency without requiring software intervention.
  • Empirical OSACA modeling and benchmark validation demonstrate robust in-core performance with sustained frequency under heavy workloads.

The Grace ARM Neoverse V2 CPU is an out-of-order (OoO), high-performance core designed by Arm Ltd. and implemented in Nvidia’s Grace Superchip for high-performance computing (HPC). Its microarchitectural design prioritizes both integer and floating-point throughput, provides advanced memory-handling features, including automatic write-allocate (WA) evasion, and is described by an in-core performance model validated with targeted benchmark suites. Key strengths include a wide-issue superscalar frontend, an extensive execution port architecture, high memory bandwidth, and WA evasion that works without software intervention (Laukemann et al., 2024).

1. Microarchitectural Design and Execution Pipeline

Neoverse V2, as implemented in Grace, features a deep, out-of-order pipeline segregating integer and floating-point execution. The frontend pipeline fetches up to 4 instructions per cycle, decodes and renames up to 8 µ-ops, and dispatches to a wide OoO window, supporting a maximum retire bandwidth of 8 µ-ops/cycle.

The execution backend is distributed across 17 issue ports, with functional unit mapping as follows:

| Port Range | Functionality | Bandwidth per Cycle |
|---|---|---|
| 0–5 | Integer ALUs | 6 × single-cycle |
| 6–9 | FP/ASIMD units | 4 × SIMD (16 B, 128 bit) |
| 10–12 | Load ports | 3 × 128 B |
| 13–14 | Store addr/data | 2 × 128 B |
| 15–16 | Branch, SFU | N/A |

ILP is maximized through independent routing of integer and floating point µ-ops onto their own port clusters, and register renaming is employed to break false dependencies. The scheduler issues one µ-op per port each cycle, subject to a global dispatch width of approximately 8 µ-ops/cycle.
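The relationship between the port clusters and the dispatch width can be made concrete with a minimal sketch. It encodes the port ranges from the table above and the approximate dispatch width of 8 µ-ops/cycle quoted in the text. The names `PORT_CLUSTERS`, `ports_in`, and `peak_issue_rate` are hypothetical, and the model ignores everything except port counts and dispatch width.

```python
# Port clusters from the table above, as inclusive (first, last) port indices.
PORT_CLUSTERS = {
    "int_alu": (0, 5),
    "fp_asimd": (6, 9),
    "load": (10, 12),
    "store": (13, 14),
    "branch_sfu": (15, 16),
}
DISPATCH_WIDTH = 8  # approximate global dispatch width quoted in the text


def ports_in(cluster):
    """Number of issue ports in a cluster."""
    first, last = PORT_CLUSTERS[cluster]
    return last - first + 1


def peak_issue_rate(clusters_used):
    """Upper bound on µ-ops/cycle for code that uses only these clusters.

    Simplifying assumption: one µ-op per port per cycle, capped by dispatch.
    """
    return min(DISPATCH_WIDTH, sum(ports_in(c) for c in clusters_used))


total_ports = sum(ports_in(c) for c in PORT_CLUSTERS)
print(total_ports)                              # 17
print(peak_issue_rate(["fp_asimd"]))            # 4
print(peak_issue_rate(["int_alu", "fp_asimd"]))  # 8
```

Under these assumptions, pure FP/ASIMD code is limited by its four ports, while mixed integer and FP code reaches the dispatch limit before it runs out of ports.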

2. Cache Hierarchy and Memory Subsystem

Grace employs a three-level cache hierarchy optimized for HPC workloads:

  • L1: Private, 64 KiB, 64-B lines, single-cycle hit latency.
  • L2: Private, 1 MiB, 256-B lines, exclusive, 10–12 cycle access latency.
  • L3: 114 MiB shared per chip, inclusive, 30–40 cycle latency.

The main memory system uses four ccNUMA domains and LPDDR5X DRAM, with a theoretical bandwidth of up to 546 GB/s per chip and a measured sustainable throughput of 467 GB/s, corresponding to 87% bandwidth utilization. In absolute measured bandwidth, Grace exceeds both Genoa (360 GB/s of a theoretical 461 GB/s, 78%) and Sapphire Rapids (273 GB/s of a theoretical 307 GB/s, 89%); in utilization, it is comparable to Sapphire Rapids (Laukemann et al., 2024).

3. Write-Allocate (WA) Evasion and Memory Traffic Model

Traditional write-allocate caches incur memory traffic penalties when stores miss in L1: a full line is fetched (read-allocate) before being overwritten and evicted, resulting in twice the store volume (traffic ratio $R = 2$). Grace's "SpecI2M-style" automatic WA evasion suppresses the read-allocate when a full cache line overwrite is detected, avoiding redundant memory transfers.

  • No software flags or non-temporal (NT) stores are required.
  • WA evasion is always active, independent of core count or store pattern.
  • All tested store-only kernel classes benefit.

Competitor implementations are less comprehensive: Golden Cove (SPR) only triggers SpecI2M WA evasion at ≥80% bandwidth usage and caps traffic reduction at ~25%, while Zen 4 (Genoa) provides no HW-enabled WA evasion except via NT stores (Laukemann et al., 2024).

The memory traffic ratio $R := M_\text{actual} / D$ (where $D$ is the store volume) approaches 1 for Grace under all workloads, compared to $R$ between 1.75 and 2 for SPR and $R = 2$ for Genoa (unless NT stores are used).
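To see what the traffic ratio means in practice, the following sketch applies $M_\text{actual} = R \cdot D$ to a store-only kernel and derives the share of memory bandwidth left for useful stores. The ratios and measured bandwidths are the ones quoted in this article. The idealized value $R = 1$ for Grace, the function names, and the assumption that the kernel is purely memory bound are illustrative choices, not results from the source.

```python
def memory_traffic(store_volume, ratio):
    """Actual memory traffic M_actual = R * D for a store-only kernel."""
    return ratio * store_volume


def effective_store_bandwidth(measured_bw_gbs, ratio):
    """Bandwidth left for useful stores if the kernel is memory bound.

    Assumption: all traffic (stores plus write-allocate reads) shares
    the measured sustainable bandwidth.
    """
    return measured_bw_gbs / ratio


# (measured bandwidth in GB/s, traffic ratio R) as quoted in the text;
# R = 1.0 for Grace is the idealized limit the text says it approaches.
CPUS = {
    "Grace": (467.0, 1.0),
    "SPR, best case": (273.0, 1.75),
    "SPR, worst case": (273.0, 2.0),
    "Genoa, no NT stores": (360.0, 2.0),
}

store_volume_gb = 8.0  # hypothetical: 1e9 double-precision stores
for name, (bw, r) in CPUS.items():
    traffic = memory_traffic(store_volume_gb, r)
    useful = effective_store_bandwidth(bw, r)
    print(f"{name}: traffic {traffic:.1f} GB, useful store bandwidth {useful:.0f} GB/s")
```

Under these assumptions, a ratio of 2 halves the bandwidth available for the stores themselves, which is why WA evasion matters most for store-heavy, memory-bound kernels.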

4. In-Core Performance Modeling

The Open Source Architecture Code Analyzer (OSACA) was extended to model Neoverse V2 with the following bottlenecks per iteration:

  • $C_\text{ports} = \max_k (u_k)$: port pressure (max µ-ops per port)
  • $C_\text{issue} = \left\lceil \sum_k u_k / \omega \right\rceil$: issue width constraint ($\omega = 8$)
  • $C_\text{dep}$: critical-path latency (sum of dependent instruction latencies, empirically 2–14 cycles depending on opcode)

Predicted lower-bound cycles per iteration: $C_\text{pred} = \max(C_\text{ports}, C_\text{issue}, C_\text{dep})$.
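The three bounds above combine into a single prediction, which can be sketched in a few lines. This is a toy re-implementation of the formulas as stated here, not OSACA's actual code or API. The function name `predict_cycles`, the per-port µ-op counts, and the example latency are hypothetical.

```python
from math import ceil

ISSUE_WIDTH = 8  # omega, as stated in the text


def predict_cycles(port_uops, critical_path_latency, issue_width=ISSUE_WIDTH):
    """Lower-bound cycles per iteration from the three bottlenecks.

    port_uops maps a port index k to u_k, the µ-ops assigned to that port
    per loop iteration; critical_path_latency is C_dep in cycles.
    """
    c_ports = max(port_uops.values())
    c_issue = ceil(sum(port_uops.values()) / issue_width)
    c_dep = critical_path_latency
    return {
        "ports": c_ports,
        "issue": c_issue,
        "dep": c_dep,
        "pred": max(c_ports, c_issue, c_dep),
    }


# Hypothetical loop body: 4 FP µ-ops on ports 6-7, 3 loads, 1 store.
uops = {6: 2, 7: 2, 10: 1, 11: 1, 12: 1, 13: 1}

print(predict_cycles(uops, critical_path_latency=4))  # dependency-bound
print(predict_cycles(uops, critical_path_latency=1))  # port-bound
```

With a 4-cycle dependency chain the prediction is 4 cycles per iteration. Without it, the two µ-ops on the busiest port set the bound at 2 cycles.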

Validation on 416 microbenchmarks (stencils, STREAM-type kernels, reductions) indicated that 96% of kernels were conservatively under-predicted (i.e., $C_\text{pred} \le$ measured cycles), so the prediction acts as a lower bound. The relative prediction error (RPE) was ≤10% for 37% of the set and ≤20% for 44%; the mean RPE was 30% (Laukemann et al., 2024).
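Validation statistics of this kind can be computed as in the sketch below. It assumes the prediction error is defined as (measured − predicted) / measured, so that positive values mean the model predicted fewer cycles than were measured. That definition, the function names, and the sample data are assumptions for illustration, not values or code from the cited study.

```python
from statistics import mean


def rpe(measured, predicted):
    """Prediction error relative to the measurement (assumed definition)."""
    return (measured - predicted) / measured


def summarize(pairs):
    """Summary statistics over (measured, predicted) cycle pairs."""
    errors = [rpe(m, p) for m, p in pairs]
    n = len(errors)
    return {
        "lower_bound_share": sum(e >= 0 for e in errors) / n,
        "within_10pct": sum(abs(e) <= 0.10 for e in errors) / n,
        "within_20pct": sum(abs(e) <= 0.20 for e in errors) / n,
        "mean_abs_rpe": mean(abs(e) for e in errors),
    }


# Hypothetical (measured, predicted) cycles per iteration for four kernels.
samples = [(10.0, 9.5), (8.0, 6.0), (4.0, 4.2), (20.0, 10.0)]
stats = summarize(samples)
for key, value in stats.items():
    print(f"{key}: {value:.3f}")
```

In this made-up sample, three of four kernels are predicted at or below the measurement, and one is predicted slightly too slow.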

5. Comparative Throughput and Latency

Abridged instruction throughput (in double-precision (DP) elements/cycle) and operation latency (cycles) for Neoverse V2, Golden Cove, and Zen 4:

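| Operation | V2 throughput | Golden Cove throughput | Zen 4 throughput | V2 latency | Golden Cove latency | Zen 4 latency |
|---|---|---|---|---|---|---|
| Gather | 1/4 | 1/3 | 1/8 | | | |
| Vector ADD | 1.0 | 4.0 | 2.0 | 3 | 5 | 4 |
| Vector FMA | 0.5 | 4.0 | 2.0 | 4 | 5 | 4 |
| Scalar ADD | 0.4 | 0.25 | 0.2 | 2 | 4 | 3 |
| Scalar DIV | 0.25 | 0.2 | 0.2 | 10 | 14 | 12 |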

V2 demonstrates lower vector ADD/FMA throughput than its competitors, but competitive scalar throughput and lower scalar ADD latency (Laukemann et al., 2024). Sustained frequency under arithmetic-heavy loads remains constant at 3.4 GHz for Grace, compared to frequency drops in Golden Cove (down to 2.0 GHz, 53% of peak) and Zen 4 (down to 3.1 GHz, 84% of peak).
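Per-cycle throughput and sustained clock frequency together determine the rate per second. The sketch below multiplies the vector FMA throughputs from the table by the lowest sustained frequencies quoted above. Pairing those frequencies with this particular instruction is an assumption, since the text does not state which instruction mix causes the frequency drops. The constant and function names are hypothetical.

```python
# Vector FMA throughput (DP elements/cycle) from the table above.
VECTOR_FMA_PER_CYCLE = {"Neoverse V2": 0.5, "Golden Cove": 4.0, "Zen 4": 2.0}

# Lowest sustained frequencies (GHz) under arithmetic-heavy load, from the text.
SUSTAINED_GHZ = {"Neoverse V2": 3.4, "Golden Cove": 2.0, "Zen 4": 3.1}


def per_second_rate(elements_per_cycle, ghz):
    """Giga-elements per second = (elements/cycle) * (Gcycles/s)."""
    return elements_per_cycle * ghz


rates = {
    cpu: per_second_rate(VECTOR_FMA_PER_CYCLE[cpu], SUSTAINED_GHZ[cpu])
    for cpu in VECTOR_FMA_PER_CYCLE
}
for cpu, rate in rates.items():
    print(f"{cpu}: {rate:.2f} Gelem/s per core")
```

Under these assumptions the constant clock narrows the per-cycle gap but does not close it: the V2 to Golden Cove ratio moves from 1:8 per cycle to roughly 1:4.7 per second.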

6. OSACA Tool Extensions for Neoverse V2

OSACA was outfitted with a full 17-port model, empirical µ-op mappings, and measured in-core latencies/throughputs (via sources such as ibench and the OoO-framework). Special attention was given to unique aspects of ARM’s rename/dispatch, and the WA evasion mechanism was modeled by disabling write-allocate ports for store-only kernels. OSACA’s V2 model achieved an average under-prediction error of 30% (compared to LLVM-MCA’s 34%), with only one outlier over 2× error across 416 workloads, yielding the most accurate public ARM in-core model to date (Laukemann et al., 2024).

7. Context and Significance

Grace’s Neoverse V2 implementation, with its architectural focus on memory efficiency (via WA evasion), balanced integer/FP throughput, and sustained performance under heavy vector loads, positions it as a leading option for HPC and memory-bound system design. Its contribution to model-driven architectural evaluation, most notably OSACA’s validated in-core model, enables lower-bound performance estimates and facilitates system-level analyses in comparative processor research (Laukemann et al., 2024). The always-on automatic WA evasion differentiates Grace from other high-core-count CPUs and reduces the onus on compiler and application developers to exploit hardware memory efficiency.

References (1)
