- The paper presents a heterogeneous RISC-V accelerator chip featuring three compute tiles (VEC, STX, VRP) tailored for distinct HPC workloads.
- The VEC tile processes up to 256 FP64 elements per vector instruction and the STX tile reaches up to 64 GFLOPS (FP64) at 1 GHz, with all tiles validated across scientific and ML benchmarks.
- The design integrates modular RISC-V cores and Arm CPUs via a coherent NoC, demonstrating practical steps towards European technological sovereignty.
Authoritative Summary of "EPAC: The Last Dance" (2604.12715)
Introduction and Motivation
The "EPAC: The Last Dance" paper details the design, implementation, and silicon bring-up of EPAC, a RISC-V-based accelerator chip realized via the European Processor Initiative (EPI). EPAC directly addresses strategic European objectives to establish technological sovereignty in HPC processor ecosystems, motivated by vulnerabilities from reliance on non-European hardware supply chains. The architectural philosophy centers on leveraging the RISC-V ISA for maximal flexibility and open, vendor-neutral design, in tandem with a dual-track system architecture featuring Arm CPUs and specialized RISC-V accelerators. EPAC distinguishes itself by integrating three fundamentally different RISC-V compute tiles (VEC, STX, VRP) for distinct workload profiles, providing a comparative architectural testbed within a unified platform.
EPAC Chip Architecture
EPAC is fabricated in GlobalFoundries 22FDX FDSOI technology, occupying 27 mm² with ~300 million transistors. All compute tiles interface through a Coherent Hub Interface (CHI) in a NoC configuration with distributed L2 caches and off-chip access via SerDes. Across process corners, the chip operates from 768 MHz (SS, 0.72 V, 125 °C) to 1.23 GHz (TT, 0.80 V, 85 °C). The infrastructure is modular, with each compute tile implemented as an independent RTL partition.
Compute Tile Overview
- VEC: Focused on double-precision (FP64) vectorized HPC workloads, VEC is based on the in-order Avispado core (Semidynamics) and a large vector processing unit (VPU, BSC) connected via the Open Vector Interface (OVI). It supports RVV 0.7.1, a vector length of up to 2048 bytes, and 40 physical vector registers. Compiler support via LLVM-EPI enables auto-vectorization across C, C++, and Fortran.
- STX: Purpose-built for stencil- and tensor-oriented machine learning workloads, STX uses lightweight Snitch cores and optional Stencil Processing Units (SPU) with scratchpad memory, Stream Semantic Registers (SSR), and Floating-Point Repetition (FREP) for efficient address generation and data movement. STX has no cache hierarchy; memory is explicitly managed via DMA.
- VRP: Extends floating-point precision up to 512 bits to accelerate iterative numerical solvers. Built on a CVA6 core (CEA) with a custom variable-precision FPU (VPFPU) and RISC-V ISA extension (Xvpfloat), VRP allows runtime selection of numeric format and precision, supporting extended IEEE 754 formats and dense BLAS workloads.
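To make the VEC programming model concrete, the following minimal Python sketch models how a vector-length-agnostic loop is strip-mined when at most 256 FP64 elements fit in one vector instruction. It is an illustration only, not EPAC or LLVM-EPI code: `set_vl` is a simplified stand-in for the effect of RVV's `vsetvl`, and `daxpy_stripmined` is a hypothetical helper name.

```python
# Illustrative model only. VLMAX_FP64 mirrors the 256-element FP64 maximum
# quoted for VEC; set_vl and daxpy_stripmined are hypothetical names.
VLMAX_FP64 = 256


def set_vl(remaining, vlmax=VLMAX_FP64):
    """Simplified model of vsetvl: grant at most vlmax elements."""
    return min(remaining, vlmax)


def daxpy_stripmined(a, x, y):
    """Update y[i] += a * x[i] strip by strip; return the number of strips."""
    n = len(x)
    i = 0
    strips = 0
    while i < n:
        vl = set_vl(n - i)
        # One strip corresponds to one vector instruction over vl elements.
        y[i:i + vl] = [a * xi + yi for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        i += vl
        strips += 1
    return strips


x = [float(k) for k in range(1000)]
y = [1.0] * 1000
strips = daxpy_stripmined(2.0, x, y)
print(strips, y[:3], y[-1])
```

For 1000 elements the loop runs three full strips of 256 elements and a final strip of 232, so the same code works for any array length without a scalar tail loop.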
Uncore Infrastructure
EPAC's "uncore" comprises a CHI-based NoC, distributed L2 cache slices (each 256 kB, 8-way set-associative, 512-bit datapath), and a coherence Home Node (HN) implementing a MESI-like protocol. The NoC, realized by EXTOLL, uses crosspoint (XP) blocks in a 2D mesh for high throughput (up to 64 GB/s per port at 1 GHz) with granular flow control. External memory access relies on a chip-to-chip (C2C) SerDes link offering up to 25 GB/s per direction. The L2 slices are fully pipelined and support atomic operations, SECDED, and programmable address interleaving. The HN tracks coherence state at cache-line granularity for full system coherence.
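As a quick sanity check on the uncore figures, the short Python sketch below converts the quoted NoC port bandwidth into bits per cycle and compares it with the L2 datapath width and the C2C link rate. The constants are the numbers quoted in this section; reading the match with the 512-bit L2 datapath as a design relationship is an assumption, since the summary does not state the NoC link width.

```python
# Figures quoted in this section; the helper name is hypothetical.
NOC_PORT_BYTES_PER_S = 64 * 10**9   # up to 64 GB/s per NoC port
NOC_CLOCK_HZ = 10**9                # at 1 GHz
C2C_BYTES_PER_S = 25 * 10**9        # up to 25 GB/s per direction (C2C link)
L2_DATAPATH_BITS = 512              # L2 slice datapath width


def bits_per_cycle(bytes_per_s, clock_hz):
    """Bits transferred per clock cycle at the given sustained rate."""
    return bytes_per_s * 8 // clock_hz


port_bits = bits_per_cycle(NOC_PORT_BYTES_PER_S, NOC_CLOCK_HZ)
c2c_fraction = C2C_BYTES_PER_S / NOC_PORT_BYTES_PER_S

print(port_bits, port_bits == L2_DATAPATH_BITS, c2c_fraction)
```

Under these numbers a NoC port moves 512 bits per cycle, the same width as an L2 slice datapath, while one direction of the C2C link provides roughly 39% of a single port's peak bandwidth.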
Implementation Process and Engineering Lessons
The physical implementation entailed hierarchical RTL design flows coordinated across diverse partners, managed via Cadence tools and Siemens Mentor Calibre for signoff. Consolidation of codebases required interface harmonization, naming conflict resolution, and toolchain alignment. Full-scan DFT, scan compression, and PMBIST coverage were achieved. Area reduction optimizations led to a final chip footprint below 27 mm², with robust IR-drop management.
Board-level bring-up utilized custom daughterboards interfacing with Xilinx FPGA platforms. Early validation covered core peripheral access, NoC connectivity, cache coherency, SerDes bandwidth (validated up to 20 GB/s aggregate), and successful Linux boot (Ubuntu 22.04 LTS) with vector support. All major blocks functioned correctly under sustained HPC workloads (e.g., LINPACK).
Numerical Results and Technical Claims
- VEC: Achieves vectorized throughput processing up to 256 FP64 elements per instruction with eight parallel FU pipelines, sustaining one SIMD instruction per 32 cycles. Compiler-driven vectorization enables high utilization across scientific codes ([12-16]).
- STX: Clusters of Snitch cores, scratchpad memory, and SPU co-processors yield up to 64 GFLOPS double-precision per tile at 1 GHz. The paper reports high FPU utilization across ML and stencil benchmarks.
- VRP: Extended-precision hardware enables iterative solvers to converge in fewer iterations than pure FP64. Throughput varies with the selected precision but can approach one instruction per cycle with suitable instruction mixes, as validated on conjugate gradient and BiCG variants ([19,20]). VRP is more efficient than software-only MPFR approaches.
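The VEC and STX figures above can be cross-checked with simple arithmetic. The Python sketch below is a back-of-the-envelope check, not a performance model; it assumes that each of the eight VEC pipelines completes one FP64 element per cycle, which the summary does not state explicitly.

```python
# Figures quoted in the list above.
VEC_ELEMENTS_PER_INSTR = 256   # FP64 elements per vector instruction
VEC_PIPELINES = 8              # parallel FU pipelines
STX_PEAK_FLOPS = 64 * 10**9    # 64 GFLOPS (FP64) per STX tile
STX_CLOCK_HZ = 10**9           # at 1 GHz

# Assumption: one FP64 element per pipeline per cycle.
vec_cycles_per_instr = VEC_ELEMENTS_PER_INSTR // VEC_PIPELINES

# Peak FP64 operations an STX tile must complete per cycle to reach 64 GFLOPS.
stx_flops_per_cycle = STX_PEAK_FLOPS // STX_CLOCK_HZ

print(vec_cycles_per_instr, stx_flops_per_cycle)
```

Under that assumption, a full-length vector instruction occupies the pipelines for 32 cycles, which matches the quoted rate of one instruction per 32 cycles, and the STX peak corresponds to 64 FP64 operations per cycle per tile.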
The paper asserts that there is no single best architectural solution for HPC acceleration: each tile offers distinct trade-offs in generality, efficiency, and precision. Furthermore, explicit hardware support for variable precision is both feasible and practical within RISC-V, contradicting the conventional wisdom favoring software-based solutions for extended precision.
Practical and Theoretical Implications
EPAC substantiates the viability of diverse RISC-V architectures for HPC acceleration, effectively addressing distinct classes of scientific and ML workloads. Practically, it establishes a European IP base for future processor designs, including compiler toolchains, runtime libraries, and hardware emulation environments. The architectural diversity allows empirical trade-off analysis for future HPC heterogeneous architectures, especially when scaling toward exascale systems with stringent energy and precision requirements.
Theoretically, EPAC demonstrates that hardware-based variable precision can materially improve convergence and stability in iterative scientific computations, thus impacting future algorithm-architecture co-design. The lessons learned in distributed, multi-partner design coordination foreshadow future challenges in scaling European hardware consortia.
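A small software analogue can show why selectable precision matters for the residual computations at the heart of iterative solvers. The Python sketch below uses the standard-library `decimal` module, not the Xvpfloat extension or the VPFPU, and the `dot` helper and its inputs are hypothetical. It accumulates the same dot product at two working precisions, where the exact result is 1.

```python
from decimal import Decimal, localcontext


def dot(xs, ys, digits):
    """Dot product accumulated with `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        acc = Decimal(0)
        for x, y in zip(xs, ys):
            acc += Decimal(x) * Decimal(y)
        return acc


# Inputs chosen so that large terms cancel; the exact result is 1.
xs = ["1e20", "1", "-1e20"]
ys = ["1", "1", "1"]

low = dot(xs, ys, 16)    # roughly FP64-like precision: the 1 is absorbed
high = dot(xs, ys, 40)   # extended precision: the 1 survives cancellation

print(low == 0, high == 1)
```

At 16 digits the small term is lost when added to the large one, so the result collapses to zero, whereas 40 digits recovers the exact value. Errors of this kind in residuals and inner products are one reason a solver can stall or need extra iterations at fixed FP64 precision.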
Future Directions in AI and HPC
EPAC's diverse tile architecture is well-suited for next-generation AI/HPC workloads requiring both high throughput and numerical reliability. Future developments may focus on scalable integration of more specialized AI accelerators, improved interconnect bandwidth, and dynamic precision adaptation at the hardware level. Potential directions include deeper compiler-hardware co-design for domain-specific acceleration, further leveraging open ISAs to avoid vendor lock-in, and expanding into automotive and datacenter markets.
Given the experience captured in EPAC, multi-partner design complexity will necessitate new methodologies in system-level integration, toolchain interoperability, and distributed verification for future pan-European processor projects.
Conclusion
EPAC validates the feasibility and utility of a heterogeneous RISC-V accelerator chip in the European context, integrating vector, stencil, and variable-precision compute tiles for varied HPC workloads. The platform provides a comprehensive engineering reference that informs subsequent processor initiatives, underscoring the importance of architectural exploration, open standards, and integrated software infrastructure. The lessons learned around distributed design and the demonstrated silicon success represent valuable assets for both practical deployment and academic investigation into scalable, high-performance, general and domain-specific computing architectures.