GRVI Phalanx: RISC-V FPGA Accelerator Architecture
- GRVI Phalanx is a massively parallel RISC-V FPGA accelerator architecture that uses highly space- and power-efficient soft cores with a scalable cluster-based design.
- It features banked scratchpad memories and a low-latency 2D mesh network, achieving high throughput and predictable memory performance.
- The design leverages shared acceleration units and rapid hardware multicast to optimize processing density, power efficiency, and system scalability.
GRVI Phalanx refers to a massively parallel RISC-V FPGA accelerator architecture, distinguished by its highly space- and power-efficient soft processors, scalable cluster-based organization, banked scratchpad memories, and a low-latency, wide on-chip network. It demonstrates that a single FPGA (specifically, a Xilinx Kintex UltraScale KU040) can support hundreds of custom RISC-V cores delivering aggregate throughput and memory bandwidth often associated with large multi-processor ASICs, all within practical power, area, and toolchain constraints (Gray, 2016).
1. GRVI Soft Processor Microarchitecture
At the core of GRVI Phalanx lies the GRVI processor—a compact RV32I RISC-V soft core—optimized for LUT efficiency (MIPS per LUT) and high clock rates on FPGA fabrics. Notable microarchitectural choices include:
- Pipeline Stages: The core can be built as either a 2-stage pipeline (omitting the instruction fetch latch) or a 3-stage pipeline (instruction fetch latch included, followed by single-cycle decode and single-cycle execute). Loads do not use a separate memory stage; data returns via a bypass multiplexer in the writeback path.
- Functional Unit Sharing: To minimize logic duplication, conventional features such as the barrel shifter, hardware multiplier, and byte/half-word load-store logic are omitted from each core and implemented as shared “acceleration units” within a cluster. The base core implements only integer ALU, branch comparator, 2R/1W register file, and PC logic.
- FPGA Resource Utilization: The datapath is highly regular, built as a floorplanned 6-LUT macro, yielding approximately 250 LUTs per datapath, with the total per-core count (including control FSMs and registers) around 320 LUTs. Achievable frequency is up to 375 MHz on KU040.
- Instruction Efficiency: The estimated CPI is ≈1.3 (2-stage) or ≈1.6 (3-stage), giving ≈0.7 MIPS/LUT, a high density for soft-core CPUs on FPGAs.
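The density figure follows from clock rate, CPI, and per-core LUT count. The sketch below shows that arithmetic only; the function name `mips_per_lut` and the 300 MHz clock are illustrative assumptions (the text above states only "up to 375 MHz"), so the printed values are estimates rather than reported results.

```python
def mips_per_lut(clock_mhz: float, cpi: float, luts_per_core: int) -> float:
    """Sustained MIPS per LUT for one core: (clock / CPI) / LUT count."""
    return (clock_mhz / cpi) / luts_per_core


LUTS_PER_CORE = 320  # approximate per-core LUT count quoted above

# CPI estimates quoted above; the clock values are assumptions for illustration.
for stages, cpi in ((2, 1.3), (3, 1.6)):
    for clock_mhz in (300.0, 375.0):
        density = mips_per_lut(clock_mhz, cpi, LUTS_PER_CORE)
        print(f"{stages}-stage at {clock_mhz:.0f} MHz, CPI {cpi}: "
              f"{density:.2f} MIPS/LUT")
```

Under these assumptions, a 2-stage core at 300 MHz and a 3-stage core at 375 MHz both land close to the ≈0.7 MIPS/LUT figure.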
2. Cluster Organization, Shared Memories, and Messaging
Phalanx organizes GRVI cores into clusters that balance local memory bandwidth, accelerator coupling, and scalability:
- Cluster Composition: Each cluster comprises eight GRVI processing elements (PEs), sharing both instruction memory (IRAM) and a larger multi-ported data scratchpad (CRAM). IRAM is typically 4 KB (via banked BRAMs, one per PE pair); CRAM is a 32 KB, 12-ported block (4×32-bit ports for PEs, 8×32-bit ports for the accelerator/NOC interface).
- PE→CRAM Interconnect: Four 2:1 concentrators compress the eight PE write ports to four CRAM write ports; a 4×4 crossbar plus arbiter resolves bank conflicts. Simultaneous multi-PE access to the same bank results in stalls.
- Accelerator Coupling and Messaging: The eight extra CRAM ports may serve a custom accelerator (memory sharing or independent network client) or be time-multiplexed into a 256-bit-wide “post” port to the Hoplite router, enabling 32-byte message injection into the network per cycle.
- Hardware Multicast: Hoplite’s multicast mechanism can reprogram all IRAMs with a new kernel in 1,024 cycles (~4 μs @ 250 MHz), providing rapid code deployment across the device.
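The PE-to-CRAM path described above (2:1 concentrators followed by a 4×4 crossbar with per-bank arbitration) can be modeled for a single cycle as follows. This is a minimal behavioral sketch, not the actual hardware: word-interleaved bank selection and lowest-index-wins priority are assumptions, and the names `bank_of` and `arbitrate` are hypothetical.

```python
NUM_BANKS = 4    # CRAM banks behind the 4x4 crossbar
WORD_BYTES = 4   # 32-bit ports


def bank_of(addr: int) -> int:
    """Assumed word-interleaved banking: consecutive words hit consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def arbitrate(requests: dict) -> tuple:
    """Model one cycle of CRAM arbitration.

    requests maps a PE index (0..7) to the byte address it wants to access.
    Returns (granted, stalled): granted maps bank -> winning PE, and stalled
    is the sorted list of PEs that must retry in a later cycle.
    """
    stalled = []

    # Stage 1: four 2:1 concentrators; PEs (0,1), (2,3), ... share one port.
    port_winner = {}
    for pe in sorted(requests):          # assumed priority: lowest PE index wins
        port = pe // 2
        if port in port_winner:
            stalled.append(pe)
        else:
            port_winner[port] = pe

    # Stage 2: 4x4 crossbar; at most one access per bank per cycle.
    granted = {}
    for port in sorted(port_winner):
        pe = port_winner[port]
        bank = bank_of(requests[pe])
        if bank in granted:
            stalled.append(pe)           # bank conflict
        else:
            granted[bank] = pe

    return granted, sorted(stalled)


granted, stalled = arbitrate({0: 0x00, 1: 0x04, 2: 0x10, 4: 0x08, 6: 0x0C})
print("granted:", granted)
print("stalled:", stalled)
```

In this example PE 1 loses its concentrator to PE 0, and PE 2 loses bank 0 to PE 0, so both stall while PEs 0, 4, and 6 proceed.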
3. Hoplite Network Architecture and Interconnect
The Hoplite network-on-chip (NOC) serves as GRVI Phalanx’s communication and memory coherence fabric:
- Topology: Clusters are arranged as a 2D mesh (a 10×5 grid on the KU040, for example), and each cluster integrates a Hoplite router.
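- Link Parameters: Each bidirectional channel is 300 bits wide (including 288 bits of payload). The per-link raw bandwidth is
$$B_{\text{link}} = w_{\text{link}} \times f_{\text{clk}}$$
With $w_{\text{link}} = 300$ bits and $f_{\text{clk}} = 250$ MHz, per-direction bandwidth is ≈75 Gb/s.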
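- Aggregate and Bisection Bandwidth: For a 10×5 mesh, the raw bisection bandwidth is
$$B_{\text{bisect}} = N_{\text{cut}} \times B_{\text{link}}$$
where $N_{\text{cut}}$ is the number of links crossing the bisecting cut and $B_{\text{link}}$ is the per-link raw bandwidth (≈75 Gb/s); ten such links would give 750 Gb/s raw (the measured figure, accounting for protocol overhead, is ~700 Gb/s).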
- Routing: Dimension-ordered (XY) routing minimizes path length. One-hop latency is 2–3 cycles; full mesh traversal is 25–30 cycles.
- Packet Handling: A PE packs message data into memory-mapped (MMIO) CRAM regions; the router atomically injects the packet, which is delivered directly to the destination cluster's CRAM.
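Dimension-ordered routing and the link-bandwidth arithmetic can be illustrated with a short sketch. It assumes an idealized mesh in which a packet first travels along X and then along Y with no contention; the helper names `xy_route` and `link_bandwidth_gbps` are hypothetical, and the model does not capture router-internal behavior.

```python
def xy_route(src: tuple, dst: tuple) -> list:
    """Return the (x, y) router coordinates visited from src to dst.

    The packet moves along X until the column matches, then along Y.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def link_bandwidth_gbps(width_bits: int, clock_mhz: float) -> float:
    """Per-direction link bandwidth in Gb/s: width (bits) x clock (MHz) / 1000."""
    return width_bits * clock_mhz / 1000.0


# Corner-to-corner route on a 10x5 mesh (coordinates 0..9 by 0..4).
route = xy_route((0, 0), (9, 4))
print("hops:", len(route) - 1)
print("raw link bandwidth (Gb/s):", link_bandwidth_gbps(300, 250))
print("payload link bandwidth (Gb/s):", link_bandwidth_gbps(288, 250))
```

The corner-to-corner route takes 13 hops; at 2 cycles per hop that is 26 cycles, which falls within the 25–30 cycle traversal range quoted above.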
4. System Performance Metrics
Phalanx demonstrates high sustained throughput and memory bandwidth at low power:
| Resource | Value (KU040 Example) | Details |
|---|---|---|
| Cores (GRVI PEs) | 400 | 10×5 clusters × 8 PEs |
| Per-core frequency | ~250 MHz | Clock of the full-system build; below the up-to-375 MHz core maximum |
| Peak per-core MIPS | ~250 | 1 instruction/cycle × 250 MHz; ≈190 sustained at CPI ≈ 1.3 |
| Total system MIPS | ~100,000 | 400 × per-core MIPS |
| Shared memory bandwidth | 600 GB/s | 50 clusters × 12 GB/s per cluster (12 ports × 4 B × 250 MHz) |
| NOC bisection bandwidth | ~700 Gb/s | Measured, not raw theoretical |
| Total power (all cores) | ~13 W | ~33 mW/core, stress test |
- Scaling Characteristics: Aggregate throughput scales linearly with core count until saturated by either per-cluster CRAM bandwidth or global NOC bisection.
- Memory Contention: Within a cluster, >4 simultaneous writes incur stalls; across the NOC, head-of-line blocking arises with non-uniform or many-to-one packet flow.
- Area and Energy Efficiency: The architecture achieves ~0.7 MIPS/LUT, and the router cost amortized per PE (~40 LUTs) is minor. Energy per instruction is competitive with small hard CPUs, with the added benefit of extensive parallelism.
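The aggregate values can be cross-checked from parameters quoted in earlier sections (10×5 clusters, eight PEs per cluster, twelve 32-bit CRAM ports per cluster, 250 MHz, ~13 W). The sketch below is one way to arrive at them; it assumes peak throughput of one instruction per cycle per core, and the function name `system_figures` is hypothetical.

```python
CLUSTERS_X, CLUSTERS_Y = 10, 5   # cluster grid on the KU040 example
PES_PER_CLUSTER = 8
CLOCK_MHZ = 250
CRAM_PORTS = 12                  # 32-bit ports per cluster CRAM
PORT_BYTES = 4
TOTAL_POWER_W = 13.0
CPI_ESTIMATE = 1.3               # 2-stage estimate quoted in section 1


def system_figures() -> dict:
    """Derive aggregate figures from the per-cluster parameters above."""
    clusters = CLUSTERS_X * CLUSTERS_Y
    cores = clusters * PES_PER_CLUSTER
    peak_mips = cores * CLOCK_MHZ  # assumes 1 instruction/cycle/core
    cram_gb_per_s_cluster = CRAM_PORTS * PORT_BYTES * CLOCK_MHZ / 1000
    return {
        "clusters": clusters,
        "cores": cores,
        "peak_mips": peak_mips,
        "sustained_mips_estimate": peak_mips / CPI_ESTIMATE,
        "cram_gb_per_s": clusters * cram_gb_per_s_cluster,
        "mw_per_core": TOTAL_POWER_W * 1000 / cores,
    }


for name, value in system_figures().items():
    print(f"{name}: {value:,.1f}")
```

This reproduces 400 cores, a 100,000 MIPS peak, 600 GB/s of aggregate CRAM bandwidth, and 32.5 mW per core; the sustained estimate simply divides the peak by the CPI and is not a measured value.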
5. Efficiency, Scalability, and Limitations
Phalanx’s architectural decisions facilitate scaling and efficient resource usage, but introduce characteristic bottlenecks:
- Cluster Scalability: Compute scales linearly if each cluster’s memory ports and the NOC bisection are underutilized. Beyond ~20 clusters (observed on KU040), NOC and external memory I/O become the principal system bottlenecks.
- Banked Scratchpad Over Caches: By eschewing traditional caches for multi-ported, banked scratchpads, the system achieves predictable, cycle-accurate memory performance and simplifies arbitration.
- Resource Sharing: Functional units (shifters, multipliers) are shared within a cluster to amortize LUT cost while maintaining high attainable frequency.
- Stall and Arbitration: Within clusters, CRAM banking minimizes contention, but some degree of arbitration is always required when access patterns converge.
A plausible implication is that workloads with uniformly distributed memory accesses and little inter-cluster traffic achieve the best performance and energy efficiency.
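The scaling behavior described in this section (linear growth until a shared resource saturates) can be illustrated with a toy bound model. Everything here is a simplifying assumption: the function name `aggregate_mips`, the workload parameter `remote_bytes_per_instr`, and the premise that all remote traffic crosses the NOC bisection are hypothetical, so the numbers indicate the shape of the curve rather than measured behavior.

```python
def aggregate_mips(clusters: int,
                   mips_per_cluster: float,
                   remote_bytes_per_instr: float,
                   bisection_gbit_per_s: float) -> float:
    """Toy model: throughput is the smaller of the compute and NOC bounds.

    remote_bytes_per_instr is a hypothetical workload parameter: average
    bytes per instruction that must cross the NOC bisection.
    """
    compute_bound = clusters * mips_per_cluster
    if remote_bytes_per_instr == 0:
        return compute_bound  # purely cluster-local workload
    # Gb/s -> MB/s, then divide by bytes per instruction to get MIPS.
    noc_bound = bisection_gbit_per_s * 1000 / 8 / remote_bytes_per_instr
    return min(compute_bound, noc_bound)


# 8 PEs x 250 peak MIPS per cluster; ~700 Gb/s bisection as quoted above.
for clusters in (10, 20, 40, 50):
    mips = aggregate_mips(clusters, 2000, 1.0, 700)
    print(f"{clusters} clusters: {mips:,.0f} MIPS")
```

With one remote byte per instruction the model flattens at 87,500 MIPS, while a workload with no inter-cluster traffic keeps scaling linearly, which matches the qualitative point that local, evenly spread accesses scale best.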
6. Design Philosophy and Architectural Trade-offs
Key engineering choices underpinning GRVI Phalanx:
- Minimalist Microarchitecture: Unused ISA features are omitted from individual soft cores, reducing area and critical path length. Only the cluster incorporates heavier arithmetic units.
- Cluster-Level Configuration: Cluster IRAM/CRAM sizing is tunable; custom accelerators can plug into the shared scratchpad as peers of GRVI PEs.
- Message-Based NOC: 300-bit links enable high-throughput 32-byte packet transfers with efficient hardware-level multicast. This supports both shared memory semantics and rapid, bulk programming.
- FPGA-Aware Floorplanning: The use of floorplanned LUT macros for the data path ensures dense placement and preserves high frequencies even for hundreds of instantiations within FPGA resource constraints.
- Toolchain Compatibility: The design targets standard FPGA synthesis flows and leverages open-source software infrastructure.
7. Significance and Research Context
GRVI Phalanx demonstrates that FPGAs can serve as true manycore accelerators for standard RISC-V workloads, without reliance on commercial hard-IP processors or complex cache hierarchies. Its cluster-based organization and scratchpad-centric memory diverge from mainstream multicore and GPU designs, emphasizing high parallelism, predictable behavior, and resource sharing. This architectural approach has implications for manycore systems research, FPGA-based HPC, and the development of open, software-defined accelerator platforms (Gray, 2016).