- The paper introduces EARTH, a novel architecture using shifting-based optimizations to improve constant-stride and segment memory accesses in RISC-V vector processors.
- The design leverages DROM, LSDO, and RCVRF modules to coalesce memory transactions, achieving up to 14.7x speedups and reducing area by 9.11%.
- By eliminating large segment buffers, EARTH cuts power consumption by 29-41% while maintaining performance on unit-stride and segment-intensive operations.
This paper introduces EARTH (Efficient Architecture for RISC-V Vector Memory Access), a novel hardware architecture designed to improve the efficiency of memory accesses in RISC-V vector processors, particularly for constant-stride and segment memory operations (2504.08334). Current vector processors often struggle with these patterns, leading to performance bottlenecks or excessive hardware overhead.
Problem:
- Constant-stride Access: Existing designs either issue multiple memory requests for elements within the same cache line (inefficient) or use complex, high-overhead crossbar networks to gather/scatter data after coalescing requests (costly in area and power, complex routing).
- Segment Access: Handling the required row-column transposition typically involves either processing elements one by one (severely limiting throughput) or using large, dedicated segment buffers (consuming significant area and power).
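To make the two access patterns concrete, here is a minimal behavioural sketch of what a constant-stride load and a segment load are expected to produce. It models memory as a flat Python list indexed by element rather than by byte, and the function names `strided_load` and `segment_load` are illustrative only; neither comes from the paper.

```python
def strided_load(mem, base, stride, vl):
    """Constant-stride load: element i is taken from mem[base + i * stride]."""
    return [mem[base + i * stride] for i in range(vl)]


def segment_load(mem, base, nfields, vl):
    """Segment load: field f of segment i lands in element i of register f.

    Memory holds segments back to back (array of structures); the result is
    one list per destination register (structure of arrays), which is the
    row-column transposition mentioned above.
    """
    return [[mem[base + i * nfields + f] for i in range(vl)]
            for f in range(nfields)]


if __name__ == "__main__":
    mem = list(range(12))               # 4 segments of 3 fields each
    print(strided_load(mem, 1, 3, 4))   # [1, 4, 7, 10]
    print(segment_load(mem, 0, 3, 4))   # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```

In the example, all twelve source elements are contiguous in memory, yet each destination register receives every third element, which is why a naive implementation falls back to element-by-element processing or a staging buffer.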
EARTH Architecture:
EARTH addresses these issues using shifting-based optimization strategies, implemented through three core innovations integrated into an open-source RISC-V vector unit (Saturn):
- Data Reorganization Module (DROM):
  - This module is central to EARTH and uses specialized shift networks: Scatter Shift Network (SSN) and Gather Shift Network (GSN).
  - These networks are implemented as layered structures where each layer performs power-of-2 shifts. They are designed to be conflict-free, ensuring data elements can be efficiently rearranged without path interference.
  - A Shift Count Generation (SCG) module calculates the necessary shift amounts based on stride, element width, and offset.
  - DROM handles the fundamental tasks of gathering (reorganizing strided/scattered data into sequential order) and scattering (distributing sequential data into strided/scattered positions).
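The gather direction of such a layered network can be sketched in software. The model below is an assumption-laden illustration, not the paper's SSN/GSN or SCG logic: it works at element granularity, handles only positive strides, and takes the shift count of the i-th wanted element to be its source position minus its target position. Layer k moves an element down by 2^k slots when bit k of its shift count is set, and an assertion checks that no two elements ever land in the same slot.

```python
def gather_shift_counts(stride, offset, n):
    """Shift count of wanted element i: source slot minus target slot (assumed form)."""
    return [offset + i * stride - i for i in range(n)]


def gather_shift_network(line, stride, offset, n):
    """Move the elements at offset + i*stride to slot i using power-of-2 layers."""
    width = len(line)
    counts = gather_shift_counts(stride, offset, n)

    # Each occupied slot carries (value, shift count); unused slots are None.
    slots = [None] * width
    for i, count in enumerate(counts):
        src = offset + i * stride
        slots[src] = (line[src], count)

    step = 1
    while step < width:                       # one loop iteration per layer
        nxt = [None] * width
        for pos, item in enumerate(slots):
            if item is None:
                continue
            value, count = item
            dst = pos - step if count & step else pos
            assert nxt[dst] is None, "path conflict between two elements"
            nxt[dst] = (value, count)
        slots = nxt
        step <<= 1

    return [slots[i][0] for i in range(n)]


if __name__ == "__main__":
    line = list(range(16))
    print(gather_shift_network(line, stride=3, offset=1, n=5))  # [1, 4, 7, 10, 13]
```

Because the shift counts in this model never decrease from one wanted element to the next, processing the least significant bit first keeps the elements in order at every layer, so the assertion does not fire for valid inputs. A scatter network would apply the same idea in the opposite direction.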
- Load/Store Data Organization (LSDO):
  - Specifically targets constant-stride accesses.
  - Utilizes DROM and a Reverser module (for negative strides).
  - Enables coalescing multiple strided memory accesses within the same aligned memory region (e.g., cache line) into a single memory transaction.
  - DROM efficiently extracts (gathers) the required elements from the coalesced memory response during loads or arranges (scatters) elements correctly into the memory line during stores.
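The effect of coalescing can be illustrated by grouping the element addresses of a strided access by the aligned region they fall in. This is a rough sketch under stated assumptions: the 64-byte region size, the `coalesce` helper, and the example addresses are hypothetical, and elements that straddle a region boundary are ignored.

```python
def coalesce(base, stride_bytes, n, line_bytes=64):
    """Group n strided element addresses by aligned region.

    Returns a dict mapping each region's start address to the byte offsets
    of the elements inside it, in access order. Each key corresponds to one
    memory transaction after coalescing.
    """
    groups = {}
    for i in range(n):
        addr = base + i * stride_bytes
        line_addr = addr - addr % line_bytes
        groups.setdefault(line_addr, []).append(addr - line_addr)
    return groups


if __name__ == "__main__":
    groups = coalesce(base=0x1000, stride_bytes=12, n=16)
    print(len(groups))        # 3 transactions instead of 16
    for line_addr, offsets in groups.items():
        print(hex(line_addr), offsets)
```

With a negative stride the offsets inside a region come out in descending order, which is the kind of case a reversal step has to handle before the data can be gathered.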
- Row/Column-accessible Vector Register File (RCVRF):
  - Addresses segment access inefficiency without dedicated segment buffers.
  - It consists of a Shifted VRF, DROM, and Block Circular Shifters.
  - The Shifted VRF partitions registers into eight ELEN-wide banks using a circular-shifted mapping. This allows simultaneous access to elements of the same segment that are distributed across different registers (column-wise access), as well as standard single-register access (row-wise access).
  - For column-wise reads (needed for segment operations), data read in parallel from the banks is first aligned by the Block Circular Shifters and then reorganized by DROM into the correct sequential order. For column-wise writes, DROM scatters the data before it is written to the banks.
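A circular-shifted bank mapping can be sketched as follows. The exact mapping used by the Shifted VRF is not given here, so the formula `(reg + elem) % NUM_BANKS` is an assumption chosen to show the idea: both a full row (one register) and a full column (the same element position across eight consecutive registers) touch every bank exactly once. The helper names are hypothetical, and the DROM reorganization step is not modelled.

```python
NUM_BANKS = 8  # eight banks, as stated for the Shifted VRF


def bank_of(reg, elem):
    """Assumed circular-shifted mapping: register r is rotated by r banks."""
    return (reg + elem) % NUM_BANKS


def row_access(reg):
    """Banks touched when reading elements 0..7 of a single register."""
    return [bank_of(reg, e) for e in range(NUM_BANKS)]


def column_access(first_reg, elem, nregs):
    """Banks touched when reading one element position across nregs registers."""
    return [bank_of(first_reg + k, elem) for k in range(nregs)]


def rotate_left(bank_data, shift):
    """Block circular shift: undo the rotation introduced by the mapping."""
    shift %= len(bank_data)
    return bank_data[shift:] + bank_data[:shift]


if __name__ == "__main__":
    print(row_access(3))            # [3, 4, 5, 6, 7, 0, 1, 2]
    print(column_access(4, 2, 8))   # [6, 7, 0, 1, 2, 3, 4, 5]

    # Data arrives indexed by bank; a rotation restores register order.
    bank_data = [None] * NUM_BANKS
    for k in range(8):
        bank_data[bank_of(4 + k, 2)] = 4 + k   # tag each value with its register
    print(rotate_left(bank_data, bank_of(4, 2)))  # [4, 5, 6, 7, 8, 9, 10, 11]
```

With an unshifted mapping (bank equal to element index), the column access above would hit a single bank eight times and serialize, which is the conflict the rotation avoids.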
Implementation and Flow:
- EARTH was implemented in Chisel HDL and integrated into the Saturn vector unit on an FPGA platform.
- Strided Flow: Instructions are split by the Load/Store Address Sequencer (LAS/SAS) so that as many elements as possible are coalesced within each aligned memory region, and the resulting requests are sent to memory. For loads, ordered responses go to LSDO, where DROM gathers the requested elements using the generated shift counts, followed by byte alignment; the result is then written to RCVRF row-wise. For stores, data is read from RCVRF row-wise and DROM scatters it into the outgoing memory line.
- Segment Flow: EARTH uses a "Segment-wise" approach. LAS splits requests based on segments, coalescing accesses within the same segment and memory region. Ordered responses are byte-aligned in LSDO and then written to RCVRF using its column-wise access capability, leveraging the Shifted VRF and DROM for transposition.
Evaluation:
- Performance: Compared to the baseline Saturn design, EARTH showed significant speedups on benchmarks dominated by constant-stride accesses (4x–8x, up to 14.7x on stride-intensive tests). Performance on unit-stride and segment-heavy benchmarks remained comparable (within ±3% for unit-stride, ~1.0x for segment-intensive). This demonstrates efficient stride handling and buffer-free segment support without performance loss. Compared to a commercial SpacemiT X60 core (adjusted for frequency and configuration), EARTH showed competitive or superior performance on most benchmarks except those heavy on indexed operations.
- Area: EARTH eliminated the need for large segment buffers. While the RCVRF area increased slightly due to DROM/shifters, the VLSU area decreased significantly. Overall, this led to a 9.11% area reduction in the larger P-Config (VLEN=512) compared to Saturn.
- Power: EARTH reduced power consumption by 29-41% compared to Saturn, primarily due to eliminating segment buffer overhead and reducing memory transactions via coalescing, which lowered internal power despite a slight increase in switching power from the shift logic.
Conclusion:
EARTH presents an effective architecture that significantly improves the performance and efficiency of RISC-V vector processors by tackling key memory access bottlenecks (constant-stride and segment patterns) using novel shifting-based techniques (DROM, LSDO, RCVRF). It achieves substantial speedups for strided operations and eliminates the area/power overhead of segment buffers without sacrificing segment performance, offering a promising design paradigm for future vector processors.