Zoozve: RISC-V Strip-Mining-Free Vectorization
- Zoozve is a RISC-V vector ISA extension that eliminates strip-mining by allowing arbitrary grouping of vector registers.
- It reduces dynamic instruction count and register overhead through innovative software-hardware co-design and a data-adaptive compilation flow.
- Performance evaluations show significant speedups for kernels like FFT, DOT, and AXPY with only a modest increase in hardware area.
Zoozve is a RISC-V vector instruction set extension designed to eliminate the need for strip-mining in vectorized computations. By allowing flexible, arbitrary grouping of vector registers and introducing new architectural concepts and software-hardware co-design, Zoozve addresses performance and efficiency bottlenecks inherent in the standard RISC-V Vector Extension (RVV) when operating on very long vectors. Its approach enables reduction of dynamic instruction count, improved register utilization, and reduced hardware and software overheads, while maintaining manageable implementation cost (Xu et al., 22 Apr 2025).
1. Background: Strip-Mining in Vector Architectures
The standard RVV architecture configures vector registers dynamically via CSRs (e.g., vl, vtype) and offers only a limited set of power-of-two groupings, selected by the LMUL field (values: $1/8$, $1/4$, $1/2$, $1$, $2$, $4$, $8$). This design leads to significant limitations for very long vectors. Physical register constraints and enforced group size quantization mean that logical vectors often cannot be mapped one-to-one to the available hardware resources. Consequently, a large logical vector of length $N$ must be decomposed (strip-mined) into many smaller "strips", each no longer than the maximum vector length that a single register group can hold.
In RVV, strip-mining is managed through either hardware-controlled vsetvl sequences or compiler-generated inner loops with predication. Each strip-mining iteration incurs additional loop-control instructions, repeated CSR writes (vsetvl), predication mask manipulations, and potential register spills for "tail" elements that do not align exactly with vector-length multiples. The net result is increased dynamic instruction count, greater register pressure, misaligned tail handling, code bloat, and reduced performance headroom (Xu et al., 22 Apr 2025).
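The per-strip cost described above can be made concrete with a small model. This is only a sketch: the function names, the default instruction counts per strip (three working vector instructions and four bookkeeping instructions), and the "grant as much as fits" vsetvl policy are illustrative assumptions, not figures from the paper.

```python
import math


def strip_lengths(n, vlmax):
    """Vector length of each strip when n elements are processed at most vlmax at a time."""
    lengths = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, vlmax)  # assumed vsetvl policy: grant as much as fits
        lengths.append(vl)
        remaining -= vl
    return lengths


def strip_mined_instruction_count(n, vlmax, body_instrs=3, overhead_instrs=4):
    """Rough dynamic instruction count for a strip-mined loop over n elements.

    body_instrs: assumed working instructions per strip (e.g. load, op, store).
    overhead_instrs: assumed bookkeeping per strip (e.g. vsetvl, pointer bump,
    counter update, branch).
    """
    strips = math.ceil(n / vlmax)
    return strips * (body_instrs + overhead_instrs)
```

With these assumptions, `strip_lengths(1000, 256)` yields three full strips and a 232-element tail, and `strip_mined_instruction_count(1000, 256)` charges seven instructions to each of the four strips. The bookkeeping share grows linearly with the number of strips, which is the overhead Zoozve aims to remove.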
2. Zoozve ISA Extensions and Architectural Innovations
Zoozve introduces several key modifications and additions to the vector ISA to eliminate strip-mining:
- Instruction Format Enhancements: Each Zoozve vector instruction encodes a 13-bit v_head field (with variants such as vd_head, vs1_head, and vs2_head), which widens the range of addressable physical registers. A scalar rs_avl register specifies the active vector length (AVL). Opcodes and function bits reside in reserved custom encoding spaces.
- Vector Operation Support:
  - Arithmetic and logical instructions (add, mul, and, or, etc.) accept (vd_head, vs1_head, vs2_head, rs_avl).
  - Asymmetric gather/scatter is directly supported, each form with a flexible AVL. This design allows source and destination vectors to have independently specified logical lengths, eliminating RVV's enforced VL equality.
- Arbitrary Grouping: A logical vector of length $N$ is mapped onto a user- or compiler-selected register grouping whose size can be any value, not only a power of two. The compiler chooses the decomposition ({group count} × {group length}) to fit register availability precisely, minimizing or eliminating unused vector lanes and register waste.
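- Register Overhead Modeling: Comparing register overhead between the two designs:
  - RVV: overhead arises because each logical vector is rounded up to a permitted power-of-two group size.
  - Zoozve: overhead is zero or minimized, because the grouping is matched directly to the logical vector.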
These changes facilitate vector instructions that can process the entire logical vector in a single wide instruction, bypassing the repeated strip-mining steps of traditional RVV (Xu et al., 22 Apr 2025).
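The register-overhead comparison can be illustrated with a short calculation. The sketch below assumes whole-register granularity and considers only the integer LMUL settings; the function names and the example parameters are hypothetical and are not taken from the paper.

```python
RVV_INTEGER_LMUL = (1, 2, 4, 8)  # integer group sizes permitted by RVV


def registers_needed(n, sew_bits, vlen_bits):
    """Whole vector registers required to hold n elements of sew_bits each."""
    return (n * sew_bits + vlen_bits - 1) // vlen_bits  # ceiling division


def rvv_group_size(needed):
    """Smallest permitted RVV group covering `needed` registers, or None if
    the vector does not fit in a single group and must be strip-mined."""
    for lmul in RVV_INTEGER_LMUL:
        if lmul >= needed:
            return lmul
    return None


def rvv_wasted_registers(n, sew_bits, vlen_bits):
    """Registers reserved by the RVV group but not needed by the data."""
    needed = registers_needed(n, sew_bits, vlen_bits)
    group = rvv_group_size(needed)
    return None if group is None else group - needed
```

For example, 160 elements of 32 bits with 1024-bit registers need five registers. RVV would reserve a group of eight and leave three unused, whereas an arbitrary grouping of exactly five registers leaves none unused at this granularity.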
3. Data-Adaptive Register Allocation and Compilation Flow
Zoozve’s data-adaptive allocation algorithm involves a staged software pipeline:
- Clang Built-ins to Intrinsics: User code invokes vector operations via built-ins (e.g., __builtin_z_add), specifying vector length (VL) and element data type.
- IR Mapping: Clang converts built-ins to zoozve.* LLVM IR intrinsics, encoding the intended vector length and type.
- Intrinsic Splitting: For vectors exceeding hardware resource constraints, an IR opt pass splits each high-level vector operation into several shorter intrinsics, using delimiter intrinsics to demarcate register group boundaries.
- Register Allocation (RA): A specialized live-interval pass ensures that the logical registers of a group are assigned to consecutive physical registers. The pass selects the group size heuristically to maximize performance within register constraints.
- Coalescing: At assembly, consecutive "strips" with identical opcode and AVL and contiguous registers are re-merged into wide Zoozve instructions to minimize strip-mining.
The total cost model includes a register-spill term, and the allocation algorithm seeks the largest grouping that does not oversubscribe the physical registers (Xu et al., 22 Apr 2025).
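The splitting and coalescing stages can be sketched as a toy model. Everything here is hypothetical: the `Strip` record, its fields, and the `largest_group`, `split`, and `coalesce` helpers are illustrative stand-ins, not the LLVM passes from the paper, and the even-share heuristic in `largest_group` is only one possible way to avoid oversubscribing registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strip:
    opcode: str
    head: int   # first physical register of the group (hypothetical field)
    nregs: int  # number of consecutive registers in the group
    avl: int    # active vector length covered by this instruction


def largest_group(free_regs, operands, wanted):
    """Assumed heuristic: give each operand an equal share of the free registers."""
    return min(wanted, free_regs // operands)


def split(opcode, head, total_regs, group_regs, elems_per_reg):
    """Split one wide operation into strips of at most group_regs registers."""
    strips = []
    offset = 0
    while offset < total_regs:
        n = min(group_regs, total_regs - offset)
        strips.append(Strip(opcode, head + offset, n, n * elems_per_reg))
        offset += n
    return strips


def coalesce(strips):
    """Merge neighbouring strips with the same opcode, the same per-strip AVL,
    and contiguous register groups into one wide instruction."""
    merged = []
    prev = None
    for s in strips:
        if (prev is not None and s.opcode == prev.opcode
                and s.avl == prev.avl and prev.head + prev.nregs == s.head):
            wide = merged[-1]
            merged[-1] = Strip(wide.opcode, wide.head,
                               wide.nregs + s.nregs, wide.avl + s.avl)
        else:
            merged.append(s)
        prev = s
    return merged
```

In this model, `coalesce(split("vadd", 8, 12, 4, 32))` turns three four-register strips back into a single `Strip("vadd", 8, 12, 384)`, while strips whose register groups are not adjacent are left separate.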
4. Compiler and Hardware Implementation
Zoozve’s implementation covers both the software compiler stack and underlying hardware:
Compiler Enhancements
- Extended Clang and LLVM passes to support Zoozve intrinsics and splitting.
- Custom register allocation that respects delimiter-enforced adjacency.
- Final coalescing of adjacent strip instructions post allocation.
- All steps are compatible with LLVM 15.6.0 and run in sequence: builtins→intrinsics mapping, intrinsic splitting, RA with delimiter, coalescing, outputting wide Zoozve instructions.
Hardware Modifications
- Register Addressing: Increased CSR width and new v_head register bits to permit large logical register windows.
- Hazard Detection: Comparator logic for detecting overlap in assigned register groups for in-flight instructions, stalling to prevent hazards.
- Data Path: A shuffle engine (crossbar with per-lane PEs) enables efficient handling of gather/scatter operations and arbitrary register groupings. The base symmetric lanes remain unchanged for symmetric cases.
- CSR Interface: v_head registers are saved and managed via the scoreboard; AVL only needs writing at the beginning of a Zoozve instruction sequence, without per-strip updates.
- Clock and Power: Critical path delay is not affected (400 MHz), and power overhead remains below 3% under full load.
Synthesis at SMIC 40 nm (64 lanes, 1024 registers) shows that the Zoozve extension adds only 0.56 mm² (≈ 5.2%) to a 10.7 mm² core, with dominant area contributors being enhanced register-address comparators and the data-path shuffle engine (Xu et al., 22 Apr 2025).
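The overlap test performed by the hazard-detection comparators can be expressed as an interval check. This is a conservative software sketch with hypothetical names: it treats any shared register between a new instruction and an in-flight one as a reason to stall, without distinguishing reads from writes as real hardware might.

```python
def groups_overlap(head_a, count_a, head_b, count_b):
    """True if register ranges [head, head + count) share at least one register."""
    return head_a < head_b + count_b and head_b < head_a + count_a


def must_stall(new_groups, in_flight_groups):
    """Stall if any group of the new instruction overlaps any in-flight group.

    Each group is a (head, count) pair of physical register indices.
    """
    return any(
        groups_overlap(h1, c1, h2, c2)
        for (h1, c1) in new_groups
        for (h2, c2) in in_flight_groups
    )
```

For instance, a group starting at register 8 with four registers does not collide with in-flight groups covering registers 0 to 7 and 16 to 17, but it does collide with a one-register group at register 10.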
5. Performance Evaluation and Measured Gains
Comprehensive evaluation using LLVM 15.6.0, Spike simulator, and custom Zoozve emulation covered three microbenchmarks: FFT, DOT product, and AXPY (from OpenBLAS), with test vector lengths spanning 32 to 2048 (FFT) and up to 16,384 (AXPY/DOT).
| Kernel | RVV instr. count (min to max) | Zoozve instr. count | Speedup (max) |
|---|---|---|---|
| FFT | 102 (N=32) to 2144 | 17 | 344.44× |
| DOT | 52 to 1292 | 17 | 76× |
| AXPY | 25 to 707 | 12 | 58.92× |
The major benefits derive from:
- Eliminating strip-mining loops and vsetvl CSR writes,
- Forgoing predication mask updates for each strip,
- Preventing register spills endemic to tail strip handling.
This suggests that for long-vector workloads, the dynamic instruction count is drastically reduced (by over an order of magnitude for all considered kernels), with negligible critical-path and power penalties and a manageable silicon area impact (Xu et al., 22 Apr 2025).
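The DOT and AXPY maxima in the table equal the largest RVV instruction count divided by the Zoozve count, which the following sketch recomputes. The dictionary and function names are illustrative. The FFT row is left out because its reported maximum is not the simple quotient of the two counts listed for it.

```python
# (largest RVV instruction count, Zoozve instruction count), copied from the table
TABLE = {
    "DOT": (1292, 17),
    "AXPY": (707, 12),
}


def max_reduction(rvv_max, zoozve):
    """Ratio of the largest RVV count to the Zoozve count, to two decimals."""
    return round(rvv_max / zoozve, 2)


ratios = {kernel: max_reduction(*counts) for kernel, counts in TABLE.items()}
```

Here `ratios` evaluates to `{"DOT": 76.0, "AXPY": 58.92}`, matching the tabulated values.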
6. Design Trade-Offs, Compatibility, and Open Questions
Zoozve’s modifications introduce trade-offs in compatibility and raise several open challenges:
- Compatibility: Adoption requires custom-0/1/2 opcodes and larger CSRs, thus necessitating non-trivial changes to standard RVV toolchains and hardware logic.
- Predication: The current design supports only a fixed AVL in rs_avl and disables predicated (whilelt-style) loops. Reintroducing RVV's full mask register semantics remains open.
- Multi-core Coherence: Ensuring correct core-level interaction when different cores use different group sizes for the same logical registers is unresolved.
- Debug and Monitoring: Introspection of wide, coalesced instructions for debugging or performance profiling presents new challenges.
A plausible implication is that while Zoozve demonstrates significant performance and efficiency improvements, successful adoption at scale will depend on addressing ecosystem and toolchain integration, extending mask/predication support, and ensuring robust software-hardware interface diagnostics (Xu et al., 22 Apr 2025).
7. Conclusions and Impact
Zoozve represents a substantial refinement of vector register management, enabling arbitrary, exact-length grouping and direct mapping of long logical vectors to hardware, thereby obviating the need for strip-mining. Its architectural extensions—principally the new v_head/rs_avl based encoding, the data-adaptive LLVM compilation pipeline, and minimal hardware augmentations (shuffle engine, comparator logic)—deliver 10×–300× reductions in dynamic instruction counts for key linear algebra and signal processing kernels. The observed silicon area increase is modest (5.2%), with critical-path and power impacts kept within practical bounds.
These properties make Zoozve a compelling candidate for next-generation ultra-long-vector RISC-V designs targeting high-performance data-parallel workloads (Xu et al., 22 Apr 2025).