
Zoozve: RISC-V Strip-Mining-Free Vectorization

Updated 17 March 2026
  • Zoozve is a RISC-V vector ISA extension that eliminates strip-mining by allowing arbitrary grouping of vector registers.
  • It reduces dynamic instruction count and register overhead through innovative software-hardware co-design and a data-adaptive compilation flow.
  • Performance evaluations show significant speedups for kernels like FFT, DOT, and AXPY with only a modest increase in hardware area.

Zoozve is a RISC-V vector instruction set extension designed to eliminate the need for strip-mining in vectorized computations. By allowing flexible, arbitrary grouping of vector registers and introducing new architectural concepts and software-hardware co-design, Zoozve addresses performance and efficiency bottlenecks inherent in the standard RISC-V Vector Extension (RVV) when operating on very long vectors. Its approach enables reduction of dynamic instruction count, improved register utilization, and reduced hardware and software overheads, while maintaining manageable implementation cost (Xu et al., 22 Apr 2025).

1. Background: Strip-Mining in Vector Architectures

The standard RVV architecture configures vector registers dynamically via CSRs (e.g., vl, vtype) and employs a limited set of power-of-two groupings, designated by the LMUL field (values: $\frac{1}{8}$, $\frac{1}{4}$, $\frac{1}{2}$, $1$, $2$, $4$, $8$). This design leads to significant limitations for very long vectors. Physical register constraints and enforced group size quantization mean that logical vectors often cannot be mapped one-to-one to the available hardware resources. Consequently, a large logical vector of length $N$ must be decomposed (strip-mined) into many smaller "strips" of size at most $VLMAX$.

In RVV, strip-mining is managed through either hardware-controlled vsetvl sequences or compiler-generated inner loops with predication. Each strip-mining iteration incurs additional loop control, repeated CSR writes (vsetvl), predication mask manipulations, and potential register spills for "tail" elements that do not align exactly with vector-length multiples. The net result is increased dynamic instruction count, greater register pressure, misaligned tail handling, code bloat, and reduced performance headroom (Xu et al., 22 Apr 2025).
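The per-strip overhead described above can be made concrete with a small software model. The following is a minimal sketch, not RVV code: the function name `strip_mined_axpy` is hypothetical, and the length update is simplified to `min(vlmax, remaining)`, which is one behaviour a vsetvl-style instruction may exhibit.

```python
import math

def strip_mined_axpy(a, x, y, vlmax):
    """AXPY (a*x + y) processed in strips of at most `vlmax` elements.

    Returns the result and the number of strips, i.e. how many times the
    loop body (with its vsetvl-style length update) would execute.
    """
    n = len(x)
    out = list(y)
    strips = 0
    i = 0
    while i < n:
        # Simplified model of the length a vsetvl-style update would grant.
        vl = min(vlmax, n - i)
        # One "vector instruction" covering vl elements.
        for j in range(i, i + vl):
            out[j] = a * x[j] + y[j]
        i += vl
        strips += 1
    return out, strips

result, strips = strip_mined_axpy(2.0, [1.0] * 100, [0.5] * 100, vlmax=32)
print(strips)  # number of loop iterations, equal to ceil(100 / 32)
```

Here the last strip covers only the 4 tail elements, yet it still pays for a full loop iteration and length update.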

2. Zoozve ISA Extensions and Architectural Innovations

Zoozve introduces several key modifications and additions to the vector ISA to eliminate strip-mining:

  • Instruction Format Enhancements: Each Zoozve vector instruction encodes a 13-bit v_head field (with variants like vd_head, vs1_head, vs2_head), supporting addressing of up to $2^{13}$ physical registers. A scalar rs_avl register specifies the active vector length (AVL). Opcodes and function bits reside in reserved custom encoding spaces.
  • Vector Operation Support:
    • Arithmetic and logical instructions (add, mul, and, or, etc.) accept (vd_head, vs1_head, vs2_head, rs_avl).
    • Asymmetric gather/scatter is directly supported: for scatter, $vd[vs2[i]] \leftarrow vs1[i]$ for $i \in [0, AVL-1]$, and for gather, $vd[i] \leftarrow vs1[vs2[i]]$, each with flexible AVL. This design allows source and destination vectors to have independently specified logical lengths, eliminating RVV's enforced VL equality.
  • Arbitrary Grouping: Logical vectors of length $N$ are mapped with user- or compiler-selected grouping: $N = g \times L$ for any $g, L \in \mathbb{N}$, not constrained to powers of two. The compiler chooses the decomposition ({group count} × {group length}) to precisely fit register availability, minimizing or eliminating unused vector lanes and register waste.
  • Register Overhead Modeling: Comparing overhead,
    • RVV: $Overhead_{RVV} = (g_{RVV} \times LMUL - N)$
    • Zoozve: $Overhead_{Zoozve} = 0$ or minimized, by direct matching of grouping to $N$.
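The overhead comparison can be tried on concrete numbers. The sketch below is illustrative only and assumes that both the group size and the logical vector size are counted in registers; the helper names and the restriction to integer LMUL values are assumptions made here, not part of the source.

```python
import math

RVV_LMULS = (1, 2, 4, 8)  # integer LMUL values only; fractional ones are omitted here

def rvv_overhead(regs_needed, lmul):
    """Registers allocated beyond `regs_needed` when using g groups of size `lmul`."""
    groups = math.ceil(regs_needed / lmul)  # g_RVV in the formula above
    return groups * lmul - regs_needed

def arbitrary_overhead(regs_needed):
    """With arbitrary grouping, one group of exactly `regs_needed` registers fits."""
    group_count, group_len = 1, regs_needed
    return group_count * group_len - regs_needed

needed = 13  # a logical vector that occupies 13 registers
per_lmul = {lmul: rvv_overhead(needed, lmul) for lmul in RVV_LMULS}
print(per_lmul)                    # wasted registers for each LMUL choice
print(arbitrary_overhead(needed))  # wasted registers with exact grouping
```

In this model LMUL = 1 wastes nothing but needs 13 separate groups, while larger LMUL values need fewer groups at the price of unused registers. Exact grouping avoids that trade-off.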

These changes facilitate vector instructions that can process the entire logical vector in a single wide instruction, bypassing the repeated strip-mining steps of traditional RVV (Xu et al., 22 Apr 2025).
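The gather and scatter semantics listed above can be written out as a reference model. This is a minimal sketch using Python lists in place of register groups; the function names and sample data are hypothetical.

```python
def gather(vs1, vs2, avl):
    """vd[i] <- vs1[vs2[i]] for i in [0, avl-1]; the result has avl elements."""
    return [vs1[vs2[i]] for i in range(avl)]

def scatter(vd, vs1, vs2, avl):
    """vd[vs2[i]] <- vs1[i] for i in [0, avl-1]; returns an updated copy of vd."""
    out = list(vd)
    for i in range(avl):
        out[vs2[i]] = vs1[i]
    return out

table = [10, 20, 30, 40, 50, 60, 70, 80]  # 8-element data vector
idx = [7, 0, 3]                           # 3-element index vector

picked = gather(table, idx, avl=3)
placed = scatter([0] * 8, [1, 2, 3], idx, avl=3)
print(picked)  # [80, 10, 40]
print(placed)  # [2, 0, 0, 3, 0, 0, 0, 1]
```

The example is deliberately asymmetric: the data vector has 8 elements while the index vector and AVL cover only 3, which is the kind of length mismatch the text says Zoozve permits.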

3. Data-Adaptive Register Allocation and Compilation Flow

Zoozve’s data-adaptive allocation algorithm involves a staged software pipeline:

  1. Clang Built-ins to Intrinsics: User code invokes vector operations via built-ins (e.g., __builtin_z_add), specifying vector length (VL) and element data type.
  2. IR Mapping: Clang converts built-ins to zoozve.* LLVM IR intrinsics, encoding intended vector length and type.
  3. Intrinsic Splitting: For vectors exceeding hardware resource constraints, an IR opt pass splits high-level vector operations into $S = \lceil N / L_{target} \rceil$ intrinsics, each of $L_{target}$ length, using delimiter intrinsics to demarcate register group boundaries.
  4. Register Allocation (RA): A specialized live-interval pass ensures the $S$ logical registers are assigned to $S$ consecutive physical registers. The pass selects $L_{target}$ heuristically to maximize performance within register constraints.
  5. Coalescing: At assembly, consecutive “strips” with identical opcode, AVL, and contiguous v_head are re-merged into wide Zoozve instructions to minimize strip-mining.
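The splitting and coalescing steps above can be sketched as two small functions. This is a toy model under stated assumptions, not the LLVM passes themselves: the `Strip` record, the `elems_per_reg` parameter, and the rule that each strip occupies whole registers are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Strip:
    opcode: str
    v_head: int  # first physical register of the strip
    regs: int    # number of registers the strip covers
    avl: int     # number of elements the strip processes

def split(opcode, n, l_target, elems_per_reg, first_reg):
    """Split an n-element operation into ceil(n / l_target) strips on consecutive registers."""
    strips, head = [], first_reg
    for s in range(math.ceil(n / l_target)):
        avl = min(l_target, n - s * l_target)
        regs = math.ceil(avl / elems_per_reg)
        strips.append(Strip(opcode, head, regs, avl))
        head += regs
    return strips

def coalesce(strips):
    """Merge runs of strips with the same opcode, same per-strip AVL, and contiguous registers."""
    merged = []
    run_avl = None  # per-strip AVL of the run currently being merged
    for s in strips:
        prev = merged[-1] if merged else None
        if (prev is not None and prev.opcode == s.opcode and run_avl == s.avl
                and prev.v_head + prev.regs == s.v_head):
            merged[-1] = Strip(s.opcode, prev.v_head, prev.regs + s.regs, prev.avl + s.avl)
        else:
            merged.append(s)
            run_avl = s.avl
    return merged

strips = split("add", n=96, l_target=32, elems_per_reg=8, first_reg=4)
wide = coalesce(strips)
print(len(strips), wide)
```

With these inputs the three 32-element strips collapse into a single wide instruction starting at register 4. A tail strip with a different AVL would stay separate under the "identical AVL" rule as modelled here.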

The total cost is modeled as $Cost(S, L_{target}) \approx \alpha \times S + \beta \times (\text{register spill})$, and the allocation algorithm seeks to pick the largest $L_{target}$ that does not oversubscribe physical registers (Xu et al., 22 Apr 2025).
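One way to read this cost model is as a search over candidate strip lengths. The sketch below is an assumption-heavy illustration: the spill term is modelled as the number of registers by which the live strips exceed the physical register file, the weights `alpha` and `beta` are arbitrary, and `live_vectors` and `elems_per_reg` are hypothetical parameters not given in the source.

```python
import math

def choose_l_target(n, live_vectors, phys_regs, elems_per_reg, alpha=1.0, beta=10.0):
    """Return (l_target, cost) minimising alpha * S + beta * spilled registers.

    Ties in cost are broken in favour of the largest l_target.
    """
    max_regs = math.ceil(n / elems_per_reg)
    candidates = []
    for regs_per_strip in range(1, max_regs + 1):
        l_target = regs_per_strip * elems_per_reg
        strips = math.ceil(n / l_target)  # S
        # Assumed spill model: one strip of every live vector is resident at once.
        spilled = max(0, live_vectors * regs_per_strip - phys_regs)
        cost = alpha * strips + beta * spilled
        candidates.append((cost, -l_target))
    cost, neg_l_target = min(candidates)
    return -neg_l_target, cost

l_target, cost = choose_l_target(n=64, live_vectors=3, phys_regs=16, elems_per_reg=8)
print(l_target, cost)  # 40 2.0
```

In this example strip lengths of 32 and 40 elements both need two strips without spilling, and the tie-break selects 40, the largest length that does not oversubscribe the 16 registers.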

4. Compiler and Hardware Implementation

Zoozve’s implementation covers both the software compiler stack and underlying hardware:

Compiler Enhancements

  • Extended Clang and LLVM passes to support Zoozve intrinsics and splitting.
  • Custom register allocation that respects delimiter-enforced adjacency.
  • Final coalescing of adjacent strip instructions post allocation.
  • All steps are compatible with LLVM 15.6.0 and run in sequence: builtins→intrinsics mapping, intrinsic splitting, RA with delimiter, coalescing, outputting wide Zoozve instructions.

Hardware Modifications

  • Register Addressing: Increased CSR width and new v_head register bits to permit large logical register windows.
  • Hazard Detection: Comparator logic for detecting overlap in assigned register groups for in-flight instructions, stalling to prevent hazards.
  • Data Path: A shuffle engine (crossbar with per-lane PEs) enables efficient handling of gather/scatter operations and arbitrary register groupings. The base symmetric lanes remain unchanged for symmetric cases.
  • CSR Interface: v_head registers are saved and managed via the scoreboard; AVL only needs writing at the beginning of a Zoozve instruction sequence, without per-strip updates.
  • Clock and Power: Critical path delay is not affected (400 MHz), and power overhead remains below 3% under full load.
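The hazard-detection comparators can be thought of as interval-overlap checks on register groups. The following is a minimal sketch of that idea, assuming each group is described by a (head, length) pair and modelling only read-after-write stalls; the function names are hypothetical.

```python
def groups_overlap(head_a, len_a, head_b, len_b):
    """True if register ranges [head_a, head_a+len_a) and [head_b, head_b+len_b) intersect."""
    return head_a < head_b + len_b and head_b < head_a + len_a

def must_stall(new_sources, in_flight_dests):
    """Stall when any source group of the new instruction overlaps an in-flight destination."""
    return any(
        groups_overlap(src_head, src_len, dst_head, dst_len)
        for (src_head, src_len) in new_sources
        for (dst_head, dst_len) in in_flight_dests
    )

in_flight = [(100, 12)]                     # destination occupies registers 100..111
print(must_stall([(108, 8)], in_flight))    # True: 108..115 overlaps 100..111
print(must_stall([(112, 8)], in_flight))    # False: 112..119 starts after the group ends
```

Because groups may have arbitrary lengths, the check needs both range comparisons instead of a simple equality test on register numbers.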

Synthesis at SMIC 40 nm (64 lanes, 1024 registers) shows that the Zoozve extension adds only 0.56 mm² (≈ 5.2%) to a 10.7 mm² core, with dominant area contributors being enhanced register-address comparators and the data-path shuffle engine (Xu et al., 22 Apr 2025).

5. Performance Evaluation and Measured Gains

Comprehensive evaluation using LLVM 15.6.0, Spike simulator, and custom Zoozve emulation covered three microbenchmarks: FFT, DOT product, and AXPY (from OpenBLAS), with test vector lengths spanning 32 to 2048 (FFT) and up to 16,384 (AXPY/DOT).

| Kernel | RVV instr. count (min to max) | Zoozve instr. count | Speedup (max) |
|---|---|---|---|
| FFT | 102 (N=32) to 2144 | 17 | 344.44× |
| DOT | 52 to 1292 | 17 | 76× |
| AXPY | 25 to 707 | 12 | 58.92× |

The major benefits derive from:

  • Eliminating strip-mining loops and vsetvl CSR writes,
  • Forgoing predication mask updates for each strip,
  • Preventing register spills endemic to tail strip handling.

This suggests that for long-vector workloads the dynamic instruction count is drastically reduced, by over an order of magnitude at the largest tested lengths for all considered kernels, with negligible critical-path and power penalties and a manageable silicon area impact (Xu et al., 22 Apr 2025).
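For DOT and AXPY, the reported maximum speedups match the ratio of the largest RVV instruction count to the Zoozve instruction count. The short check below uses only the counts quoted in the evaluation. The FFT row is left out because its reported figure cannot be derived from the two listed counts alone.

```python
# (largest RVV instruction count, Zoozve instruction count) as quoted in the evaluation
counts = {"DOT": (1292, 17), "AXPY": (707, 12)}

ratios = {kernel: round(rvv / zoozve, 2) for kernel, (rvv, zoozve) in counts.items()}
print(ratios)  # {'DOT': 76.0, 'AXPY': 58.92}
```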

6. Design Trade-Offs, Compatibility, and Open Questions

Zoozve’s modifications introduce trade-offs in compatibility and raise several open challenges:

  • Compatibility: Adoption requires custom-0/1/2 opcodes and larger CSRs, thus necessitating non-trivial changes to standard RVV toolchains and hardware logic.
  • Predication: The current design supports only a fixed AVL in rs_avl and disables predicated (whilelt-style) loops. Reintroducing RVV’s full mask register semantics remains open.
  • Multi-core Coherence: Ensuring correct core-level interaction when different cores use different group sizes for the same logical registers is unresolved.
  • Debug and Monitoring: Introspection of wide, coalesced instructions for debugging or performance profiling presents new challenges.

A plausible implication is that while Zoozve demonstrates significant performance and efficiency improvements, successful adoption at scale will depend on addressing ecosystem and toolchain integration, extending mask/predication support, and ensuring robust software-hardware interface diagnostics (Xu et al., 22 Apr 2025).

7. Conclusions and Impact

Zoozve represents a substantial refinement of vector register management, enabling arbitrary, exact-length grouping and direct mapping of long logical vectors to hardware, thereby obviating the need for strip-mining. Its architectural extensions (principally the new v_head/rs_avl based encoding, the data-adaptive LLVM compilation pipeline, and minimal hardware augmentations such as the shuffle engine and comparator logic) deliver reductions in dynamic instruction count of up to 58.92× (AXPY), 76× (DOT), and 344.44× (FFT) for key linear algebra and signal processing kernels. The observed silicon area increase is modest (5.2%), with critical-path and power impacts kept within practical bounds.

These properties make Zoozve a compelling candidate for next-generation ultra-long-vector RISC-V designs targeting high-performance data-parallel workloads (Xu et al., 22 Apr 2025).

References (1)
