CVA6S+: Dual-Issue Superscalar RISC-V Core
- CVA6S+ is an open-source, dual-issue in-order superscalar RISC-V core that enhances IPC and energy efficiency with advanced microarchitectural features.
- It employs a six-stage pipeline with increased fetch bandwidth and supports dual dispatch for integer and floating-point operations, improving performance in automotive and industrial applications.
- The integration of a two-level branch predictor with register renaming and an out-of-order, high-bandwidth HPDCache achieves up to 43.5% IPC gain and superior memory throughput while maintaining area efficiency.
CVA6S+ is an open-source, dual-issue, in-order superscalar RISC-V processor core that builds on the CVA6 and CVA6S architectures. Designed for high-end embedded domains such as automotive and industrial control, it integrates several advanced microarchitectural features: improved branch prediction, register renaming, enhanced operand forwarding, and an out-of-order, high-bandwidth L1 D-cache (HPDCache). Together, these enhancements aim to maximize instructions per cycle (IPC) and area/energy efficiency within the constraints of industrial VLSI flows and open platforms (Tedeschi et al., 20 Apr 2025, Fu et al., 30 May 2025).
1. Pipeline Architecture and Microarchitectural Enhancements
CVA6S+ extends the six-stage pipeline of the baseline CVA6 (PC Generation, Instruction Fetch, Decode, Issue, Execute, Commit, with branch resolution in Execute) while increasing the fetch, decode, and issue bandwidth to enable sustained two-wide issue. The front end fetches up to two 32-bit or four 16-bit compressed instructions per cycle over a 64-bit fetch bus and stores them in a 16-entry fetch buffer. Decode and issue hardware are duplicated, supporting dual dispatch of both integer and floating-point operations. The Execute stage comprises two integer ALUs and a shared FPU/writeback datapath.
The design remains strictly in-order, with no out-of-order instruction scheduling beyond the HPDCache subsystem. Register renaming, implemented via a 32-entry register-alias table and a free-list, eliminates write-after-write (WAW) hazards without requiring a dynamically scheduled instruction window. ALU-to-ALU forwarding enables zero-latency operand bypass within the same cycle, minimizing serialization from back-to-back integer operations (Tedeschi et al., 20 Apr 2025, Fu et al., 30 May 2025).
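The renaming idea described above can be illustrated with a minimal sketch. It assumes a 32-entry alias table, a hypothetical pool of 48 physical registers, and no register reclamation at commit; the class and method names are illustrative and are not taken from the CVA6S+ RTL.

```python
from collections import deque


class Renamer:
    """Toy register-alias table (RAT) backed by a free list of physical registers."""

    def __init__(self, num_arch=32, num_phys=48):  # num_phys is an assumed value
        # Initially, architectural register i maps to physical register i.
        self.rat = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, rd, rs1, rs2):
        """Rename one instruction and return (phys_rd, phys_rs1, phys_rs2)."""
        # Sources read the current mapping, i.e. the most recent writer.
        p1, p2 = self.rat[rs1], self.rat[rs2]
        if rd == 0:
            # x0 is hard-wired to zero in RISC-V and is never renamed.
            return 0, p1, p2
        if not self.free:
            raise RuntimeError("no free physical register (a real core would stall)")
        # Each new writer gets a fresh physical register, so two writes to the
        # same architectural register no longer collide (no WAW hazard).
        pd = self.free.popleft()
        self.rat[rd] = pd
        return pd, p1, p2


r = Renamer()
first = r.rename(rd=5, rs1=1, rs2=2)   # x5 = x1 op x2
second = r.rename(rd=5, rs1=3, rs2=4)  # x5 = x3 op x4, WAW on x5 with the line above
reader = r.rename(rd=6, rs1=5, rs2=0)  # x6 = x5 op x0, must see the second writer
```

In this sketch the two writes to `x5` land in different physical registers, and the later reader of `x5` is steered to the most recent one, which is the property that lets both writers be issued together.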
2. Branch Prediction and Hazard Mitigation
Branch prediction in CVA6S+ is based on a two-level local predictor (Yeh–Patt style), featuring a 128-entry branch-target buffer with each entry maintaining a 3-bit local history to index a pattern history table. This branch predictor replaces the one-level bimodal predictor of CVA6, reducing branch-misprediction penalty by approximately 30% and yielding up to 4.6% additional IPC on Embench-IoT workloads. The expected misprediction stall can be formalized as:
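$$T_{\text{stall}} = N_{\text{br}} \cdot p_{\text{miss}} \cdot C_{\text{flush}}$$
where $N_{\text{br}}$ is the branch count, $p_{\text{miss}}$ the misprediction rate, and $C_{\text{flush}}$ the flush penalty in cycles (Tedeschi et al., 20 Apr 2025).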
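A small behavioral sketch may help show why a local history helps. It assumes 128 entries with 3 bits of history each, as stated above, plus two details the text does not specify: a per-entry pattern table of 2-bit saturating counters and a simple modulo index. The flush penalty of 5 cycles is a hypothetical value, and the stall estimate is the product of branch count, misprediction rate, and flush penalty.

```python
class TwoLevelLocalPredictor:
    """Toy two-level local predictor: per-branch history selects a 2-bit counter."""

    def __init__(self, entries=128, hist_bits=3):
        self.entries = entries
        self.mask = (1 << hist_bits) - 1
        self.history = [0] * entries
        # Assumption: one small pattern table per entry, counters start at
        # "weakly not taken" (value 1 on a 0..3 scale).
        self.pht = [[1] * (1 << hist_bits) for _ in range(entries)]

    def _index(self, pc):
        # Drop bit 0: RISC-V instructions are at least 2-byte aligned.
        return (pc >> 1) % self.entries

    def predict(self, pc):
        i = self._index(pc)
        return self.pht[i][self.history[i]] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        h = self.history[i]
        ctr = self.pht[i][h]
        self.pht[i][h] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # Shift the newest outcome into the local history.
        self.history[i] = ((h << 1) | int(taken)) & self.mask


def misprediction_stall(n_branches, miss_rate, flush_penalty):
    """Expected stall cycles: branch count x misprediction rate x flush penalty."""
    return n_branches * miss_rate * flush_penalty


pred = TwoLevelLocalPredictor()
pc = 0x80000010                          # hypothetical branch address
outcomes = [True, True, False] * 100     # a short loop: taken, taken, not taken
misses = []
for taken in outcomes:
    misses.append(pred.predict(pc) != taken)
    pred.update(pc, taken)

late_misses = sum(misses[150:])          # mispredictions after warm-up
stall_cycles = misprediction_stall(len(outcomes), sum(misses) / len(outcomes), 5)
```

After a short warm-up, each 3-bit history value in this repeating pattern maps to a single outcome, so the sketch stops mispredicting. A one-level bimodal counter cannot capture such a pattern, because it sees only the branch's overall bias.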
The register renaming scheme, by dynamically tracking most recent writers in a one-to-one architectural-to-physical register mapping, eliminates WAW hazards across dual-issued instructions. Physical register mapping is resolved in the decode stage with a single-cycle penalty, avoiding additional pipeline depth. Operand forwarding directly connects ALU1 to ALU2, ensuring the second of two dependent ALU operations issued in the same cycle can consume results without additional delay. Synthesis results report less than 0.5% clock frequency degradation due to the bypass logic.
3. Memory Subsystem: HPDCache Integration
Distinct from its predecessor's blocking L1 D-cache, CVA6S+ employs the OpenHW Core-V HPDCache. HPDCache is a three-stage, non-blocking pipelined L1 D-cache with multiple request ports, a deep MSHR pool, and a simple hardware prefetcher. It services loads and stores out-of-order, supporting both write-through and write-back policies, as well as atomic memory operations. The comparison to the legacy cache subsystem highlights improved memory-level parallelism and throughput:
- Bandwidth improvement: 74.1% average (RaiderSTREAM benchmark)
- D-cache area reduction: 19% (0.095 mm² legacy vs. 0.077 mm² HPDCache)
- Cache miss rates are unchanged for sequential access patterns, but effective throughput improves for irregular access patterns thanks to out-of-order completion
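Bandwidth is calculated as
$$\mathrm{BW} = \frac{\text{bytes transferred}}{\text{execution time}}$$
with relative improvement,
$$\Delta\mathrm{BW} = \frac{\mathrm{BW}_{\text{HPDCache}} - \mathrm{BW}_{\text{legacy}}}{\mathrm{BW}_{\text{legacy}}} \times 100\%$$
(Tedeschi et al., 20 Apr 2025).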
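The arithmetic behind these figures can be checked with a few lines of code. The sketch below assumes bandwidth is bytes moved divided by elapsed time and that improvement is measured relative to the legacy cache. The area values are the ones quoted in the list above; the bandwidth inputs are hypothetical placeholders chosen only to illustrate a 74.1% gain.

```python
def bandwidth(bytes_moved, seconds):
    """Sustained bandwidth in bytes per second."""
    return bytes_moved / seconds


def relative_improvement(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# D-cache area figures quoted above, in mm^2.
legacy_area, hpdcache_area = 0.095, 0.077
area_reduction = -relative_improvement(hpdcache_area, legacy_area)

# Hypothetical run: the same 1 MiB copied in 1.741 ms vs. 1.000 ms.
legacy_bw = bandwidth(1 << 20, 1.741e-3)
hpdcache_bw = bandwidth(1 << 20, 1.000e-3)
bw_gain = relative_improvement(hpdcache_bw, legacy_bw)

print(f"area reduction: {area_reduction:.1f}%")
print(f"bandwidth gain: {bw_gain:.1f}%")
```

With the quoted areas, the reduction comes out just under 19%, which matches the rounded figure in the list.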
4. Quantitative Performance and Efficiency Metrics
Performance evaluation on Embench-IoT and CoreMark kernels demonstrates substantial uplift:
| Core | Avg. IPC on Embench-IoT (relative to CVA6) | CoreMark/MHz | Area (core+L1, mm²) | Max Freq (MHz) |
|---|---|---|---|---|
| CVA6 | 1.00 (baseline) | 2.83 | 0.175 | 1,090 |
| CVA6S | +29.6% | 3.41 | — | — |
| CVA6S+ | +43.5% | 3.69 | 0.191 | 1,095 |
The area overhead of CVA6S+ with HPDCache is 9.3% over scalar CVA6, while the pipeline logic grows by ≈28.6%. There is no clock-frequency loss; in fact, SRAM macro re-optimization allows a marginal frequency gain. In a separate implementation experiment (Fu et al., 30 May 2025), CVA6S+ exhibits a 34.4% IPC gain and 6% area increase, achieving leading area efficiency (GOPS/mm²) against both scalar CVA6 and the out-of-order C910.
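Energy and area efficiency metrics are defined:
$$\text{Energy efficiency} = \frac{\text{GOPS}}{\text{Power (W)}}, \qquad \text{Area efficiency} = \frac{\text{GOPS}}{\text{Area (mm}^2\text{)}}$$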
A power evaluation at 900 MHz (matmul-int kernel) gives 9.0 GOPS/W for CVA6S+ and 8.9 GOPS/W for CVA6 (Fu et al., 30 May 2025).
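The ratios implied by the figures in this section can be recomputed directly. The sketch below uses only numbers quoted above (the table rows for CVA6 and CVA6S+ and the GOPS/W values from the power evaluation); the dictionary keys are illustrative names. Note that the table and the power figures come from different studies, so they are combined here only for convenience.

```python
cva6 = {
    "coremark_per_mhz": 2.83,
    "area_mm2": 0.175,
    "fmax_mhz": 1090,
    "gops_per_w": 8.9,
}
cva6s_plus = {
    "coremark_per_mhz": 3.69,
    "area_mm2": 0.191,
    "fmax_mhz": 1095,
    "gops_per_w": 9.0,
}


def pct_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Relative change of CVA6S+ over CVA6 for every metric.
changes = {key: pct_change(cva6s_plus[key], cva6[key]) for key in cva6}

for key, value in changes.items():
    print(f"{key:>18}: {value:+.2f}%")
```

From the rounded table values the area overhead works out to roughly 9.1%, slightly below the quoted 9.3%; the gap is presumably due to rounding of the published areas.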
5. Comparative Analysis, Trade-offs, and SoC Integration
CVA6S+ is positioned as a high-IPC, area-/energy-efficient solution for RISC-V, offering a middle ground between minimal-complexity scalar cores and resource-intensive, out-of-order superscalar cores such as C910. The main drivers of performance in CVA6S+ are:
- Dual-issue front end: +20% IPC uplift
- Two-level branch predictor: 30% reduction in branch stall cycles (10% IPC)
- Register renaming and ALU bypass: additional 6% IPC
- HPDCache: 74.1% bandwidth boost enables full dual-issue utilization on memory-intensive kernels
The cost is limited to ≈9.3% total area overhead vs. scalar CVA6, with no frequency loss and a decrease in D-cache area (due to HPDCache macro efficiency). Complexity increase in branch prediction, renaming, and cache subsystems is offset by re-use of verified open-source infrastructure (Tedeschi et al., 20 Apr 2025).
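As a rough sanity check, the itemized IPC contributions listed above can be combined in two simple ways. The sketch assumes the three quoted percentages are independent and either add or multiply; neither assumption is stated in the source, and the HPDCache contribution is left out because it is quoted as bandwidth rather than IPC.

```python
# IPC contributions quoted in the list above, as fractions.
contributions = {
    "dual_issue_front_end": 0.20,
    "two_level_branch_predictor": 0.10,
    "renaming_and_alu_bypass": 0.06,
}

# Additive composition: gains simply sum.
additive = sum(contributions.values())

# Multiplicative composition: each gain scales the previous result.
multiplicative = 1.0
for gain in contributions.values():
    multiplicative *= 1.0 + gain
multiplicative -= 1.0

reported = 0.435  # average Embench-IoT uplift quoted for CVA6S+

print(f"additive:       {additive:.1%}")
print(f"multiplicative: {multiplicative:.1%}")
print(f"reported:       {reported:.1%}")
```

Both compositions land somewhat below the reported 43.5%, which is consistent with the itemized figures being approximate and with part of the uplift on memory-intensive kernels coming from the cache subsystem.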
SoC integration is validated via the open-source Cheshire platform, with a 64-bit AXI crossbar, 64 KB 2-way L1 I/D caches, and a 512 KB L2 last-level cache, ensuring RISC-V compliance (PLIC/CLINT interrupt, debug, memory interfaces) (Fu et al., 30 May 2025).
6. Analytical Modeling and Future Directions
Theoretical IPC in a dual-issue in-order design is limited by three principal stall sources. The superscalar IPC model is:
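$$\mathrm{IPC} = \mathrm{IPC}_{\max}\,\bigl(1 - s_{\text{dep}} - s_{\text{mem}} - s_{\text{br}}\bigr)$$
where $\mathrm{IPC}_{\max} = 2$, $s_{\text{dep}}$ is the dependency stall fraction, $s_{\text{mem}}$ is the memory stall fraction, and $s_{\text{br}}$ is the branch stall fraction.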
For CVA6S+, the result of this model aligns with the observed IPC uplifts, quantifying the incremental value of each microarchitectural enhancement.
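A stall-fraction model of this kind is easy to explore numerically. The sketch below assumes the simplest linear form, in which peak issue width is scaled by the fraction of issue slots not lost to dependency, memory, or branch stalls. Both the functional form and the stall fractions used here are illustrative assumptions, not measured CVA6S+ values.

```python
def modeled_ipc(issue_width, dep_stall, mem_stall, branch_stall):
    """Linear stall model: peak IPC scaled by the fraction of useful issue slots."""
    total_stall = dep_stall + mem_stall + branch_stall
    if not 0.0 <= total_stall <= 1.0:
        raise ValueError("stall fractions must sum to a value in [0, 1]")
    return issue_width * (1.0 - total_stall)


# Hypothetical stall fractions for a dual-issue core.
baseline = modeled_ipc(2, dep_stall=0.25, mem_stall=0.15, branch_stall=0.10)

# Same core with branch stalls cut by 30%, as a better predictor might achieve.
improved = modeled_ipc(2, dep_stall=0.25, mem_stall=0.15, branch_stall=0.07)

print(f"baseline IPC: {baseline:.2f}")
print(f"improved IPC: {improved:.2f}")
```

Because the model is linear, each stall source can be varied in isolation to see how much IPC a given enhancement could recover under these assumptions.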
Potential evolution includes further pipeline widening (e.g., tri-issue), incorporation of a narrow out-of-order scheduling window for deeper instruction-level parallelism, more sophisticated L1 prefetching in HPDCache, and hierarchical memory (L2/vectors/TM support) (Tedeschi et al., 20 Apr 2025). A plausible implication is that such moderate, targeted enhancements can sustain high efficiency even as complexity grows, as evidenced by CVA6S+'s area and energy efficiency leadership in comparative studies (Fu et al., 30 May 2025).