CVA6S+: Dual-Issue Superscalar RISC-V Core

Updated 20 March 2026

CVA6S+ is an open-source, dual-issue in-order superscalar RISC-V core that enhances IPC and energy efficiency with advanced microarchitectural features.
It employs a six-stage pipeline with increased fetch bandwidth and supports dual dispatch for integer and floating-point operations, improving performance in automotive and industrial applications.
Combining a two-level branch predictor, register renaming, and the non-blocking, high-bandwidth HPDCache yields up to a 43.5% IPC gain and a 74.1% average memory-bandwidth improvement while maintaining area efficiency.

CVA6S+ is an open-source, dual-issue, in-order superscalar RISC-V processor core that builds on the CVA6 and CVA6S architectures. Designed to meet the performance demands of high-end embedded domains such as automotive and industrial control, CVA6S+ integrates several microarchitectural features: improved branch prediction, register renaming, enhanced operand forwarding, and a high-bandwidth L1 D-cache (HPDCache) that services memory requests out of order. Together, these enhancements target higher instructions per cycle (IPC) and better area and energy efficiency within the constraints of industrial VLSI flows and open platforms (Tedeschi et al., 20 Apr 2025, Fu et al., 30 May 2025).

1. Pipeline Architecture and Microarchitectural Enhancements

CVA6S+ extends the six-stage pipeline of the baseline CVA6 (PC Generation, Instruction Fetch, Decode, Issue, Execute, Commit, with branch resolution in Execute) while increasing the fetch, decode, and issue bandwidth to sustain two-wide issue. The front end fetches up to two 32-bit or four 16-bit compressed instructions per cycle over a 64-bit fetch bus and stores them in a 16-entry fetch buffer. Decode and issue hardware are duplicated, supporting dual dispatch of both integer and floating-point operations. The Execute stage comprises two integer ALUs and a shared FPU/writeback datapath.
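The fetch-width statement above (two 32-bit or four 16-bit instructions per 64-bit fetch word) can be illustrated with a small sketch. It relies only on the standard RISC-V length encoding, in which a 16-bit parcel whose two low bits are `11` starts a 32-bit instruction and any other value is a compressed instruction. The function name `split_fetch_word`, the little-endian parcel order, and the restriction to 16- and 32-bit encodings are assumptions made for illustration, not details taken from the CVA6S+ RTL.

```python
from typing import List, Tuple


def split_fetch_word(word: int) -> Tuple[List[Tuple[int, int]], int]:
    """Split a 64-bit fetch word into (instruction, length_in_bits) pairs.

    Returns the complete instructions found, plus the number of trailing
    16-bit parcels (0 or 1) that start a 32-bit instruction whose upper
    half lies in the next fetch word.
    """
    # Four 16-bit parcels, lowest address first (assumed little-endian layout).
    parcels = [(word >> (16 * i)) & 0xFFFF for i in range(4)]
    instrs: List[Tuple[int, int]] = []
    i = 0
    while i < 4:
        if parcels[i] & 0b11 == 0b11:        # 32-bit encoding
            if i + 1 >= 4:                   # upper half not fetched yet
                return instrs, 1
            instrs.append(((parcels[i + 1] << 16) | parcels[i], 32))
            i += 2
        else:                                # 16-bit compressed encoding
            instrs.append((parcels[i], 16))
            i += 1
    return instrs, 0


NOP32 = 0x00000013   # addi x0, x0, 0
C_NOP = 0x0001       # c.nop

print(split_fetch_word((NOP32 << 32) | NOP32))   # two 32-bit instructions
print(split_fetch_word(0x0001_0001_0001_0001))   # four compressed instructions
```

A 32-bit instruction that begins in the last parcel straddles two fetch words, which is one reason a fetch buffer sits between the fetch bus and the decoders.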

The design remains strictly in-order, with no out-of-order instruction scheduling beyond the HPDCache subsystem. Register renaming, implemented via a 32-entry register-alias table and a free-list, eliminates write-after-write (WAW) hazards without requiring a dynamically scheduled instruction window. ALU-to-ALU forwarding enables zero-latency operand bypass within the same cycle, minimizing serialization from back-to-back integer operations (Tedeschi et al., 20 Apr 2025, Fu et al., 30 May 2025).

2. Branch Prediction and Hazard Mitigation

Branch prediction in CVA6S+ is based on a two-level local predictor (Yeh–Patt style): a 128-entry branch-target buffer in which each entry maintains a 3-bit local history that indexes a pattern history table. It replaces the one-level bimodal predictor of CVA6, reducing branch-misprediction stall cycles by approximately 30% and yielding up to 4.6% additional IPC on Embench-IoT workloads. The expected misprediction stall can be formalized as:

$C_{\rm mis} = N_{\rm br} \times P_{\rm mis} \times L_{\rm flush}$

where $N_{\rm br}$ is the branch count, $P_{\rm mis}$ the misprediction rate, and $L_{\rm flush}$ the flush penalty in cycles (Tedeschi et al., 20 Apr 2025).
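To make the predictor structure and the stall formula concrete, the following is a minimal behavioural sketch of a two-level local predictor with 128 entries and 3 bits of local history, followed by an evaluation of $C_{\rm mis}$. Several details are assumptions chosen for illustration and are not stated in the source: the use of 2-bit saturating counters, a separate pattern history table per entry, the index function, the weakly-not-taken reset state, and the 5-cycle flush penalty.

```python
class TwoLevelLocalPredictor:
    """Two-level local predictor: per-entry history selects a 2-bit counter."""

    def __init__(self, entries: int = 128, history_bits: int = 3):
        self.entries = entries
        self.hist_mask = (1 << history_bits) - 1
        self.history = [0] * entries
        # Assumed organisation: one small PHT per entry, counters start at 1
        # (weakly not-taken).
        self.pht = [[1] * (1 << history_bits) for _ in range(entries)]

    def _index(self, pc: int) -> int:
        return (pc >> 1) % self.entries      # assumed index: drop bit 0

    def predict(self, pc: int) -> bool:
        e = self._index(pc)
        return self.pht[e][self.history[e]] >= 2

    def update(self, pc: int, taken: bool) -> None:
        e = self._index(pc)
        h = self.history[e]
        c = self.pht[e][h]
        self.pht[e][h] = min(3, c + 1) if taken else max(0, c - 1)
        self.history[e] = ((h << 1) | int(taken)) & self.hist_mask


def simulate(predictor, pc, outcomes):
    """Return the number of mispredictions for one branch at address pc."""
    mispredicts = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            mispredicts += 1
        predictor.update(pc, taken)
    return mispredicts


def misprediction_stall_cycles(n_br, p_mis, l_flush):
    """C_mis = N_br * P_mis * L_flush."""
    return n_br * p_mis * l_flush


# A loop branch that is taken three times, then falls through, repeated.
outcomes = [True, True, True, False] * 250
n_mis = simulate(TwoLevelLocalPredictor(), 0x8000_0100, outcomes)
p_mis = n_mis / len(outcomes)
L_FLUSH = 5  # hypothetical flush penalty in cycles
print(n_mis, misprediction_stall_cycles(len(outcomes), p_mis, L_FLUSH))
```

With three bits of history, every position in this four-branch pattern maps to a distinct history value, so the predictor stops mispredicting once its counters have warmed up. A one-level bimodal counter, by contrast, would keep mispredicting the loop exit.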

The register renaming scheme, by dynamically tracking most recent writers in a one-to-one architectural-to-physical register mapping, eliminates WAW hazards across dual-issued instructions. Physical register mapping is resolved in the decode stage with a single-cycle penalty, avoiding additional pipeline depth. Operand forwarding directly connects ALU1 to ALU2, ensuring the second of two dependent ALU operations issued in the same cycle can consume results without additional delay. Synthesis results report less than 0.5% clock frequency degradation due to the bypass logic.
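The renaming mechanism described above can be sketched as a register-alias table plus a free list. This is a simplified behavioural model, not the CVA6S+ implementation: the class name `RenameTable`, the pool of 48 physical registers, and the release policy are hypothetical; only the 32-entry alias table and the free list come from the text.

```python
from collections import deque


class RenameTable:
    """Minimal register-alias table (RAT) with a free list."""

    def __init__(self, num_arch: int = 32, num_phys: int = 48):
        # Initially architectural register i maps to physical register i.
        self.rat = list(range(num_arch))
        # Remaining physical registers are free (num_phys is hypothetical).
        self.free = deque(range(num_arch, num_phys))

    def rename(self, rd: int, rs1: int, rs2: int):
        """Rename one instruction.

        Returns (phys_rd, phys_rs1, phys_rs2, old_phys_rd). Sources are read
        before the destination mapping is updated, so an instruction that
        reads and writes the same register sees the previous value.
        """
        p1, p2 = self.rat[rs1], self.rat[rs2]
        if rd == 0:                          # x0 is hard-wired to zero
            return 0, p1, p2, None
        if not self.free:
            raise RuntimeError("no free physical register; rename must stall")
        old = self.rat[rd]
        new = self.free.popleft()
        self.rat[rd] = new
        return new, p1, p2, old

    def release(self, phys: int) -> None:
        """Recycle a physical register once its old value is no longer needed."""
        self.free.append(phys)


rt = RenameTable()
# Two instructions issued together that both write x5 (a WAW pair):
first = rt.rename(rd=5, rs1=1, rs2=2)    # add x5, x1, x2
second = rt.rename(rd=5, rs1=3, rs2=4)   # add x5, x3, x4
# A later consumer of x5 reads the mapping left by the second writer:
consumer = rt.rename(rd=6, rs1=5, rs2=5)
print(first, second, consumer)
```

Because the two writers receive different physical destinations, they can complete in either order without corrupting the architectural value of x5, which is the WAW elimination the text refers to.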

3. Memory Subsystem: HPDCache Integration

Distinct from its predecessor's blocking L1 D-cache, CVA6S+ employs the OpenHW Core-V HPDCache. HPDCache is a three-stage, non-blocking pipelined L1 D-cache with multiple request ports, a deep MSHR pool, and a simple hardware prefetcher. It services loads and stores out-of-order, supporting both write-through and write-back policies, as well as atomic memory operations. The comparison to the legacy cache subsystem highlights improved memory-level parallelism and throughput:

Bandwidth improvement: 74.1% average (RaiderSTREAM benchmark)
D-cache area reduction: 19% (0.095 mm² legacy vs. 0.077 mm² HPDCache)
Cache miss rates: unchanged for sequential access patterns, while effective throughput improves for irregular access patterns due to out-of-order completion

Bandwidth is calculated as

$BW = \frac{\text{Bytes transferred}}{\text{Cycles}}$

with relative improvement,

$\Delta BW\,(\%) = \frac{BW_{\rm HPD} - BW_{\rm legacy}}{BW_{\rm legacy}} \times 100\%$

(Tedeschi et al., 20 Apr 2025).
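The two bandwidth formulas translate directly into code. In the sketch below, the byte and cycle counts are hypothetical placeholders rather than measured RaiderSTREAM results; they only show how the relative improvement is derived from two runs that move the same amount of data.

```python
def bandwidth(bytes_transferred: float, cycles: float) -> float:
    """BW = bytes transferred / cycles (bytes per cycle)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return bytes_transferred / cycles


def relative_improvement_pct(bw_new: float, bw_legacy: float) -> float:
    """Delta BW (%) = (BW_new - BW_legacy) / BW_legacy * 100."""
    if bw_legacy <= 0:
        raise ValueError("legacy bandwidth must be positive")
    return (bw_new - bw_legacy) / bw_legacy * 100.0


# Hypothetical run: the same 1 MiB transfer on both cache subsystems.
BYTES = 1 << 20
bw_legacy = bandwidth(BYTES, cycles=400_000)   # hypothetical cycle count
bw_hpd = bandwidth(BYTES, cycles=250_000)      # hypothetical cycle count
print(f"legacy: {bw_legacy:.3f} B/cycle, HPDCache: {bw_hpd:.3f} B/cycle")
print(f"improvement: {relative_improvement_pct(bw_hpd, bw_legacy):.1f}%")
```

For a fixed amount of data, the relative bandwidth improvement depends only on the ratio of cycle counts, so a 74.1% improvement corresponds to the legacy cache needing about 1.741 times as many cycles.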

4. Quantitative Performance and Efficiency Metrics

Performance evaluation on Embench-IoT and CoreMark kernels demonstrates substantial uplift:

Core	Avg. IPC vs. CVA6 (Embench-IoT)	CoreMark/MHz	Area (core+L1, mm²)	Max Freq (MHz)
CVA6	baseline	2.83	0.175	1,090
CVA6S	+29.6%	3.41	—	—
CVA6S+	+43.5%	3.69	0.191	1,095

The area overhead of CVA6S+ with HPDCache is 9.3% over scalar CVA6, while the pipeline logic grows by ≈28.6%. There is no clock-frequency loss; in fact, SRAM macro re-optimization allows a marginal frequency gain. In a separate implementation experiment (Fu et al., 30 May 2025), CVA6S+ exhibits a 34.4% IPC gain and 6% area increase, achieving leading area efficiency (GOPS/mm²) against both scalar CVA6 and the out-of-order C910.

Area and energy efficiency are defined as follows, with $f_{\mathrm{clk}}$ in Hz:

$\mathrm{AreaEfficiency\,(GOPS/mm^2)} = \frac{\mathrm{IPC} \times f_{\mathrm{clk}} \times 10^{-9}}{\mathrm{Area~(mm^2)}}$

$\mathrm{EnergyEfficiency\,(GOPS/W)} = \frac{\mathrm{IPC} \times f_{\mathrm{clk}} \times 10^{-9}}{\mathrm{Power~(W)}}$

A power evaluation at 900 MHz (matmul-int kernel) gives $\approx$ 9.0 GOPS/W for CVA6S+ and 8.9 GOPS/W for CVA6 (Fu et al., 30 May 2025).
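The efficiency definitions can be evaluated with a few lines of code. The sketch below takes the clock frequency in Hz. The area and frequency in the usage example are the CVA6S+ figures from the table above, while the IPC and power values are hypothetical placeholders, since the source reports IPC only relative to CVA6.

```python
def gops(ipc: float, f_clk_hz: float) -> float:
    """Throughput in giga-operations per second: IPC * f_clk * 1e-9."""
    return ipc * f_clk_hz * 1e-9


def area_efficiency(ipc: float, f_clk_hz: float, area_mm2: float) -> float:
    """Area efficiency in GOPS/mm^2."""
    if area_mm2 <= 0:
        raise ValueError("area must be positive")
    return gops(ipc, f_clk_hz) / area_mm2


def energy_efficiency(ipc: float, f_clk_hz: float, power_w: float) -> float:
    """Energy efficiency in GOPS/W."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return gops(ipc, f_clk_hz) / power_w


HYPOTHETICAL_IPC = 1.0       # placeholder; absolute IPC is not given above
HYPOTHETICAL_POWER_W = 0.1   # placeholder power draw

print(area_efficiency(HYPOTHETICAL_IPC, 1095e6, 0.191))
print(energy_efficiency(HYPOTHETICAL_IPC, 900e6, HYPOTHETICAL_POWER_W))
```

Both metrics share the same numerator, so for a fixed workload and clock the comparison between two cores reduces to the ratio of their IPC divided by the ratio of their area or power.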

5. Comparative Analysis, Trade-offs, and SoC Integration

CVA6S+ is positioned as a high-IPC, area-/energy-efficient solution for RISC-V, offering a middle ground between minimal-complexity scalar cores and resource-intensive, out-of-order superscalar cores such as C910. The main drivers of performance in CVA6S+ are:

Dual-issue front end: +20% IPC uplift
Two-level branch predictor: $\approx$ 30% reduction in branch stall cycles ($\approx$ 10% IPC)
Register renaming and ALU bypass: additional $\approx$ 6% IPC
HPDCache: 74.1% bandwidth boost enables full dual-issue utilization on memory-intensive kernels

The cost is limited to ≈9.3% total area overhead vs. scalar CVA6, with no frequency loss and a decrease in D-cache area (due to HPDCache macro efficiency). Complexity increase in branch prediction, renaming, and cache subsystems is offset by re-use of verified open-source infrastructure (Tedeschi et al., 20 Apr 2025).

SoC integration is validated via the open-source Cheshire platform, with a 64-bit AXI crossbar, 64 KB 2-way L1 I/D caches, and a 512 KB L2 last-level cache, ensuring RISC-V compliance (PLIC/CLINT interrupt, debug, memory interfaces) (Fu et al., 30 May 2025).

6. Analytical Modeling and Future Directions

Theoretical IPC in a dual-issue in-order design is limited by three principal stall sources. The superscalar IPC model is:

$\mathrm{IPC} \approx W_{\rm issue} \times (1 - S_{\rm dep} - S_{\rm mem} - S_{\rm br})$

where $W_{\rm issue}=2$, $S_{\rm dep}$ is the dependency stall fraction, $S_{\rm mem}$ is the memory stall fraction, and $S_{\rm br}$ is the branch stall fraction.

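For CVA6S+, with $S_{\rm dep}=0.10$, $S_{\rm mem}=0.06$, and $S_{\rm br}=0.08$,

$\mathrm{IPC_{CVA6S+}} \approx 2\times(1 - 0.10 - 0.06 - 0.08) = 2 \times 0.76 = 1.52$

This first-order estimate is somewhat above the normalized IPC of 1.435 implied by the measured +43.5% uplift over CVA6, and it shows how each stall source subtracts from the ideal two-wide issue rate.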
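The stall-fraction model is simple enough to explore numerically. The sketch below evaluates it for the stall fractions quoted above and then shows how the estimate moves when one stall source is removed entirely. The function name `superscalar_ipc` and the what-if scenarios are illustrative assumptions, not results from the cited papers.

```python
def superscalar_ipc(w_issue: float, s_dep: float, s_mem: float, s_br: float) -> float:
    """First-order model: IPC ~ W_issue * (1 - S_dep - S_mem - S_br)."""
    total_stall = s_dep + s_mem + s_br
    if not 0.0 <= total_stall <= 1.0:
        raise ValueError("stall fractions must sum to a value in [0, 1]")
    return w_issue * (1.0 - total_stall)


# Stall fractions from the text, in the order (dependency, memory, branch).
base = {"s_dep": 0.10, "s_mem": 0.06, "s_br": 0.08}
print("modelled IPC:", superscalar_ipc(2, **base))

# What-if: remove one stall source at a time (hypothetical scenarios).
for name in base:
    scenario = dict(base, **{name: 0.0})
    print(f"without {name}: {superscalar_ipc(2, **scenario):.2f}")
```

In this model each percentage point of stall fraction costs $W_{\rm issue}/100$ IPC regardless of its source, so the largest stall fraction is always the most valuable one to reduce.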

Potential evolution includes further pipeline widening (e.g., tri-issue), incorporation of a narrow out-of-order scheduling window for deeper instruction-level parallelism, more sophisticated L1 prefetching in HPDCache, and hierarchical memory (L2/vectors/TM support) (Tedeschi et al., 20 Apr 2025). A plausible implication is that such moderate, targeted enhancements can sustain high efficiency even as complexity grows, as evidenced by CVA6S+'s area and energy efficiency leadership in comparative studies (Fu et al., 30 May 2025).


References

Tedeschi et al., CVA6S+: A Superscalar RISC-V Core with High-Throughput Memory Architecture (20 Apr 2025)

Fu et al., Ramping Up Open-Source RISC-V Cores: Assessing the Energy Efficiency of Superscalar, Out-of-Order Execution (30 May 2025)
