Dual-Issue Execution in RISC-V Cores
- Dual-issue execution is a microarchitectural technique that allows simultaneous execution of integer and floating-point instructions to improve IPC and energy efficiency.
- It leverages decoupled pipelines and minimal register file modifications to achieve up to 2× IPC improvements and significant energy gains for balanced workloads.
- Innovations such as SSR, FREP, and FIFO-based synchronization streamline compiler and hardware coordination, reducing overhead while optimizing instruction scheduling.
Dual-issue execution refers to a microarchitectural capability that allows a processor to issue and begin executing two instructions per cycle, subject to defined pairing and resource constraints. In the context of energy- and area-efficient core designs—such as in-order RISC-V processors and distributed quantum controllers—dual-issue techniques have advanced both performance and energy efficiency, especially for workloads blending integer and floating-point computations or highly parallel primitives.
1. Architectural Motivation and Design Principles
Modern accelerator systems demand extremely compact, energy-frugal processing elements (PEs). Single-issue in-order cores, such as those based on RV32G, are efficient but can leave pipeline resources underutilized when workloads interleave integer and floating-point (FP) instructions—the classic opportunity for dual-issue. By overlapping the execution of one integer and one FP instruction, a lean dual-issue core can approach twice the instructions per cycle (IPC) of a single-issue baseline, depending on the workload's instruction mix and dataflow dependencies.
Key design constraints include:
- Area and energy minimization: Avoid costly register file (RF) widening or addition of superscalar issue logic.
- Decoupled resources: Independent integer and FP execution pipelines, each with its own RF and ALU, enabling parallel progress when dataflow allows.
- Workload fit: Gains are maximal for mixed integer/FP workloads; purely single-class workloads obtain no benefit.
2. Dual-Issue in RISC-V Cores: From Pseudo Dual-Issue to True Mixed-Type Dual-Issue
Early area-efficient dual-issue mechanisms leveraged the natural separation of integer and FP functionality in designs such as the “Snitch” core. “Pseudo” dual-issue was achieved via ISA extensions—specifically, Stream Semantic Registers (SSR) and the Floating-Point Repetition (FREP) instruction—that decouple memory streaming and looped FP execution from the integer pipeline. With SSR/FREP, as in Snitch, integer and FP segments advance with minimal contention, provided they are independent, yielding up to a 2× energy-efficiency improvement and a 44% throughput gain at 3.2% area overhead (Zaruba et al., 2020).
True dual-issue of mixed sequences, however, requires handling cross-domain dataflow and tightly coupling the scheduling of integer and FP instructions. The COPIFT methodology (Colagrande and Benini) enables this by:
- Partitioning the dataflow graph into integer and FP subgraphs with minimal cross-edges.
- Tiling and software pipelining to buffer inter-thread dependencies to memory.
- Employing SSR to replace explicit FP memory operations and using FREP loops to autonomously issue FP instructions from a hardware buffer.
This approach, augmented with limited RISC-V ISA extensions that ensure cross-domain operations are handled via explicit memory passage and not direct RF accesses, orchestrates true mixed-type dual-issue. Constraints on pairing (one integer and one FP op, with no cross-use of RFs) are carefully enforced to eliminate structural and data hazards. COPIFT achieves an average IPC speedup of 1.47× (peak 2.05×) and a 1.37× energy improvement on diverse benchmarks, with minimal impact on core area (<2%) and timing margin (Colagrande et al., 26 Mar 2025).
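The partitioning step can be sketched in Python. The toy IR below (op names, domain labels, and edge list) is invented for illustration and is not the paper's actual representation; the point is identifying cross-domain edges, which COPIFT must route through memory rather than through register files.

```python
# Toy dataflow graph: each op is tagged with its domain.
ops = {
    "i0": "int", "i1": "int",   # e.g. index/address arithmetic
    "f0": "fp",  "f1": "fp",    # e.g. FP multiply-accumulate chain
}
edges = [("i0", "i1"), ("i1", "f0"), ("f0", "f1")]  # dataflow dependencies

def cross_edges(ops, edges):
    # Edges whose endpoints lie in different domains must be
    # buffered through memory in the COPIFT scheme.
    return [(src, dst) for (src, dst) in edges if ops[src] != ops[dst]]

handoffs = cross_edges(ops, edges)  # the single INT-to-FP handoff
```

A good partition minimizes the number of such handoffs, since each one costs a store/load pair (or, in COPIFTv2, a FIFO transaction).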
3. Microarchitectural and ISA Extensions
The essential microarchitectural mechanisms for dual-issue execution on lightweight RISC-V cores include:
- Dedicated SSR units: Streaming engines decouple FP memory access from the integer pipeline, supporting up to three SSR engines per core.
- FREP micro-loop buffers: Hardware supports buffering of up to 16 FP instructions for autonomous cycling by the FP sequencer, freeing integer fetch/decode stages.
- Minor decode logic adjustments: To detect and manage new custom-1 dual-issue opcodes, steering them into the correct buffers and pipelines.
- No RF port widening: FP and integer RFs retain independent ports, leveraging their physical separation.
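A toy model of the FREP micro-loop buffer follows; the 16-entry capacity matches the text, while the function and instruction names are illustrative stand-ins for the hardware sequencer, not Snitch's actual logic.

```python
FREP_MAX = 16  # hardware loop-buffer capacity (per the text)

def frep_issue(fp_body, reps):
    # The FP sequencer replays the buffered body autonomously,
    # without occupying the integer fetch/decode stages.
    assert len(fp_body) <= FREP_MAX, "body exceeds loop-buffer capacity"
    issued = []
    for _ in range(reps):
        issued.extend(fp_body)
    return issued

stream = frep_issue(["fmadd.d", "fadd.d"], 3)  # 2-instruction body, 3 reps
```

While this stream issues, the integer pipeline is free to fetch and execute its own instructions in parallel.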
Key ISA changes involve:
- Cloning “D” extension instructions (FP-to-integer conversions, comparisons) into a custom opcode space, strictly partitioning RF accesses, and enforcing memory-passing of operands for cross-type transfers.
- Pairing sets are limited to {RV32I op} + {pure FP op via SSR/FREP}, where type-mixing is allowed only if all cross-domain dependencies are memory-buffered.
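The pairing rule above can be expressed as a small predicate; the opcode sets here are illustrative samples, not complete ISA listings.

```python
# Sample opcodes per class (illustrative, not exhaustive).
INT_OPS = {"add", "slli", "bne"}
FP_OPS = {"fmadd.d", "fmul.d", "fadd.d"}

def can_pair(a, b):
    # Legal pair: exactly one RV32I op and one pure FP op.
    # Cross-domain values must pass through memory, never across RFs.
    ops = {a, b}
    return bool(ops & INT_OPS) and bool(ops & FP_OPS)

assert can_pair("add", "fmadd.d")       # one of each class: legal
assert not can_pair("add", "bne")       # two integer ops: illegal
assert not can_pair("fmul.d", "fadd.d") # two FP ops: illegal
```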
4. Compiler and Programming Methodologies
Dual-issue support moves a significant portion of complexity into the software toolchain. For COPIFT, the compiler performs:
- Dataflow analysis and partitioning: Divides kernels into integer-only and FP-only threads, minimizing inter-thread dependency edges.
- Loop tiling and software pipelining: Tiles loops to ensure cross-thread dependencies are buffered and to expose sufficient ILP for dual-issue.
- Autonomous code generation: Sets up SSR/FREP prologues and epilogues, emits custom-1 opcodes for cross-type operations, and aligns instruction blocks for optimal hardware utilization.
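The tiling step can be sketched as follows; the tile size, offset computation, and both thread bodies are hypothetical stand-ins for real kernel code, showing only how cross-thread dependencies get buffered through memory.

```python
TILE = 4  # illustrative tile size

def int_thread(indices, buf):
    # Integer side: compute operands (here, byte offsets) for the
    # FP side and buffer them to memory.
    for j, i in enumerate(indices):
        buf[j] = i * 8

def fp_thread(buf, acc):
    # FP side: consume the buffered tile after the integer side finishes.
    for off in buf:
        acc += off * 0.5  # stand-in for real FP work
    return acc

buf = [0] * TILE
int_thread(range(TILE), buf)
total = fp_thread(buf, 0.0)
```

Software pipelining then overlaps the integer pass of tile *k+1* with the FP pass of tile *k*, which is what exposes ILP for dual-issue.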
COPIFTv2 improves programmability and eliminates software pipelining and tiling by integrating hardware FIFOs (one for INT-to-FP, one for FP-to-INT) for direct, fine-grained synchronization and data transfer. Register x31 is repurposed as a queue port, enabling instructions to push or pop data to/from queues, with blocking semantics that automatically uphold ordering and synchronization without additional buffering logic. This substantially reduces overhead and programming complexity, while maintaining performance and efficiency gains (Colagrande et al., 25 Jan 2026).
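A host-side Python sketch of the two blocking FIFOs follows; the depth of 4 and the push/pop blocking semantics follow the text, while the threads and arithmetic are placeholders for the integer and FP pipelines, not hardware behavior.

```python
import queue
import threading

# Two shallow hardware FIFOs, modeled as bounded blocking queues.
int_to_fp = queue.Queue(maxsize=4)
fp_to_int = queue.Queue(maxsize=4)

def int_pipeline(data, out):
    for x in data:
        int_to_fp.put(x * 2)          # "push to x31": blocks when FIFO is full
    for _ in data:
        out.append(fp_to_int.get())   # "pop from x31": blocks when empty

def fp_pipeline(n):
    for _ in range(n):
        v = int_to_fp.get()
        fp_to_int.put(v + 0.5)        # stand-in for an FP computation

results = []
t = threading.Thread(target=fp_pipeline, args=(4,))
t.start()
int_pipeline([1, 2, 3, 4], results)
t.join()
# results == [2.5, 4.5, 6.5, 8.5]
```

The blocking semantics enforce ordering and synchronization automatically, which is exactly what lets COPIFTv2 drop explicit tiling and software pipelining.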
5. Performance, Energy, and Area Characterization
The table summarizes key results for dual-issue RISC-V cores on mixed integer/FP workloads:
| Approach | Area Overhead | Avg. Speedup | Peak IPC | Energy Improvement |
|---|---|---|---|---|
| Snitch + SSR/FREP | +3.2% | 1.44× (throughput) | 1.0 | 2× |
| COPIFT | <2% | 1.47× | 1.75 | 1.37× |
| COPIFTv2 (FIFOs) | <1% | 1.19× over COPIFT | 1.81 | 1.21× over COPIFT |
Peak gains are observed when integer and FP workloads are balanced; speedup diminishes for strongly imbalanced workloads (instruction-type imbalance factor near 1), and purely single-class kernels see no benefit. Energy improvements are realized mainly due to reduced idle periods for integer and FP resources and elimination of explicit loads/stores via SSR. Overhead is negligible relative to baseline area and power.
A useful analytical bound for speedup as a function of the instruction-type imbalance factor β = max(N_INT, N_FP)/(N_INT + N_FP) is

S(β) ≤ 1/β,

which tracks realized improvements across kernels (Colagrande et al., 26 Mar 2025).
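Assuming the imbalance factor is defined as the majority class's share of instructions, β = max(N_INT, N_FP)/(N_INT + N_FP) (an assumption for illustration; the paper's exact formulation may differ), the bound can be checked numerically:

```python
def imbalance(n_int, n_fp):
    # Assumed definition: fraction of instructions in the majority class.
    return max(n_int, n_fp) / (n_int + n_fp)

def speedup_bound(n_int, n_fp):
    # Ideal dual-issue retires the minority class entirely in parallel,
    # so cycles are bounded below by the majority-class count.
    return 1.0 / imbalance(n_int, n_fp)

assert speedup_bound(100, 100) == 2.0  # balanced mix: ideal 2x
assert speedup_bound(100, 0) == 1.0    # single-class kernel: no benefit
```

This matches the qualitative behavior described above: the bound peaks at 2× for balanced mixes and collapses to 1× as β approaches 1.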
6. Trade-Offs, Limitations, and Generalization
Dual-issue schemes remain sensitive to instruction mix and kernel structure:
- Instruction mix: Maximal speedup is obtained on workloads interleaving comparable counts of integer and FP instructions (well-balanced ILP).
- Block size: Tiling and SSR setup overheads must be amortized over sufficiently large kernel blocks; on small inputs, setup costs dominate.
- SSR/FREP constraints: SSR supports only affine or statically configured indirect memory streaming of up to four dimensions; irregular accesses require integer-side prefetch and buffering.
- Buffer limits: FREP is restricted to inner loops of up to 16 instructions, requiring manual partitioning of large FP kernels. COPIFTv2 FIFOs have small depth (e.g., 4), introducing potential stalls if data-transfer rates are not well-matched.
- Compiler complexity: While COPIFTv2 simplifies the software interface, users must still mark queue accesses, but this is automatable and less error-prone than full dataflow tiling.
A plausible implication is that dual-issue methodologies, especially those based on SSR/FREP or queue-based hardware, represent an attractive balance point for integrating high ILP into tiny, energy-constrained in-order cores. They avoid the area and timing costs of superscalar issue windows while significantly boosting utilization on mixed workloads.
7. Dual-Issue Beyond CPUs: Distributed Quantum Systems
Dual-issue and parallel instruction execution principles extend beyond classical processors. In distributed quantum systems, hierarchical instruction networks with both bitmap and ID-based addressing allow the central controller to issue a single instruction to multiple node controllers (NCs) in the same cycle, enabling “dual issue” of identical quantum gate operations. This hardware-software co-design accelerates execution of quantum circuits by grouping and scheduling parallelizable instructions at compile time and leveraging issue hardware to dispatch to multiple “lanes.”
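A minimal sketch of bitmap-based issue follows; the node-controller count and instruction encoding are invented for illustration, but the mechanism matches the text: one instruction, dispatched in a single step to every NC whose bit is set.

```python
def broadcast(instruction, bitmap, num_ncs=8):
    # The central controller issues `instruction` to every node
    # controller selected by the bitmap, in the same cycle.
    targets = [nc for nc in range(num_ncs) if bitmap & (1 << nc)]
    return {nc: instruction for nc in targets}

issued = broadcast("H q0", 0b00001111)  # NCs 0-3 apply the same gate
```

ID-based addressing covers the complementary case of targeting a single NC; the compiler groups parallelizable gates so that one broadcast replaces many sequential issues.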
Results demonstrate that compiler+hardware co-design achieves average speedups of up to 16.5× across quantum algorithm benchmarks, peaking at 56.2× for maximally parallel workloads. Speedup depends both on the hardware configuration (parallelizability factor and bus width) and on the dependency structure of the quantum circuit (Ronde et al., 18 Nov 2025).
The general principle—broadly applicable to parallel control fabrics—is that dual-issue and multi-issue execution schemes, coupled with compiler transformations and lightweight hardware support, can unlock substantial performance improvements in both classical and quantum domains, without incurring the prohibitive area, energy, or design complexity overheads of traditional superscalar architectures.