COPIFTv2: Dual-Issue RISC-V Enhancement
- COPIFTv2 is an architectural and programming refinement that optimizes lightweight RISC-V cores by using queue-based synchronization for efficient dual-issue execution.
- It simplifies the programming model by replacing complex software-driven synchronization with API-driven thread management and explicit queue operations.
- COPIFTv2 achieves significant improvements in IPC and energy efficiency with negligible area overhead, providing practical benefits for large-scale ML accelerators.
COPIFTv2 is an architectural and programming refinement for lightweight RISC-V cores, designed to maximize the efficiency and programmability of dual-issue execution under stringent area and energy constraints. Addressing the limitations of the earlier COPIFT approach, COPIFTv2 introduces a hardware-based, queue-centric mechanism for synchronizing and transferring data between integer and floating-point (FP) threads. This enables near-peak dual-issue performance in tiny in-order processors such as Snitch, with substantially reduced software complexity and overhead, directly addressing scalability requirements in large-scale machine learning accelerators (Colagrande et al., 25 Jan 2026).
1. Historical Context and Motivation
The original COPIFT (2015–2025) sought to exploit the decoupled integer/FP execution in lightweight RISC-V designs by orchestrating batches of work between two threads (an I-thread and an F-thread), using software pipelining and memory buffers for synchronization. Despite achieving up to 1.75× instructions per cycle (IPC) on mixed workloads, COPIFT suffered from several intrinsic bottlenecks:
- Complex code transformations: Required multi-buffer tiling, explicit software pipelining, and batch-size tuning.
- High software overheads: Relied on frequent spill/reload of synchronization data to memory, increasing latency and power.
- Limited fine-grain communication: Dependencies could only resolve at batch boundaries, delaying critical-path dataflow.
The target architecture—Snitch core—features a single-issue, in-order RISC-V microarchitecture, with a Tiny Floating-Point Subsystem (FPSS). It implements pseudo–dual-issue by dispatching all instructions through the integer pipeline, offloading FP operations to the FPSS. Core design constraints enforce area budgets below 0.1 mm² and power below 1 mW at 1 GHz, precluding out-of-order logic or complex renaming (Colagrande et al., 25 Jan 2026).
2. Architectural Innovations
COPIFTv2 eliminates software-centric synchronization by embedding two lightweight, hardware-based first-in/first-out (FIFO) queues within the core, facilitating direct, order-preserving communication and synchronization between integer and FP threads:
- Queue Configuration:
- Two queues per core: I2F (integer-to-FP) and F2I (FP-to-integer)
- Each queue: 8-entry depth (configurable), 32/64-bit width, implemented in standard-cell memory (SCM)
- Control logic: head/tail pointers, full/empty flags, and modulo (wrap-around) counters
- Operation Semantics:
- Enqueue ("push") into the queue is blocked if full; dequeue ("pop") blocks if empty, enforcing natural producer-consumer synchronization.
- Integer (I) thread and FP (F) thread communicate exclusively via these queues (no indirect memory buffers required).
- At pipeline integration points:
- Integer pipeline write-back: a write to the queue-mapped destination register with the queue CSR enabled triggers an I2F push.
- FPSS write-back: a write to the "virtual" queue-mapped destination register (rd) pushes onto F2I.
- Decode: reading the queue-mapped source register (rs) pops from the corresponding queue.
This minimal hardware extension introduces area overhead of ≲1%, minimally disrupts timing (no impact on 1 GHz critical path), and requires only trivial handshake logic (Colagrande et al., 25 Jan 2026).
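The queue control logic described above can be sketched as a small software model. The following is a minimal, illustrative Python class (names such as `FifoQueueModel` are ours, not from the COPIFTv2 RTL) with head/tail pointers, full/empty flags, and modulo wrap-around counters, mirroring the described 8-entry configuration:

```python
class FifoQueueModel:
    """Software model of one COPIFTv2 hardware FIFO (I2F or F2I).

    Mirrors the described control logic: head/tail pointers,
    full/empty flags, and modulo (wrap-around) counters.
    """

    def __init__(self, depth=8):
        self.depth = depth
        self.mem = [None] * depth   # SCM storage in hardware
        self.head = 0               # next entry to pop
        self.tail = 0               # next free slot to push into
        self.count = 0              # occupancy; flags derive from it

    @property
    def full(self):
        return self.count == self.depth

    @property
    def empty(self):
        return self.count == 0

    def push(self, value):
        """Enqueue; in hardware the producer stalls while the queue is full."""
        if self.full:
            return False            # model the stall as a refused push
        self.mem[self.tail] = value
        self.tail = (self.tail + 1) % self.depth  # modulo wrap-around
        self.count += 1
        return True

    def pop(self):
        """Dequeue; in hardware the consumer stalls while the queue is empty."""
        if self.empty:
            return None             # model the stall as a refused pop
        value = self.mem[self.head]
        self.head = (self.head + 1) % self.depth
        self.count -= 1
        return value
```

In hardware, a push to a full queue (or a pop from an empty one) simply back-pressures the corresponding pipeline stage; the model returns a failure status instead of stalling.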
3. Programming Model
COPIFTv2 replaces the original's complex, transformation-heavy software approach with a simplified model based on API-driven thread management and intra-iteration parallelization:
- Model Simplification:
- Eliminates multi-buffer tiling and inter-batch software pipelining.
- No inter-iteration scheduling or explicit modulo scheduling.
- Algorithmic Steps:
- Construct the data-flow graph (DFG) of interleaved integer and FP ops.
- Partition into I-only and F-only subgraphs.
- Independently schedule for maximum overlap.
- Replace inter-thread dependencies with explicit queue push/pop operations.
- Enclose FP subgraph within the hardware FREP loop.
- API and Synchronization:
- Thread launch/join library calls:
```c
int launch_fp_thread(void *entry_point);
void join_fp_thread();
```
- Launch configures the relevant CSR, initializes queue addresses, and forks FP execution.
- Synchronization is managed implicitly by draining the corresponding queue at join.
- Pseudocode Example:
```
// Integer thread (producer)
enable_copift_queues();
launch_fp_thread(fp_worker);
for (i = 0; i < N; i++) {
    t = compute_index(i);
    MOV x31, t;             // Push t onto I2F
}
join_fp_thread();
```

```
// FP thread (consumer)
fp_worker() {
    for (i = 0; i < N; i++) {
        FCVT.D.W ft0, x31;      // Pop from I2F into ft0
        ft1 = fmul(ft0, CONST);
        FMV.X.D x31, ft1;       // Push ft1 onto F2I
    }
    return;
}
```
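The producer/consumer pattern above can be exercised on a host machine with a minimal Python analogue, using bounded `queue.Queue` objects in place of the I2F/F2I hardware FIFOs (blocking `put`/`get` matches the push/pop stall semantics). `compute_index`, `N`, and `CONST` are stand-ins for the kernel's actual integer work and constant, chosen only for illustration:

```python
import threading
import queue

DEPTH, N, CONST = 8, 16, 2.5
i2f = queue.Queue(maxsize=DEPTH)   # integer -> FP queue
f2i = queue.Queue(maxsize=DEPTH)   # FP -> integer queue

def compute_index(i):              # stand-in for the I-thread's integer work
    return 3 * i + 1

def int_thread():                  # producer: pushes onto I2F
    for i in range(N):
        i2f.put(compute_index(i))  # blocks when the queue is full

def fp_worker():                   # consumer: pops I2F, pushes onto F2I
    for _ in range(N):
        t = i2f.get()              # blocks when the queue is empty
        f2i.put(t * CONST)

it = threading.Thread(target=int_thread)
ft = threading.Thread(target=fp_worker)
it.start(); ft.start()

# Drain F2I while the threads run (an 8-deep queue cannot hold all N results),
# then join -- the analogue of join_fp_thread() draining the queues.
results = [f2i.get() for _ in range(N)]
it.join(); ft.join()
```

Because both queues preserve order and block at the boundaries, no other synchronization is needed, which is exactly the property the hardware FIFOs provide to the I- and F-threads.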
4. Performance Characteristics
COPIFTv2 substantially increases both throughput and energy efficiency versus COPIFT, as demonstrated with six mixed integer/FP kernels under fixed core power and frequency budgets.
| Benchmark | IPC (baseline) | IPC (COPIFT) | IPC (COPIFTv2) | Speedup (v2 vs COPIFT) | Energy gain (v2 vs COPIFT) |
|---|---|---|---|---|---|
| exp | 0.92 | 1.58 | 1.78 | 1.13× | 1.15× |
| sin | 0.88 | 1.62 | 1.81 | 1.12× | 1.17× |
| poly_lcg | 0.80 | 1.15 | 1.21 | 1.05× | 1.09× |
| dot | 0.74 | 1.42 | 1.68 | 1.18× | 1.20× |
| matmul | 0.70 | 1.55 | 1.78 | 1.15× | 1.21× |
| fft | 0.65 | 1.48 | 1.70 | 1.15× | 1.18× |
| Geomean | — | 1.48× | 1.73× | 1.19× | 1.21× |
- Peak IPC: 1.81 with COPIFTv2 (compared to 1.62 under COPIFT)
- Wall-clock speedup: Up to 1.49×
- Energy-efficiency gain: Up to 1.47× over COPIFT
- Power consumption: Remains within 5% of COPIFT, as reduced memory traffic offsets the higher datapath utilization
- Overall: Achieves nearly 90% of ideal dual-issue throughput (2.0 IPC) on in-order, area- and energy-limited silicon (Colagrande et al., 25 Jan 2026).
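The wall-clock and energy gains exceed the raw IPC improvement because COPIFTv2 also eliminates dynamic instructions (the spill/reload synchronization code). With execution time t = N_instr / (IPC · f), a short sketch makes the compounding explicit; the instruction counts below are illustrative assumptions, not figures from the paper:

```python
# Execution time of a kernel on a fixed-frequency core:
#   t = n_instr / (IPC * f)
def exec_time(n_instr, ipc, f_hz=1e9):   # 1 GHz, as in the Snitch target
    return n_instr / (ipc * f_hz)

# Hypothetical dynamic instruction counts: COPIFT executes extra
# spill/reload synchronization instructions that COPIFTv2's direct
# queue push/pops eliminate.
n_copift, ipc_copift = 1_300_000, 1.48
n_v2, ipc_v2 = 1_000_000, 1.73

speedup = exec_time(n_copift, ipc_copift) / exec_time(n_v2, ipc_v2)
ipc_only = ipc_v2 / ipc_copift
# speedup = (n_copift / n_v2) * (ipc_v2 / ipc_copift), so wall-clock
# gain exceeds the IPC ratio whenever the instruction count also shrinks
```

This is why a ~1.2× geomean IPC gain can coexist with wall-clock speedups approaching 1.5×: the two factors multiply.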
5. Trade-offs and Comparative Positioning
COPIFTv2 introduces negligible hardware overhead (≲1% area, two small queues, single CSR), no impact on core clock, and minimal complexity in control logic (simple head/tail pointers with blocking handshake). When compared to alternate approaches:
- Snitch+COPIFTv2: Delivers 1.96× IPC and 1.75× energy gain versus base single-issue Snitch.
- NVIDIA Turing SMs: Offer similar INT/FP concurrency but lack open-source PPA data for direct comparison.
- HAMSA-DI (dual-issue VLIW): COPIFTv2 achieves superior energy efficiency on mixed workloads, attributed to zero-overhead synchronization and finer-granularity communication.
Open-source implementation and reproducibility ensure the architecture’s accessibility for further research and industrial integration (Colagrande et al., 25 Jan 2026).
6. Implications for ML Accelerators and Future Directions
Large-scale machine learning accelerators instantiate vast numbers of processing elements (PEs), making even modest per-core gains highly consequential at system level. A 1.2× efficiency improvement at the PE level directly yields significant total area and power reductions. COPIFTv2’s queueing mechanism generalizes to other coprocessor scenarios such as INT→SIMD offload or data-centric stream architectures.
Potential future enhancements may include further queue configuration flexibility, expanded support for integration with compiler toolchains, and systematic exploration of queue-based synchronization patterns for broader architectural applicability (Colagrande et al., 25 Jan 2026).
COPIFTv2 demonstrates that dual-issue performance on lightweight, in-order RISC-V cores can be substantially improved through modest, queue-based hardware enhancements and a simplified, synchronization-oriented programming model—without incurring the software or architectural complexity characteristic of prior solutions. The approach enables practical, compiler-friendly dual-issue execution, suited to the aggressive constraints and scaling demands of modern ML accelerator fabrics.