COPIFTv2: Dual-Issue RISC-V Enhancement
- COPIFTv2 is an architectural and programming refinement that optimizes lightweight RISC-V cores by using queue-based synchronization for efficient dual-issue execution.
- It simplifies the programming model by replacing complex software-driven synchronization with API-driven thread management and explicit queue operations.
- COPIFTv2 achieves significant improvements in IPC and energy efficiency with negligible area overhead, providing practical benefits for large-scale ML accelerators.
COPIFTv2 is an architectural and programming refinement for lightweight RISC-V cores, designed to maximize the efficiency and programmability of dual-issue execution under stringent area and energy constraints. Addressing the limitations of the earlier COPIFT approach, COPIFTv2 introduces a hardware-based, queue-centric mechanism for synchronizing and transferring data between integer and floating-point (FP) threads. This enables near-peak dual-issue performance in tiny in-order processors such as Snitch, with substantially reduced software complexity and overhead, directly addressing scalability requirements in large-scale machine learning accelerators (Colagrande et al., 25 Jan 2026).
1. Historical Context and Motivation
The original COPIFT (2015–2025) sought to exploit the decoupled integer/FP execution in lightweight RISC-V designs by orchestrating batches of work between two threads (an I-thread and an F-thread), using software pipelining and memory buffers for synchronization. Despite achieving up to 1.75× instructions per cycle (IPC) on mixed workloads, COPIFT suffered from several intrinsic bottlenecks:
- Complex code transformations: Required multi-buffer tiling, explicit software pipelining, and batch-size tuning.
- High software overheads: Relied on frequent spill/reload of synchronization data to memory, increasing latency and power.
- Limited fine-grain communication: Dependencies could only resolve at batch boundaries, delaying critical-path dataflow.
The target architecture—Snitch core—features a single-issue, in-order RISC-V microarchitecture, with a Tiny Floating-Point Subsystem (FPSS). It implements pseudo–dual-issue by dispatching all instructions through the integer pipeline, offloading FP operations to the FPSS. Core design constraints enforce area budgets below 0.1 mm² and power below 1 mW at 1 GHz, precluding out-of-order logic or complex renaming (Colagrande et al., 25 Jan 2026).
2. Architectural Innovations
COPIFTv2 eliminates software-centric synchronization by embedding two lightweight, hardware-based first-in/first-out (FIFO) queues within the core, facilitating direct, order-preserving communication and synchronization between integer and FP threads:
- Queue Configuration:
- Two queues per core: I2F (integer-to-FP) and F2I (FP-to-integer)
- Each queue: 8-entry depth (configurable), 32/64-bit width, implemented in standard-cell memory (SCM)
- Control logic: head/tail pointers, full/empty flags, and modulo (wrap-around) counters
- Operation Semantics:
- Enqueue ("push") into the queue is blocked if full; dequeue ("pop") blocks if empty, enforcing natural producer-consumer synchronization.
- Integer (I) thread and FP (F) thread communicate exclusively via these queues (no indirect memory buffers required).
- At pipeline integration points:
- Integer pipeline write-back: a write to the queue-mapped destination register with the queue CSR enabled triggers an I2F push.
- FPSS write-back: a write to the "virtual" queue-mapped destination register (rd) pushes onto F2I.
- Decode: reading the queue-mapped source register (rs) pops from the corresponding queue.
This minimal hardware extension introduces area overhead of ≲1%, minimally disrupts timing (no impact on 1 GHz critical path), and requires only trivial handshake logic (Colagrande et al., 25 Jan 2026).
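The queue control logic described above can be sketched as a small software model. The following is a minimal, illustrative Python class (names such as `FifoQueueModel` are ours, not from the COPIFTv2 RTL) with head/tail pointers, full/empty flags, and modulo wrap-around counters, mirroring the described 8-entry configuration:

```python
class FifoQueueModel:
    """Software model of one COPIFTv2 hardware FIFO (I2F or F2I).

    Mirrors the described control logic: head/tail pointers,
    full/empty flags, and modulo (wrap-around) counters.
    """

    def __init__(self, depth=8):
        self.depth = depth
        self.mem = [None] * depth   # SCM storage in hardware
        self.head = 0               # next entry to pop
        self.tail = 0               # next free slot to push into
        self.count = 0              # occupancy; flags derive from it

    @property
    def full(self):
        return self.count == self.depth

    @property
    def empty(self):
        return self.count == 0

    def push(self, value):
        """Enqueue; in hardware the producer stalls while the queue is full."""
        if self.full:
            return False            # model the stall as a refused push
        self.mem[self.tail] = value
        self.tail = (self.tail + 1) % self.depth  # modulo wrap-around
        self.count += 1
        return True

    def pop(self):
        """Dequeue; in hardware the consumer stalls while the queue is empty."""
        if self.empty:
            return None             # model the stall as a refused pop
        value = self.mem[self.head]
        self.head = (self.head + 1) % self.depth
        self.count -= 1
        return value
```

In hardware, a push to a full queue (or a pop from an empty one) simply back-pressures the corresponding pipeline stage; the model returns a failure status instead of stalling.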
3. Programming Model
COPIFTv2 replaces the original's complex, transformation-heavy software approach with a simplified model based on API-driven thread management and intra-iteration parallelization:
- Model Simplification:
- Eliminates multi-buffer tiling and inter-batch software pipelining.
- No inter-iteration scheduling or explicit modulo scheduling.
- Algorithmic Steps:
- Construct the data-flow graph (DFG) of interleaved integer and FP ops.
- Partition into I-only and F-only subgraphs.
- Independently schedule for maximum overlap.
- Replace inter-thread dependencies with explicit queue push/pop operations.
- Enclose FP subgraph within the hardware FREP loop.
- API and Synchronization:
- Thread launch/join library calls:
```c
int launch_fp_thread(void *entry_point);
void join_fp_thread();
```
- Launch configures the relevant CSR, initializes queue addresses, and forks FP execution.
- Synchronization is managed implicitly by draining the corresponding queue at join.
- Pseudocode Example:
```
// Integer thread (producer)
enable_copift_queues();
launch_fp_thread(fp_worker);
for (i = 0; i < N; i++) {
    t = compute_index(i);
    MOV x31, t;             // Push t onto I2F
}
join_fp_thread();
```

```
// FP thread (consumer)
fp_worker() {
    for (i = 0; i < N; i++) {
        FCVT.D.W ft0, x31;      // Pop from I2F into ft0
        ft1 = fmul(ft0, CONST);
        FMV.X.D x31, ft1;       // Push ft1 onto F2I
    }
    return;
}
```
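The producer/consumer pattern above can be exercised on a host machine with a minimal Python analogue, using bounded `queue.Queue` objects in place of the I2F/F2I hardware FIFOs (blocking `put`/`get` matches the push/pop stall semantics). `compute_index`, `N`, and `CONST` are stand-ins for the kernel's actual integer work and constant, chosen only for illustration:

```python
import threading
import queue

DEPTH, N, CONST = 8, 16, 2.5
i2f = queue.Queue(maxsize=DEPTH)   # integer -> FP queue
f2i = queue.Queue(maxsize=DEPTH)   # FP -> integer queue

def compute_index(i):              # stand-in for the I-thread's integer work
    return 3 * i + 1

def int_thread():                  # producer: pushes onto I2F
    for i in range(N):
        i2f.put(compute_index(i))  # blocks when the queue is full

def fp_worker():                   # consumer: pops I2F, pushes onto F2I
    for _ in range(N):
        t = i2f.get()              # blocks when the queue is empty
        f2i.put(t * CONST)

it = threading.Thread(target=int_thread)
ft = threading.Thread(target=fp_worker)
it.start(); ft.start()

# Drain F2I while the threads run (an 8-deep queue cannot hold all N results),
# then join -- the analogue of join_fp_thread() draining the queues.
results = [f2i.get() for _ in range(N)]
it.join(); ft.join()
```

Because both queues preserve order and block at the boundaries, no other synchronization is needed, which is exactly the property the hardware FIFOs provide to the I- and F-threads.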
4. Performance Characteristics
COPIFTv2 substantially increases both throughput and energy efficiency versus COPIFT, as demonstrated with six mixed integer/FP kernels under fixed core power and frequency budgets.
| Benchmark | IPC (baseline) | IPC (COPIFT) | IPC (COPIFTv2) | Speedup (v2 vs COPIFT) | Energy gain (v2 vs COPIFT) |
|---|---|---|---|---|---|
| exp | 0.92 | 1.58 | 1.78 | 1.13× | 1.15× |
| sin | 0.88 | 1.62 | 1.81 | 1.12× | 1.17× |
| poly_lcg | 0.80 | 1.15 | 1.21 | 1.05× | 1.09× |
| dot | 0.74 | 1.42 | 1.68 | 1.18× | 1.20× |
| matmul | 0.70 | 1.55 | 1.78 | 1.15× | 1.21× |
| fft | 0.65 | 1.48 | 1.70 | 1.15× | 1.18× |
| Geomean | — | 1.48× | 1.73× | 1.19× | 1.21× |
- Peak IPC: 1.81 with COPIFTv2 (compared to 1.62 under COPIFT)
- Wall-clock speedup: Up to 1.49×
- Energy-efficiency gain: Up to 1.47× over COPIFT
- Power consumption: Remains within 5% of COPIFT, as reduced memory traffic offsets the higher datapath utilization
- Overall: Achieves nearly 90% of ideal dual-issue throughput (2.0 IPC) on in-order, area- and energy-limited silicon (Colagrande et al., 25 Jan 2026).
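The wall-clock and energy gains exceed the raw IPC improvement because COPIFTv2 also eliminates dynamic instructions (the spill/reload synchronization code). With execution time t = N_instr / (IPC · f), a short sketch makes the compounding explicit; the instruction counts below are illustrative assumptions, not figures from the paper:

```python
# Execution time of a kernel on a fixed-frequency core:
#   t = n_instr / (IPC * f)
def exec_time(n_instr, ipc, f_hz=1e9):   # 1 GHz, as in the Snitch target
    return n_instr / (ipc * f_hz)

# Hypothetical dynamic instruction counts: COPIFT executes extra
# spill/reload synchronization instructions that COPIFTv2's direct
# queue push/pops eliminate.
n_copift, ipc_copift = 1_300_000, 1.48
n_v2, ipc_v2 = 1_000_000, 1.73

speedup = exec_time(n_copift, ipc_copift) / exec_time(n_v2, ipc_v2)
ipc_only = ipc_v2 / ipc_copift
# speedup = (n_copift / n_v2) * (ipc_v2 / ipc_copift), so wall-clock
# gain exceeds the IPC ratio whenever the instruction count also shrinks
```

This is why a ~1.2× geomean IPC gain can coexist with wall-clock speedups approaching 1.5×: the two factors multiply.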
5. Trade-offs and Comparative Positioning
COPIFTv2 introduces negligible hardware overhead (≲1% area, two small queues, single CSR), no impact on core clock, and minimal complexity in control logic (simple head/tail pointers with blocking handshake). When compared to alternate approaches:
- Snitch+COPIFTv2: Delivers 1.96× IPC and 1.75× energy gain versus base single-issue Snitch.
- NVIDIA Turing SMs: Offer similar INT/FP concurrency but lack open-source PPA data for direct comparison.
- HAMSA-DI (dual-issue VLIW): COPIFTv2 achieves superior energy efficiency on mixed workloads, attributed to zero-overhead synchronization and finer-granularity communication.
Open-source implementation and reproducibility ensure the architecture’s accessibility for further research and industrial integration (Colagrande et al., 25 Jan 2026).
6. Implications for ML Accelerators and Future Directions
Large-scale machine learning accelerators instantiate vast numbers of processing elements (PEs), making even modest per-core gains highly consequential at system level. A 1.2× efficiency improvement at the PE level directly yields significant total area and power reductions. COPIFTv2’s queueing mechanism generalizes to other coprocessor scenarios such as INT→SIMD offload or data-centric stream architectures.
Potential future enhancements may include further queue configuration flexibility, expanded support for integration with compiler toolchains, and systematic exploration of queue-based synchronization patterns for broader architectural applicability (Colagrande et al., 25 Jan 2026).
COPIFTv2 demonstrates that dual-issue performance on lightweight, in-order RISC-V cores can be substantially improved through modest, queue-based hardware enhancements and a simplified, synchronization-oriented programming model—without incurring the software or architectural complexity characteristic of prior solutions. The approach enables practical, compiler-friendly dual-issue execution, suited to the aggressive constraints and scaling demands of modern ML accelerator fabrics.