
COPIFTv2: Dual-Issue RISC-V Enhancement

Updated 27 January 2026
  • COPIFTv2 is an architectural and programming refinement that optimizes lightweight RISC-V cores by using queue-based synchronization for efficient dual-issue execution.
  • It simplifies the programming model by replacing complex software-driven synchronization with API-driven thread management and explicit queue operations.
  • COPIFTv2 achieves significant improvements in IPC and energy efficiency with negligible area overhead, providing practical benefits for large-scale ML accelerators.

COPIFTv2 is an architectural and programming refinement for lightweight RISC-V cores, designed to maximize the efficiency and programmability of dual-issue execution under stringent area and energy constraints. Building upon the limitations of the earlier COPIFT approach, COPIFTv2 introduces a hardware-based, queue-centric mechanism for synchronizing and transferring data between integer and floating-point (FP) threads. This enables near-peak dual-issue performance in tiny in-order processors such as Snitch, with substantially reduced software complexity and overhead, directly addressing scalability requirements in large-scale machine learning accelerators (Colagrande et al., 25 Jan 2026).

1. Historical Context and Motivation

Original COPIFT (2015–2025) sought to exploit the decoupled integer/FP execution in lightweight RISC-V designs by orchestrating batches of work between two threads (I-thread and F-thread), using software pipelining and memory buffers for synchronization. Despite achieving up to 1.75× instructions per cycle (IPC) on mixed workloads, COPIFT suffered from multiple intrinsic bottlenecks:

  • Complex code transformations: Required multi-buffer tiling, explicit software pipelining, and batch-size tuning.
  • High software overheads: Relied on frequent spill/reload of synchronization data to memory, increasing latency and power.
  • Limited fine-grain communication: Dependencies could only resolve at batch boundaries, delaying critical-path dataflow.

The target architecture—Snitch core—features a single-issue, in-order RISC-V microarchitecture, with a Tiny Floating-Point Subsystem (FPSS). It implements pseudo–dual-issue by dispatching all instructions through the integer pipeline, offloading FP operations to the FPSS. Core design constraints enforce area budgets below 0.1 mm² and power below 1 mW at 1 GHz, precluding out-of-order logic or complex renaming (Colagrande et al., 25 Jan 2026).

2. Architectural Innovations

COPIFTv2 eliminates software-centric synchronization by embedding two lightweight, hardware-based first-in/first-out (FIFO) queues within the core, facilitating direct, order-preserving communication and synchronization between integer and FP threads:

  • Queue Configuration:
    • Two queues per core: I2F (integer-to-FP) and F2I (FP-to-integer)
    • Each queue: 8-entry depth (configurable), 32/64-bit width, SCM-based storage
    • Control logic: head/tail pointers, full/empty flags, modulo-N counter
  • Operation Semantics:
    • Enqueue ("push") into the queue is blocked if full; dequeue ("pop") blocks if empty, enforcing natural producer-consumer synchronization.
    • Integer (I) thread and FP (F) thread communicate exclusively via these queues (no indirect memory buffers required).
    • At pipeline integration points:
      • Integer pipeline write-back: writing to x31 with the queue CSR enabled triggers an I2F push.
      • FPSS write-back: a "virtual" rd=x31 write pushes onto F2I.
      • Decoding: reading rs=x31 pops from the corresponding queue.

This minimal hardware extension introduces area overhead of ≲1%, minimally disrupts timing (no impact on 1 GHz critical path), and requires only trivial handshake logic (Colagrande et al., 25 Jan 2026).

3. Programming Model

COPIFTv2 replaces the original's complex, transformation-heavy software approach with a simplified model based on API-driven thread management and intra-iteration parallelization:

  • Model Simplification:
    • Eliminates multi-buffer tiling and inter-batch software pipelining.
    • No inter-iteration scheduling or explicit modulo scheduling.
  • Algorithmic Steps:
  1. Construct the data-flow graph (DFG) of interleaved integer and FP ops.
  2. Partition into I-only and F-only subgraphs.
  3. Independently schedule for maximum overlap.
  4. Replace inter-thread dependencies with explicit queue pushes/pops (via x31).
  5. Enclose FP subgraph within the hardware FREP loop.
  • API and Synchronization:
    • Thread launch/join library calls:
      • int launch_fp_thread(void *entry_point);
      • void join_fp_thread();
    • Launch configures the relevant CSR, initializes queue addresses, and forks FP execution.
    • Synchronization is managed implicitly by draining the corresponding queue at join.
  • Pseudocode Example:

// Integer thread (producer)
enable_copift_queues();
launch_fp_thread(fp_worker);
for (int i = 0; i < N; i++) {
  t = compute_index(i);
  MV x31, t;   // Push t onto I2F via the queue-mapped x31
}
join_fp_thread();

// FP thread (consumer)
void fp_worker() {
  for (int i = 0; i < N; i++) {
    FCVT.D.W ft0, x31;       // Pop from I2F; convert the integer to double in ft0
    FMUL.D ft1, ft0, fconst; // ft1 = ft0 * CONST
    FMV.X.D x31, ft1;        // Push ft1 onto F2I (virtual rd = x31)
  }
}
This API allows straightforward compiler automation and avoids error-prone manual descriptor management (Colagrande et al., 25 Jan 2026).

4. Performance Characteristics

COPIFTv2 substantially increases both throughput and energy efficiency versus COPIFT, as demonstrated with six mixed integer/FP kernels under fixed core power and frequency budgets.

Bench      IPC_baseline  IPC_COPIFT  IPC_COPIFTv2  Speedup_COPIFTv2  EnergyGain
exp        0.92          1.58        1.78          1.13×             1.15×
sin        0.88          1.62        1.81          1.12×             1.17×
poly_lcg   0.80          1.15        1.21          1.05×             1.09×
dot        0.74          1.42        1.68          1.18×             1.20×
matmul     0.70          1.55        1.78          1.15×             1.21×
fft        0.65          1.48        1.70          1.15×             1.18×
Geomean    —             1.48×       1.73×         1.19×             1.21×
  • Peak IPC: 1.81 with COPIFTv2 (compared to 1.62 under COPIFT)
  • Wall-clock speedup: Up to 1.49×
  • Energy-efficiency gain: Up to 1.47× over COPIFT
  • Power consumption: Remains within 5% of COPIFT due to reduced memory traffic at higher utilization
  • Overall: Achieves nearly 90% of ideal dual-issue throughput (2.0 IPC) on in-order, area- and energy-limited silicon (Colagrande et al., 25 Jan 2026).

5. Trade-offs and Comparative Positioning

COPIFTv2 introduces negligible hardware overhead (≲1% area, two small queues, single CSR), no impact on core clock, and minimal complexity in control logic (simple head/tail pointers with blocking handshake). When compared to alternate approaches:

  • Snitch+COPIFTv2: Delivers up to 1.96× higher IPC and a 1.75× energy gain versus the base single-issue Snitch.
  • NVIDIA Turing SMs: Offer similar INT/FP concurrency but lack open-source PPA data for direct comparison.
  • HAMSA-DI (dual-issue VLIW): COPIFTv2 achieves superior energy efficiency on mixed workloads, attributed to zero-overhead synchronization and finer-granularity communication.

Open-source implementation and reproducibility ensure the architecture’s accessibility for further research and industrial integration (Colagrande et al., 25 Jan 2026).

6. Implications for ML Accelerators and Future Directions

Large-scale machine learning accelerators instantiate vast numbers of processing elements (PEs), making even modest per-core gains highly consequential at system level. A 1.2× efficiency improvement at the PE level directly yields significant total area and power reductions. COPIFTv2’s queueing mechanism generalizes to other coprocessor scenarios such as INT→SIMD offload or data-centric stream architectures.

Potential future enhancements may include further queue configuration flexibility, expanded support for integration with compiler toolchains, and systematic exploration of queue-based synchronization patterns for broader architectural applicability (Colagrande et al., 25 Jan 2026).


COPIFTv2 demonstrates that dual-issue performance on lightweight, in-order RISC-V cores can be substantially improved through modest, queue-based hardware enhancements and a simplified, synchronization-oriented programming model—without incurring the software or architectural complexity characteristic of prior solutions. The approach enables practical, compiler-friendly dual-issue execution, suited to the aggressive constraints and scaling demands of modern ML accelerator fabrics.
