
SPEED Processor: RISC-V & SFQ Architectures

Updated 6 February 2026
  • SPEED Processor is a dual-architecture design combining a RISC-V multi-precision DNN engine and an SFQ-based spiking neuromorphic system for high throughput and energy efficiency.
  • The RISC-V variant employs custom vector instructions, parameterized compute units, and hybrid dataflow strategies to achieve up to 13.8× throughput improvements and 80–90% memory access reduction.
  • The SFQ variant uses Josephson-junction circuits and spiking LIF networks, offering ultrahigh speed inference with energy efficiencies several orders of magnitude beyond conventional CMOS counterparts.

The SPEED Processor refers to two distinct but unrelated processor architectures—one a state-of-the-art scalable RISC-V vector processor for efficient multi-precision deep neural network (DNN) inference, and the other an ultrahigh speed spiking neuromorphic processor leveraging Single Flux Quantum (SFQ) logic. Both architectures propose significant advances over prior art in throughput, area efficiency, and energy consumption; however, they employ fundamentally different hardware substrates and computational paradigms.

1. RISC-V SPEED Processor: Enabling Efficient Multi-Precision DNN Inference

The RISC-V-based SPEED Processor addresses the throughput, precision, and dataflow limitations endemic to conventional RISC-V platforms when executing quantized multi-precision DNNs (MP-DNNs) (Wang et al., 2024, Wang et al., 2024). Its innovations span custom instruction sets, parameterized hardware microarchitecture, and hybrid dataflow techniques.

1.1 Instruction Set Architecture Extensions

SPEED augments the RISC-V Vector (RVV v1.0) ISA with three (or four, in later revisions) custom instructions:

  • VSACFG: Atomically sets operand precision (P ∈ {4, 8, 16} bits) and dataflow strategy (Feature-map-First/Channel-First) for subsequent operations, encoding operator type, kernel size, and precision in immediate fields. This condenses dozens of CSR writes into a single instruction.
  • VSALD: Simultaneous multi-broadcast vector load from memory to vector register file (VRF), parameterized by operand width, maximizing data reuse across vector lanes.
  • VSAM/VSAM.vv: Performs multi-dimensional matrix MAC or tensor contractions across all lanes, exploiting intra- and inter-PE parallelism dictated by P and tile geometry.
  • VSAC.vv (later revisions): Matrix×vector/tensor contraction-optimized variant (Wang et al., 2024).
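
To make the VSACFG idea concrete, here is a minimal Python sketch of packing the configuration state into a single immediate value. The field layout (bit positions and widths) is an illustrative assumption, not the actual SPEED encoding; only the condensing of many CSR writes into one instruction follows the text above.

```python
# Hypothetical immediate-field layout for a VSACFG-style instruction.
# Bit positions/widths below are assumptions for illustration only.

PRECISIONS = {4: 0b00, 8: 0b01, 16: 0b10}   # operand width -> 2-bit code
DATAFLOWS = {"FF": 0, "CF": 1}              # Feature-map-First / Channel-First

def encode_vsacfg(precision: int, dataflow: str, kernel_size: int, op_type: int) -> int:
    """Pack precision, dataflow, kernel size, and operator type into one
    immediate, replacing what would otherwise be several CSR writes."""
    assert precision in PRECISIONS and dataflow in DATAFLOWS
    imm = PRECISIONS[precision]             # bits [1:0]: precision
    imm |= DATAFLOWS[dataflow] << 2         # bit  [2]:   dataflow strategy
    imm |= (kernel_size & 0xF) << 3         # bits [6:3]: kernel size
    imm |= (op_type & 0x7) << 7             # bits [9:7]: operator type
    return imm

def decode_vsacfg(imm: int):
    """Recover the configuration tuple from the packed immediate."""
    prec = {v: k for k, v in PRECISIONS.items()}[imm & 0b11]
    flow = {v: k for k, v in DATAFLOWS.items()}[(imm >> 2) & 1]
    return prec, flow, (imm >> 3) & 0xF, (imm >> 7) & 0x7
```

A single `encode_vsacfg(8, "CF", 1, op)` call thus stands in for the dozens of individual CSR writes a baseline RVV flow would need before a pointwise-convolution kernel.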

Compared to the Ara RVV processor, SPEED achieves up to 46% instruction count reduction and 1.4× higher throughput for representative DNN kernels (Wang et al., 2024).

1.2 Parameterized Multi-Precision Compute Units

Central to SPEED's compute bandwidth is a multi-precision, parameterizable hardware accelerator (SAU or MPTU) per lane:

  • Each processing element (PE) integrates 16 four-bit multipliers and a 32-bit accumulator.
  • Hardware-level configurability allows mapping these multipliers to a single 16×16 MAC (16 bit), four 8×8 MACs (8 bit), or sixteen parallel 4×4 MACs (4 bit), corresponding to the current operand precision.
  • PE array tiling parameters (#TILE_R, #TILE_C) and total parallel MACs per cycle scale as P × POI × POW, where P is the intra-PE parallelism and POI, POW are the inter-PE parallelism factors (Wang et al., 2024, Wang et al., 2024).
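
The precision-to-parallelism mapping above can be sketched in a few lines. The 16 four-bit multipliers per PE follow the text; the example POI/POW values are illustrative assumptions.

```python
# Sketch of the lane-level multiplier budget: 16 four-bit multipliers
# per PE (per the text) are regrouped by operand precision.

MULT4_PER_PE = 16  # sixteen 4-bit multipliers per processing element

def macs_per_pe(precision: int) -> int:
    """A 16x16 MAC consumes all 16 four-bit multipliers, an 8x8 MAC
    consumes 4, and a 4x4 MAC consumes 1, so MACs per PE scale as
    16 / (P/4)^2 for P in {4, 8, 16}."""
    assert precision in (4, 8, 16)
    cost = (precision // 4) ** 2      # 4-bit multipliers per MAC at this width
    return MULT4_PER_PE // cost

def total_macs_per_cycle(precision: int, poi: int, pow_: int) -> int:
    """Total parallel MACs = intra-PE MACs x inter-PE factors POI, POW."""
    return macs_per_pe(precision) * poi * pow_
```

For example, with illustrative inter-PE factors POI = POW = 4, the array issues 256 parallel 4-bit MACs per cycle but only 16 at 16-bit, which is the 16× throughput spread visible in the peak-throughput table below.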

1.3 Dataflow Mapping and Reuse

SPEED implements mixed dataflow strategies to optimize computational and memory resource utilization:

  • FF (Feature-map First): Broadcasts each input tile across output channels and vertical windows to exploit kernel spatial overlap—optimal for standard convolution (K ≥ 3).
  • CF (Channel-First): Accumulates partial sums in-PE for each input-channel; eliminates multiple VRF round-trips—optimal for pointwise (1×1) convolutions.
  • FFCS/FF (Depthwise CONV): Supports data independence for depthwise layers.

Hybrid mapping automatically reduces total memory accesses (M_in + M_wt) by 80–90% versus baseline RVV (Wang et al., 2024).
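
A toy access-count model shows where savings of this magnitude come from. This is not the paper's model: it compares a no-reuse baseline (every MAC re-reads its operands) against an idealized full-reuse mapping (each input and weight loaded once), so it gives an upper bound on the reduction; the reported 80–90% sits below this ideal because real tiles only partially fit on chip.

```python
# Toy model: memory accesses for one conv layer with and without
# operand reuse. Idealized full reuse is an upper bound, not the
# paper's exact dataflow model.

def accesses_no_reuse(h, w, c_in, c_out, k):
    """No reuse: one input read + one weight read per MAC."""
    macs = h * w * c_in * c_out * k * k
    return 2 * macs

def accesses_full_reuse(h, w, c_in, c_out, k):
    """Ideal reuse: each input element and each weight loaded once."""
    m_in = h * w * c_in            # input feature map
    m_wt = c_in * c_out * k * k    # weight tensor
    return m_in + m_wt

# Illustrative 3x3 conv layer (assumed shape, not from the paper)
h = w = 32
c_in = c_out = 64
k = 3
reduction = 1 - accesses_full_reuse(h, w, c_in, c_out, k) / accesses_no_reuse(h, w, c_in, c_out, k)
```

Even this crude model yields a reduction well above 90% for a typical 3×3 layer, which makes the reported 80–90% figure for a practical hybrid FF/CF mapping plausible.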

1.4 Performance Metrics and Area Efficiency

Physical implementation on TSMC 28 nm yields the following (for a 4-lane baseline with TILE_R = TILE_C = 4 (Wang et al., 2024), refined further in (Wang et al., 2024)):

| Precision | Peak Throughput (GOPS) | Area Eff. (GOPS/mm²) | Energy Eff. (GOPS/W) |
|---|---|---|---|
| 16 bit | 34.89 | 31.72 | 162.15 |
| 8 bit | 93.65 | 85.13 | 435.25 |
| 4 bit | 287.41–737.9 | 261.28–614.6 | 1335.79–1383.4 |

Compared to Ara, area-efficiency improvements range from 1.63× (8 bit) to 2.04× (16 bit), and reach up to 26.9× versus wider RVV baseline designs, with throughput gains as high as 13.8× on full DNN benchmarks (Wang et al., 2024, Wang et al., 2024).

2. SFQ SPEED Processor: Single Flux Quantum Spiking Neuromorphic Architecture

Independently, the SPEED Processor also refers to an ultrahigh speed, energy-efficient spiking neuromorphic processor employing SFQ (Single Flux Quantum) logic and Josephson-junction-based circuits (Bozbey et al., 2018). This paradigm is fundamentally distinct, targeting brain-inspired leaky integrate-and-fire (LIF) networks for inference.

2.1 JJ–Neuron and Synaptic Circuit Design

  • JJ–Synapse: A cascade of SQUID “SM1” cells, each representing unit weight, switchable for excitatory/inhibitory influence. Buffer/Quantizer (BQ) converts net flux to signed SFQ pulse trains, functioning as quantized weighted summation.
  • JJ–Soma: Two-loop mutually inductive LIF circuit; accumulates input until a threshold current is reached, then emits a spike via Josephson switching.
  • Pulse-Based Communication: XORs, fan-out, and timing logic all use SFQ pulses as energy-coded information quanta.
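
A behavioral sketch of the JJ-soma's leaky integrate-and-fire dynamics: incoming SFQ pulses add to a loop current that leaks over time and triggers an output pulse once it crosses the Josephson switching threshold. The numerical parameters (leak rate, weight, threshold) are illustrative stand-ins, not device values.

```python
# Behavioral LIF model mirroring the two-loop JJ-soma: leak, integrate,
# fire on threshold, then reset as the flux quantum is released.
# All parameter values are illustrative, not circuit-derived.

def lif_soma(pulses, threshold=1.0, leak=0.9, weight=0.3):
    """pulses: iterable of 0/1 input events per time step.
    Returns the output spike train as a list of 0/1 values."""
    state = 0.0
    spikes = []
    for p in pulses:
        state = state * leak + weight * p   # leak, then integrate input flux
        if state >= threshold:
            spikes.append(1)                # emit SFQ output pulse
            state = 0.0                     # flux released; loop resets
        else:
            spikes.append(0)
    return spikes
```

With these parameters, a sustained input pulse train makes the soma fire roughly every fourth step, while an idle input produces no output, matching the qualitative accumulate-then-spike behavior described above.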

2.2 Performance Characterization

  • IRIS Benchmark: A 4-4-3 perceptron demonstrates 1.2×10^10 SOPS at 8.6×10^11 SOPS/W (including 400× cryocooler overhead).
  • Energy per Operation: Each SFQ event dissipates ≈2.25×10^-19 J.
  • Scalability:
    • 16–16–16 network: 5.12×10^11 SOPS, ≈2.5×10^13 SOPS/W.
    • 64–64–64–64 network: 1.23×10^13 SOPS, scaling to ≈10^18 SOPS at ≈10^17 SOPS/W on a 2 W cryocooler MCM-level system (Bozbey et al., 2018).
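
The relation between on-chip and wall-plug efficiency in these figures is simple division by the cryocooler overhead, which a short sketch makes explicit. The on-chip figure used in the example is back-computed from the reported 8.6×10^11 SOPS/W and 400× factor, i.e. it is an inferred value, not one quoted in the source.

```python
# Wall-plug efficiency = on-chip efficiency / cryocooler overhead.
# The 400x factor is from the text; the on-chip SOPS/W value below is
# back-computed for illustration, not quoted from the paper.

def wall_plug_sops_per_watt(on_chip_sops_per_watt: float,
                            cryocooler_factor: float = 400.0) -> float:
    """System-level SOPS/W after accounting for cryogenic cooling power."""
    return on_chip_sops_per_watt / cryocooler_factor

on_chip = 3.44e14                       # inferred: 8.6e11 * 400
system = wall_plug_sops_per_watt(on_chip)
```

This is why SFQ comparisons must always state whether the cryocooler is included: the same device-level numbers differ by more than two orders of magnitude at the system level.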

2.3 On-Chip Programmability and Training

  • Training: Off-line gradient descent produces real-valued weights, which are then quantized via a genetic optimizer to the five-level integers supported by the SM1/C1–C4 hardware.
  • Configurability: CMOS logic programs synaptic configuration in situ, supporting re-trainability post-fabrication.
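
As a minimal illustration of mapping trained weights onto five-level integer synapses, the sketch below uses nearest-level rounding with clamping. The source uses a genetic optimizer for this step; simple rounding is a hypothetical stand-in that only shows the target value set, not the paper's method.

```python
# Quantize real-valued trained weights to the five integer levels
# {-2, -1, 0, 1, 2} the SM1 cell cascade supports. Nearest-level
# rounding here is an illustrative stand-in for the paper's genetic
# optimizer; `scale` is an assumed per-layer scaling factor.

LEVELS = (-2, -1, 0, 1, 2)

def quantize_weight(w: float, scale: float = 1.0) -> int:
    """Round a real weight to the nearest supported level, clamped."""
    q = round(w / scale)
    return max(LEVELS[0], min(LEVELS[-1], q))

def quantize_layer(weights, scale: float = 1.0):
    """Quantize a flat list of layer weights."""
    return [quantize_weight(w, scale) for w in weights]
```

Because the hardware is reprogrammable in situ via CMOS logic, re-running this quantization step after retraining is enough to redeploy a new network on the same fabricated chip.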

2.4 Comparative Platform Evaluation

SFQ neuromorphic SPEED achieves 5–6 orders of magnitude higher energy efficiency compared to CMOS and one order over projected nanophotonic architectures, with comparable area density. This suggests a distinct Pareto frontier for ultralow-power applications limited by cryogenic system integration (Bozbey et al., 2018).

3. Comparative Summary Table

| Variant | Substrate | Core Objective | Peak Perf. (custom benchmark) | Energy Eff. (Max) | Distinctive Features |
|---|---|---|---|---|---|
| RISC-V SPEED [2409...] | CMOS (TSMC 28 nm) | Multi-precision DNN inference | 737.9 GOPS @ 4 bit | 1383 GOPS/W | Custom RVV ISA, param. tensor units, hybrid dataflows |
| SFQ SPEED [1812...] | Josephson SFQ | Spiking LIF neuromorphic inference | 10^18 SOPS (projected) | 10^17 SOPS/W | LIF JJ-neurons, full SFQ pipeline, program. integer synapses, low temp. |

4. Design Trade-Offs, Limitations, and Extensions

4.1 RISC-V SPEED

  • Trade-offs: Custom vector decode logic adds complexity versus the RVV-only baseline; the area split heavily favors lane/compute blocks over cross-lane VRF or queue resources.
  • Limits: No hardware support for sub-4-bit or non-integer precision, external DRAM bandwidth bottlenecks, and a modest control penalty for frequent precision switches.
  • Future Directions: On-chip NVM/embedded DRAM, fine-grained dynamic precision, SIMD predication, and support for sparsity or activation compression (Wang et al., 2024).

4.2 SFQ SPEED

  • Trade-offs: Ultra-high efficiency and density, offset by cryogenic integration overhead and integer-only weights.
  • Limits: Quantization required for synaptic weights; cooling and interconnect dominate scalability at wafer scale.
  • Potential: Further improvements anticipated via zero-static-power logic (AQFP/eRSFQ) and improved MCM-level integration (Bozbey et al., 2018).

5. Significance and Impact

Both architectures advance the state of the art in computational platforms for DNN or spiking neural network inference. The RISC-V SPEED processor establishes a new efficiency and throughput baseline for multi-precision quantized DNNs, suited to resource-constrained edge AI deployments with highly variable per-layer quantization requirements (Wang et al., 2024, Wang et al., 2024). The SFQ SPEED processor demonstrates the physical viability and extreme energy-performance scaling of superconducting LIF neuromorphic systems, motivating continued research into cryogenic control and large-scale system integration for non-von Neumann neuromorphic computing (Bozbey et al., 2018).
