CPM.cu: Pipelined NN Inference Accelerator
- The CPM.cu Acceleration Pipeline is a co-designed methodology that maps neural network layers to specialized computational memory (CM) cores using polyhedral-based scheduling.
- It integrates analog crossbar arrays, digital processing units, and local SRAM to perform in situ matrix computations while minimizing reconfiguration delays.
- The pipeline achieves efficient, energy-saving inference through static scheduling and explicit dataflow control, ensuring robust parallel execution.
The CPM.cu Acceleration Pipeline is a hardware-software co-design methodology aimed at efficient streaming computation, particularly neural network inference, on computational memory (CM) accelerators. It embodies a pipelined, dataflow-driven approach in which neural network layers are statically mapped across an array of specialized cores equipped with crossbar memory arrays, local memory, and digital processing units. The primary innovations are polyhedral-based compile-time scheduling and explicit generation of local control logic, which together maximize parallelism, minimize reconfiguration delays, and enforce data dependencies efficiently within the pipeline architecture (Kourtis et al., 2020).
1. Hardware Architecture and Partitioning Principles
The accelerator comprises multiple “CM cores,” each integrating:
- An analog crossbar memory array (XBAR) for in situ matrix–vector multiplication (MxV),
- A lightweight digital processing unit (DPU) for post-processing tasks (e.g., non-linear functions, pooling),
- A small local SRAM memory (kB scale),
- A local controller (LCU) implementing partition-specific control logic,
- Explicit on-chip interconnect allowing directed data and control flow between cores.
A global memory buffer (GMEM) and a global control unit (GCU) orchestrate communication between the external host and the array of CM cores. The hardware design exposes the full interconnect topology—modeled as a directed acyclic graph—to the software compilation stack, ensuring that mapping of neural network layers onto cores respects topology, memory capacity, and the absence of dataflow cycles.
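Because the interconnect topology is exposed to the compiler as a directed acyclic graph, a natural first validation step is to check that the topology is in fact cycle-free. The sketch below (a hypothetical topology with invented core names, not taken from the paper) models the interconnect as adjacency lists and applies Kahn's algorithm:

```python
# Minimal sketch: checking that a hypothetical CM-core interconnect is a DAG,
# as required before neural network layers can be mapped onto the cores.
from collections import deque

def is_dag(edges):
    """Kahn's algorithm: True iff the interconnect graph has no cycles."""
    nodes = set(edges) | {d for dsts in edges.values() for d in dsts}
    indeg = {n: 0 for n in nodes}
    for dsts in edges.values():
        for d in dsts:
            indeg[d] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for d in edges.get(n, []):
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return visited == len(nodes)

# Hypothetical topology: global memory feeds core0, which fans out to core1
# and core2; both converge on core3, which drains back to global memory.
topology = {
    "GMEM_in": ["core0"],
    "core0": ["core1", "core2"],
    "core1": ["core3"],
    "core2": ["core3"],
    "core3": ["GMEM_out"],
}
assert is_dag(topology)
```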
Neural network graph partitioning proceeds such that each core implements at most a single convolution or other crossbar-mapped operation per partition phase. Cyclic dependencies are statically disallowed, and partition mapping is formulated and solved as an SMT constraint satisfaction problem (e.g., via Z3 constraints on I/O relations and hardware resources).
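The sketch below shows how such a mapping could be posed to the Z3 solver through its Python bindings. The layer names, crossbar capacities, and link set are hypothetical, and the encoding is a simplified stand-in for the actual constraint formulation over I/O relations and hardware resources:

```python
# Minimal sketch: assigning layers to CM cores with Z3 under (hypothetical)
# crossbar-capacity and interconnect constraints.
from z3 import Int, Solver, And, Or, Distinct, sat

layers = ["conv1", "conv2", "conv3"]                      # crossbar-mapped ops, in dataflow order
weights = {"conv1": 4096, "conv2": 8192, "conv3": 8192}   # weight counts per layer
xbar_capacity = [8192, 8192, 16384]                       # per-core crossbar capacity
links = {(0, 1), (1, 2), (0, 2)}                          # directed core-to-core links

s = Solver()
place = {l: Int(f"place_{l}") for l in layers}            # core index chosen for each layer

# Each layer must land on a core whose crossbar can hold its weights.
for l in layers:
    s.add(Or(*[place[l] == c
               for c in range(len(xbar_capacity))
               if weights[l] <= xbar_capacity[c]]))

# At most one crossbar-mapped operation per core per partition phase.
s.add(Distinct(*place.values()))

# Producer/consumer layers must sit on cores joined by a hardware link.
for prod, cons in zip(layers, layers[1:]):
    s.add(Or(*[And(place[prod] == a, place[cons] == b) for (a, b) in links]))

if s.check() == sat:
    m = s.model()
    print({l: m[place[l]] for l in layers})   # e.g. {'conv1': 0, 'conv2': 1, 'conv3': 2}
```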
2. Streaming Dataflow and Pipelined Execution Model
Computation is conceptualized as a dataflow engine: each core executes its assigned layer(s), receiving its input activations (or intermediate results) from upstream cores or global memory, performing the relevant computation, then streaming the outputs to downstream cores or global memory.
Crossbar configuration happens once: all crossbars are loaded with their associated weights at initialization. Thereafter, streaming inference proceeds in a pipelined fashion: in each cycle, activations are transferred from SRAM to the XBAR, MxV is performed, and the DPU executes auxiliary computations or initiates data forwarding. Crucially, expensive crossbar reconfiguration is avoided between inference passes, amortizing the one-time setup over repeated operation.
Cores act as pipeline stages: each waits until the data it depends on has been produced before proceeding. This enables overlapping processing across layers, provided dataflows and buffer sizes are statically analyzed to prevent stalls or hazards.
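As a software analogue of this execution model, the sketch below chains generator-based "core stages" so that each stage consumes activations from its upstream neighbor and streams results downstream. It is purely illustrative: real CM cores run concurrently in hardware, the dense matrix product stands in for the analog MxV plus DPU post-processing, and all shapes and names are invented for the example:

```python
# Minimal sketch of the streaming, dataflow-driven execution model.
import numpy as np

def core_stage(upstream, xbar_weights, activation=np.tanh):
    """One CM core: crossbar MxV on each incoming tile, then DPU post-processing."""
    for x in upstream:                 # proceed only when upstream data is available
        y = xbar_weights @ x           # stand-in for the in-situ matrix-vector multiply
        yield activation(y)            # stand-in for DPU work, then forward downstream

def host_input(n_tiles, dim, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_tiles):
        yield rng.standard_normal(dim)

# Hypothetical two-layer network mapped onto two cores; weights loaded once.
w0 = np.random.default_rng(1).standard_normal((16, 32))
w1 = np.random.default_rng(2).standard_normal((8, 16))
pipeline = core_stage(core_stage(host_input(4, 32), w0), w1)
for out in pipeline:
    print(out.shape)                   # (8,) for each streamed input vector
```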
3. Polyhedral Compilation and Automated Control Synthesis
A central technical challenge resides in generating correct core-local control logic (LCU firmware) to guarantee data dependencies are strictly honored, particularly under tiling and multi-dimensional iteration within deep learning operators (e.g., convolutions).
The polyhedral model is employed: each partition is associated with iteration domains and affine access functions, represented explicitly in integer set notation (e.g., via ISL). For each shared buffer or tensor, the compiler constructs read and write access relations.
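As an illustration, the sketch below uses the islpy bindings to ISL to write down affine access relations for a toy shared buffer: a writer partition that produces 64 elements of out and a reader partition that consumes a 3-element window per iteration. The relations are invented for the example, not taken from the paper:

```python
# Minimal sketch: read/write access relations for a shared buffer, in ISL notation.
import islpy as isl

# Writer iteration W[i] produces element out[i].
write_access = isl.Map("{ W[i] -> out[i] : 0 <= i < 64 }")

# Reader iteration R[j] consumes the window out[j], out[j+1], out[j+2].
read_access = isl.Map("{ R[j] -> out[a] : 0 <= j < 62 and j <= a <= j + 2 }")

print(write_access)
print(read_access)
```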
The read-after-write (RAW) safety constraint is formalized by computing, for each consumption iteration of a “reader” partition, the lexicographically maximal write iteration of the “writer” partition covering all elements it needs, $S = \operatorname{lexmax}(D)$ with $D = W^{-1} \circ R$, where $R$ and $W$ are the reader's and writer's access relations and $D$ is the relation describing valid (read, write) pairs over their indices. This ensures that a given iteration in a core's control FSM only activates once all data it depends on have been produced. The construction involves inverting and composing access relations and applying lexicographic maximization, a process handled by ISL, which can emit control programs as abstract syntax trees driving the LCU implementation.
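Continuing the toy example above, the sketch below carries out this construction with islpy: it inverts the writer's access relation, composes it with the reader's, and applies lexicographic maximization to obtain, for each reader iteration, the last writer iteration it must wait for:

```python
# Minimal sketch of the RAW dependency construction (same toy relations as above).
import islpy as isl

write_access = isl.Map("{ W[i] -> out[i] : 0 <= i < 64 }")
read_access = isl.Map("{ R[j] -> out[a] : 0 <= j < 62 and j <= a <= j + 2 }")

# dep maps each reader iteration R[j] to every writer iteration W[i] that
# produces an element it consumes: dep = W^-1 composed with R.
dep = read_access.apply_range(write_access.reverse())

# The reader may fire only after the lexicographically last such write.
sync = dep.lexmax()
print(sync)   # roughly { R[j] -> W[j + 2] : 0 <= j <= 61 }
```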
4. Optimization Strategies and Performance Evaluation
The elimination of inter-inference crossbar reconfiguration and the exploitation of pipeline parallelism across heterogeneous cores are the primary sources of performance enhancement. The compiler's static mapping and control logic generation enable fine-grained pipelining, overlapping downstream and upstream computations subject to buffer and dependency analyses.
Although detailed empirical performance numbers are not yet reported, the architecture is motivated by the expectation of significant speedup over traditional accelerators, in which weight reconfiguration and layer-by-layer execution introduce major latencies. The explicit partitioning of work and the maximization of on-chip data reuse (local memory) further contribute to energy and area efficiency.
Pipeline hazards are systematically avoided by polyhedral scheduling; topology-aware mapping (constraint formulation for memory and connectivity) ensures efficient usage of the available silicon and memory.
5. Comparison with Related Acceleration Pipelines
The CPM.cu acceleration model bears conceptual similarity to task-level pipelines used in MPSoC-based product cipher accelerators (Nawinne et al., 2014), where coarse-grained pipeline stages, resource-optimized cores, and shared-memory buffers achieve significant acceleration and resource balance (e.g., speedups of up to 4.45× in five-stage pipelines). The principal distinguishing features of the CPM.cu approach are its explicit computational memory substrate and its focus on dataflow orchestration for NN inference.
Compared with FPGA acceleration pipelines (e.g., the AutoAccel CPP microarchitecture (Cong et al., 2018)), CPM.cu similarly employs a pipeline of load, compute, and store stages, but with hardware primitives tailored for analog MxV, automated mapping and control synthesis via polyhedral models, and an architectural topology fixed at compile time to maximize inference throughput without runtime hardware reconfiguration.
6. Design Challenges and Ongoing Developments
Mapping large or irregular neural networks to a fixed topology of CM cores introduces intricate hardware-software co-design challenges. The largest impediments are:
- Limited local memory (kB-scale SRAM), which enforces strict bounds on tiling and buffer occupancy (see the sketch after this list),
- Static (compile-time) partitioning, as no dynamic crossbar reloading occurs across inference passes,
- Exact scheduling of control signals to minimize pipeline stalls.
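To make the first of these impediments concrete, the sketch below (with hypothetical tile sizes, channel counts, and a 16 kB SRAM budget) checks whether a double-buffered convolution tile fits within a core's local memory, the kind of bound the static partitioner must respect when choosing tile shapes:

```python
# Minimal sketch: SRAM footprint of a double-buffered convolution tile
# (all parameter values are hypothetical).
def tile_footprint_bytes(tile_h, tile_w, in_ch, out_ch, k, elem_bytes=1):
    # A k x k kernel needs a (tile_h + k - 1) x (tile_w + k - 1) input halo.
    in_tile = (tile_h + k - 1) * (tile_w + k - 1) * in_ch
    out_tile = tile_h * tile_w * out_ch
    return 2 * (in_tile + out_tile) * elem_bytes   # x2 for double buffering

SRAM_BYTES = 16 * 1024   # assumed 16 kB local SRAM
fp = tile_footprint_bytes(tile_h=8, tile_w=8, in_ch=16, out_ch=32, k=3)
print(fp, "bytes:", "fits" if fp <= SRAM_BYTES else "does not fit")
```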
These constraints raise open questions concerning scaling to very deep or highly irregular NN graphs and the integration of more sophisticated buffer management. A plausible implication is that subsequent developments may explore dynamic scheduling or hybrid memory arrangements to remedy the shortcomings of static partitioning.
Efforts are underway to quantify pipeline utilization, data reuse metrics, and realized throughput using simulation infrastructure and hardware prototypes (such as cmnnc).
7. Significance and Implications
The CPM.cu Acceleration Pipeline methodology demonstrates how formal models—specifically, the polyhedral model for dependency analysis and static scheduling—enable compiler-generated hardware controllers for dataflow accelerators based on computational memory. This approach leverages both architectural innovations (analog crossbars, explicit pipeline mapping) and formal compilation techniques (iteration space modeling, relation composition, lexmax) to expose maximal parallelism and minimize reconfiguration cost.
These principles generalize to other classes of streaming or cryptographic workloads, especially when task-level pipelining and crossbar or other memory-augmented processing units are available. The design and evaluation of such pipelines—balancing resource utilization, parallelism, and control complexity—remain central themes in the development of efficient hardware accelerators for emerging machine learning and cryptography workloads.