
CADC: Crossbar-Aware Dendritic Convolution

Updated 4 December 2025
  • CADC is an in-memory computing method that integrates a biologically inspired ReLU nonlinearity into crossbar operations to enhance partial sum sparsity.
  • It reduces buffering and computational overhead and mitigates ADC noise accumulation, leading to significant energy savings and speed improvements.
  • Empirical evaluations show that CADC delivers 11x–18x speedups and up to 22.9x energy-efficiency gains with minimal impact on classification accuracy.

Crossbar-Aware Dendritic Convolution (CADC) is an in-memory computing (IMC) technique for convolutional neural networks (CNNs) and spiking neural networks (SNNs) that introduces a biologically inspired nonlinearity at the level of crossbar-based partial sum (psum) generation. CADC addresses the system-level bottlenecks arising from partitioned convolutional layers across multiple crossbars by embedding a rectification function directly within each crossbar, thereby enhancing psum sparsity, reducing buffer and computational overhead, and minimizing signal degradation from analog-to-digital conversion (ADC) noise. Empirical evaluations demonstrate substantial system-level speedups and energy efficiency improvements, with negligible—sometimes positive—impact on classification accuracy (Dong et al., 27 Nov 2025).

1. Convolution Partitioning and the CADC Algorithm

Crossbar-based IMC architectures decompose convolutional layers into multiple segments due to size constraints. Given a convolutional weight tensor of shape $C_{\rm in}\times K_1 \times K_2 \times C_{\rm out}$, spatial and input-channel unrolling produces a $(C_{\rm in}K_1K_2)\times C_{\rm out}$ matrix. A crossbar of size $N\times N$ can only accommodate $N$ rows, requiring the input dimension to be partitioned into $S = \lceil C_{\rm in}K_1K_2/N \rceil$ segments.

In standard (vanilla) convolution (vConv), each segment $s$ produces a psum for each output channel $k$:

$$y[k] = \sum_{s=1}^S \sum_{i=1}^{N_s} w^s[i,k]\, x^s[i],$$

where $N_s \le N$. CADC introduces a dendritic nonlinearity, specifically rectification via ReLU, on each crossbar's output before accumulation:

$$f_{\rm dendrite}(x) = \max(0,x).$$

The accumulated output is then:

$$y[k] = \sum_{s=1}^S w^k[s]\, f_{\rm dendrite}\!\left( \sum_{i=1}^{N_s} w^s[i,k]\, x^s[i] \right).$$

Typically, $w^k[s]=1$, incurring no additional weight storage or computational cost. This zero-clamping function inside each crossbar outputs only non-negative psums to the next stage.
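The following Python sketch (not the authors' implementation; the layer shapes, random data, and the helper name `psums_per_segment` are illustrative assumptions) mimics the partitioning above and contrasts vConv accumulation with CADC's per-crossbar rectification:

```python
import numpy as np

rng = np.random.default_rng(0)

C_in, K1, K2, C_out = 64, 3, 3, 32       # illustrative conv layer dimensions
N = 256                                   # crossbar rows (N x N array)

rows = C_in * K1 * K2                     # unrolled input dimension (576 here)
S = int(np.ceil(rows / N))                # number of crossbar segments (3 here)

W = rng.standard_normal((rows, C_out))    # unrolled weight matrix
x = rng.standard_normal(rows)             # one unrolled input patch

def psums_per_segment(W, x, N, S):
    """Return the S x C_out matrix of per-crossbar partial sums z[s, k]."""
    z = np.zeros((S, W.shape[1]))
    for s in range(S):
        lo, hi = s * N, min((s + 1) * N, W.shape[0])
        z[s] = x[lo:hi] @ W[lo:hi]        # one crossbar's MAC result
    return z

z = psums_per_segment(W, x, N, S)

y_vconv = z.sum(axis=0)                   # vConv: accumulate raw psums
y_cadc = np.maximum(z, 0.0).sum(axis=0)   # CADC: ReLU each psum first (w^k[s] = 1)

print("segments S =", S)
print("max |vConv - CADC| gap:", np.abs(y_vconv - y_cadc).max())
```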

2. Psum Sparsity: Analysis and Implications

Let the pre-rectification psum for segment $s$ and channel $k$ be $z_{s,k} = \sum_i w^s[i,k]\, x^s[i]$. For layer $l$, psum sparsity is defined as:

$$p_l = \frac{\#\{(s,k) : z_{s,k} \le 0\}}{S \cdot C_{\rm out}}.$$

Averaging over output channels, $(1-p_l)S$ nonzero psums remain. This sparsity directly reduces buffer and transfer overhead (since fewer nonzero psums must be stored and moved) and also reduces accumulation overhead: zero entries can be skipped, yielding proportional reductions in cycles and energy.
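As a minimal illustration (random stand-in data, continuing the sketch above), $p_l$ and the surviving psum count per channel can be measured directly from the pre-rectification psum matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
S, C_out = 9, 32
z = rng.standard_normal((S, C_out))        # stand-in pre-rectification psums z[s, k]

p_l = np.mean(z <= 0.0)                    # fraction of psums CADC clamps to zero
surviving = (1.0 - p_l) * S                # nonzero psums left per output channel
print(f"p_l = {p_l:.2f}; ~{surviving:.1f} of {S} psums per channel survive")
```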

Empirically measured mean sparsities are:

  • LeNet-5 (MNIST): 80%
  • ResNet-18 (CIFAR-10): 54%
  • VGG-16 (CIFAR-100): 66%
  • SNN (DVS Gesture): 88%

Consequently, buffer and transfer energy reductions of up to 29.3% and accumulation energy reductions of 47.9% are achieved for ResNet-18 on CIFAR-10.

3. ADC Quantization Noise and Signal Integrity

In IMC, each ADC invocation introduces quantization error $\varepsilon$ with variance $\sigma_\varepsilon^2$. Conventional vConv accumulates this error across $S$ segments:

$$\sigma_{\rm total}^2 \approx S\, \sigma_\varepsilon^2.$$

Since CADC suppresses negative psums, only $(1-p_l)S$ terms contribute:

$$\sigma_{\rm CADC}^2 \approx (1-p_l)S\, \sigma_\varepsilon^2.$$

Thus, root-mean-square noise is reduced by a factor of $\sqrt{1-p_l}$. For instance, $p_l=0.54$ (ResNet-18) yields roughly one-third less accumulated RMS noise, correlating with minimal accuracy degradation: only a 0.1% top-1 drop under 4-bit ADC quantization.
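A quick Monte Carlo check of this scaling is sketched below; $p_l = 0.54$ comes from the ResNet-18 measurement, while $S$, $\sigma_\varepsilon$, and the independence of the per-conversion errors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
S, p_l, sigma_eps = 9, 0.54, 0.1           # p_l from the paper; S, sigma_eps assumed
trials = 100_000

eps = rng.normal(0.0, sigma_eps, size=(trials, S))   # per-conversion ADC noise
mask = rng.random((trials, S)) > p_l                  # psums surviving rectification
noise_vconv = eps.sum(axis=1)                         # all S errors accumulate
noise_cadc = (eps * mask).sum(axis=1)                 # only surviving psums contribute

print("RMS ratio (CADC / vConv):", noise_cadc.std() / noise_vconv.std())
print("sqrt(1 - p_l)           :", np.sqrt(1.0 - p_l))
```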

4. Experimental Results: Sparsity, Accuracy, and System Throughput

Sparsity and Classification Accuracy

CADC’s induced psum sparsity is highly correlated with downstream efficiency gains; it also has minimal impact on, or sometimes improves, classification accuracy. Measured statistics across various models and datasets are summarized below.

| Model–Dataset | Psum Sparsity | Accuracy Change (relative to vConv, best–worst) |
|---|---|---|
| LeNet-5 (MNIST) | 80% | +0.11% ~ +0.19% |
| ResNet-18 (CIFAR-10) | 54% | –0.04% ~ –0.27% |
| VGG-16 (CIFAR-100) | 66% | +0.99% ~ +1.60% |
| SNN (DVS Gesture) | 88% | –0.57% ~ +1.32% |

These results are consistent across crossbar sizes ($64\times64$ to $256\times256$). Even with aggressive negative psum pruning, CADC typically matches or exceeds vConv's accuracy.

System-Level Performance

For ResNet-18 on CIFAR-10 using a 65nm SRAM-based IMC macro:

  • Crossbar/MAC/ADC: 725 TOPS/W (4b I/O, 2b weight)
  • End-to-end throughput: 2.15 TOPS at 200 MHz
  • Energy efficiency: 40.8 TOPS/W (normalized to 65 nm, 1.1 V)
  • Speedup vs prior SRAM-IMC: $11\times$–$18\times$
  • Energy-efficiency improvement: $1.9\times$–$22.9\times$

5. Architectural Implementation

CADC is realized on a $256\times256$ twin-9T SRAM crossbar supporting ternary weights $\{-1,0,+1\}$ with decoupled read paths. The in-memory ADC (IMA) shares the bitcells, employing a ramp-based reference embedded in the conversion loop. The ReLU nonlinearity is enacted by control/timing of word lines: if the crossbar's output voltage does not cross the ADC ramp threshold, the output is zeroed.

Further, only nonzero psums (with a bitmask) are buffered post-ADC; downstream accumulator logic skips zeros by consulting the mask. The entire macro is compact (0.5 mm² at 65 nm), with the IMA occupying only 14.9% of the die, less than with SAR or conventional IMA-based ADCs.
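A behavioural sketch of this mask-and-skip bookkeeping (purely illustrative Python, not the macro's actual datapath) might look as follows:

```python
import numpy as np

rng = np.random.default_rng(3)
S, C_out = 9, 32
psums = np.maximum(rng.standard_normal((S, C_out)), 0.0)   # post-ReLU, post-ADC psums

mask = psums != 0.0                        # one bit per (segment, channel) psum
buffered = psums[mask]                     # compressed buffer: nonzeros only

# Accumulator walks the mask and pulls a value from the buffer only when set.
acc = np.zeros(C_out)
idx = 0
for s in range(S):
    for k in range(C_out):
        if mask[s, k]:
            acc[k] += buffered[idx]
            idx += 1

assert np.allclose(acc, psums.sum(axis=0))
print(f"buffered {mask.sum()} of {S * C_out} psums; "
      f"{100 * (1 - mask.mean()):.0f}% skipped")
```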

6. Design Trade-offs and Prospective Enhancements

Hardware overhead for CADC is minimal: the crossbar and ADC bitcells are reused, with the zero-mask and skip-control logic imposing negligible area or power penalty. The choice of $f_{\rm dendrite}$ is model-dependent: ReLU is optimal for typical CNNs, while $\sqrt{x}$ provided improvement for SNNs. CADC's psum sparsity is synergistic with any zero-compression or sparse accumulation scheme and orthogonal to weight quantization or global pruning. A small illustration of swapping the dendritic function per model family is given below.
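In the sketch, restricting $\sqrt{x}$ to the clamped, non-negative psums is an assumption made here so the function stays real-valued on negative inputs:

```python
import numpy as np

def f_relu(z):
    """Default dendritic nonlinearity for CNNs."""
    return np.maximum(z, 0.0)

def f_sqrt(z):
    """Square-root variant reported to help SNNs (applied here to the clamped psum)."""
    return np.sqrt(np.maximum(z, 0.0))

z = np.array([-1.2, 0.0, 0.25, 4.0])       # example pre-rectification psums
print(f_relu(z))   # [0.   0.   0.25 4.  ]
print(f_sqrt(z))   # [0.   0.   0.5  2.  ]
```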

Potential future directions include learned or adaptive dendritic nonlinearities, finer-grained skipping mechanisms (e.g., per-bit), and expansion to alternative memory technologies such as RRAM or to other layer types (e.g., depthwise convolutions, transformer attention mechanisms). A plausible implication is that the core CADC principle could generalize to broader classes of in-memory compute and neuromorphic architectures.

7. Summary and Significance

Crossbar-Aware Dendritic Convolution leverages a biologically inspired ReLU-like nonlinearity applied at the granularity of crossbar-generated partial sums. By eliminating negative psums in-situ, CADC maximizes psum sparsity, reduces interconnect and accumulation workload, and mitigates ADC-related noise accumulation. These improvements translate to substantial empirical gains in throughput and energy efficiency, with robust model accuracy on representative benchmarks (Dong et al., 27 Nov 2025). The approach is compatible with standard digital and mixed-signal crossbars, incurs negligible hardware overhead, and is extensible to multiple network architectures and datasets.

References (1)
