Zero-Memory-Overhead Direct Convolutions
- Zero-memory-overhead direct convolutions are algorithms that compute outputs directly from input and filter tensors, eliminating the need for large intermediate buffers.
- They use techniques like loop tiling, register blocking, and vectorized microkernels to optimize memory usage and enhance arithmetic intensity.
- Empirical studies demonstrate 2–3× speedups over traditional im2col+GEMM methods, making them ideal for memory-constrained and high-performance environments.
Zero-memory-overhead direct convolution refers to a class of algorithms for evaluating convolution operations, which are central to deep neural networks, without the extra memory footprint characteristic of traditional “indirect” approaches such as im2col+GEMM. These direct strategies aim to maximize arithmetic intensity, exploit vectorization, and use the cache hierarchy efficiently while entirely eliminating large intermediate packing buffers. The performance advantage is especially pronounced in memory-constrained or bandwidth-limited environments such as embedded CPUs, and for large-scale inference on general-purpose architectures (Ofir et al., 2024, Ferrari et al., 2023, Zhang et al., 2018, Georganas et al., 2018).
1. Motivation and Limitations of Indirect Convolutions
Indirect convolution methods, most famously im2col+GEMM, transform the convolution into a matrix multiplication by “lowering” patches of the input tensor into columns, yielding a buffer of shape $(C_{in} K_h K_w) \times (H_{out} W_{out})$ for an input of shape $C_{in} \times H \times W$ and a kernel of $K_h \times K_w$, where $H_{out} = H - K_h + 1$, $W_{out} = W - K_w + 1$ (unit stride, no padding). This duplication of overlapping regions inflates memory requirements by a factor of roughly $K_h K_w$, significantly exceeding the original input tensor size and often exceeding available L2/L3 cache capacity (Ofir et al., 2024).
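To make the duplication concrete, the following is a minimal pure-Python sketch of the lowering step, assuming unit stride and no padding. The function name `im2col` and the nested-list tensor layout are illustrative choices, not taken from any of the cited implementations.

```python
def im2col(x, kh, kw):
    """Lower a C x H x W nested-list tensor into a (C*kh*kw) x (Ho*Wo) matrix.

    Assumes unit stride and no padding ("valid" convolution).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    ho, wo = h - kh + 1, w - kw + 1
    rows = []
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, kernel offset): the input shifted by (i, j).
                rows.append([x[ci][y + i][xx + j]
                             for y in range(ho) for xx in range(wo)])
    return rows


# Toy 1 x 6 x 6 input lowered for a 3 x 3 kernel.
x = [[[float(r * 6 + col) for col in range(6)] for r in range(6)]]
cols = im2col(x, 3, 3)
input_elems = 1 * 6 * 6
buffer_elems = len(cols) * len(cols[0])
print(len(cols), len(cols[0]), buffer_elems / input_elems)
```

On this toy input the buffer holds 144 elements against 36 in the input. The ratio is below 9 only because the borders of such a small image are covered by fewer windows; it approaches the full kernel area as the image grows.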
Cache inefficiency emerges from the irregular aspect ratio (“skinny” matrices) of the im2col buffer, which leads to under-utilization of cache lines and poor prefetcher performance. Each matrix column is read by GEMM once, whereas output storage is random in cache, further reducing locality and prefetch efficiency. These factors contribute to bandwidth-bound performance, particularly on CPU architectures (Ofir et al., 2024, Zhang et al., 2018, Ferrari et al., 2023).
2. Fundamentals of Zero-Memory-Overhead Direct Convolutions
Direct convolution approaches sidestep the im2col buffer by computing output elements directly from the input and filter tensors, maintaining strict locality and avoiding any large intermediate allocations. Typical strategies include reordering loop nests, register and cache blocking, vectorized microkernels, and “on-demand” packing of small tiles, always restricted to cache-resident working sets (Ofir et al., 2024, Ferrari et al., 2023, Zhang et al., 2018).
Mathematical Formulation
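For an input tensor $X \in \mathbb{R}^{C_{in} \times H \times W}$ and a kernel $F \in \mathbb{R}^{C_{out} \times C_{in} \times K_h \times K_w}$, the output at unit stride is:

$$O[c_o, y, x] = \sum_{c_i=0}^{C_{in}-1} \sum_{k_h=0}^{K_h-1} \sum_{k_w=0}^{K_w-1} X[c_i,\, y + k_h,\, x + k_w] \; F[c_o, c_i, k_h, k_w]$$

A direct approach preserves this structure, emphasizing strategic buffering (registers, per-tile cache) rather than global packing (Ofir et al., 2024, Zhang et al., 2018).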
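As a reference point, the definition can be written as a plain loop nest that touches only the input, the filter, and the output. This is a minimal sketch assuming unit stride, no padding, and nested-list tensors; the name `direct_conv` is illustrative.

```python
def direct_conv(x, w):
    """Naive direct convolution (unit stride, no padding).

    x: C_in x H x W, w: C_out x C_in x kh x kw, result: C_out x Ho x Wo.
    No intermediate buffer is allocated beyond the output itself.
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    ho, wo = h - kh + 1, wd - kw + 1
    out = [[[0.0] * wo for _ in range(ho)] for _ in range(c_out)]
    for co in range(c_out):
        for ci in range(c_in):
            for i in range(kh):
                for j in range(kw):
                    wt = w[co][ci][i][j]
                    for y in range(ho):
                        for xx in range(wo):
                            out[co][y][xx] += wt * x[ci][y + i][xx + j]
    return out


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]  # 1 x 3 x 3
w = [[[[1.0, 1.0], [1.0, 1.0]]]]                            # 1 x 1 x 2 x 2
print(direct_conv(x, w))  # [[[12.0, 16.0], [24.0, 28.0]]]
```

The optimized variants discussed in this article reorder and block these same loops; they do not change what is computed.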
3. Algorithmic Techniques and Implementation
Several concrete algorithmic paradigms have emerged, with differences in how they block, vectorize, and tile the convolutional computation:
SMM-Conv: Scalar Matrix Multiplication with Zero Packing
In “SMM-Conv,” the key innovation is to decompose the convolution into a sequence of scalar–matrix multiplications over contiguous “slices” of the input tensor. Rather than packing all patches, a small buffer holds one sliced region at a time. For each vertical and horizontal kernel offset, the valid output-sized submatrix is multiply-accumulated, which requires only pointer offsetting and lets the inner width loop use fast SIMD FMA instructions. The buffer is reused for every vertical kernel offset before it is refilled, so the memory overhead stays extremely low (essentially the single slice buffer) (Ofir et al., 2024).
Key steps:
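- For each input channel and horizontal kernel offset, extract the corresponding input slice into a small contiguous buffer.
- For each vertical kernel offset, advancing a pointer into that buffer yields the matching output-sized submatrix without any copying.
- Each submatrix is multiplied by the scalar kernel weight at that offset and accumulated into the output.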
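The steps above can be sketched in pure Python as follows. This is an illustration of the idea under stated assumptions (unit stride, no padding, nested-list tensors), not the authors' implementation; the name `smm_conv` and the loop order are choices made here, and the innermost loop stands in for what would be a SIMD FMA loop in native code.

```python
def smm_conv(x, w):
    """Scalar-matrix-multiplication convolution sketch (unit stride, no padding).

    x: C_in x H x W, w: C_out x C_in x kh x kw, result: C_out x Ho x Wo.
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    ho, wo = h - kh + 1, wd - kw + 1
    out = [[0.0] * (ho * wo) for _ in range(c_out)]  # flat output per channel
    for ci in range(c_in):
        for j in range(kw):
            # Slice buffer: columns j .. j+wo-1 of every input row, contiguous.
            buf = [x[ci][r][j + c] for r in range(h) for c in range(wo)]
            for co in range(c_out):
                acc = out[co]
                for i in range(kh):
                    wt = w[co][ci][i][j]  # scalar kernel weight
                    base = i * wo         # "pointer advance" by i buffer rows
                    for t in range(ho * wo):
                        acc[t] += wt * buf[base + t]
    # Reshape each flat channel back to Ho x Wo.
    return [[ch[y * wo:(y + 1) * wo] for y in range(ho)] for ch in out]


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
w = [[[[1.0, 2.0], [3.0, 4.0]]]]
print(smm_conv(x, w))  # [[[37.0, 47.0], [67.0, 77.0]]]
```

Because the slice keeps only the columns needed for one horizontal offset, the submatrix for any vertical offset is a contiguous run inside the buffer, which is why a pointer offset suffices. The only extra storage is `buf`, whose size is one input channel's height times the output width.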
CSA/CSO (SConv) and Vector-Based Packing
“SConv” introduces a code-generation and tiling analysis pipeline: Convolution Slicing Analysis (CSA) systematically determines the maximal tile sizes that fit in L1 (and recursively, L2/L3) cache. Its Convolution Slicing Optimization (CSO) stage emits a deeply nested loop macro-kernel with explicit packing for only those input/filter tiles about to be used, always under the L1 threshold. Vector-Based Packing (VBP) leverages hardware shift/permutation instructions (e.g., AVX-512, POWER10 VSX) to exploit the overlap between neighboring windows in stride-1 convolutions, further minimizing redundant data movement and eliminating the need for large packed matrices (Ferrari et al., 2023).
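The idea of choosing tile sizes from a cache budget can be illustrated with a brute-force search. This is a hypothetical stand-in, not the CSA algorithm from the paper: the footprint model, the 32 KiB L1 size, the fixed output-channel block of 8, and the names `tile_footprint` and `pick_tile` are all assumptions made for the sketch.

```python
def tile_footprint(c_in, kh, kw, th, tw, co_block=8, bytes_per_elem=4):
    """Bytes touched by one tile: input patch + filter block + output tile."""
    in_tile = c_in * (th + kh - 1) * (tw + kw - 1)
    w_tile = co_block * c_in * kh * kw
    out_tile = co_block * th * tw
    return bytes_per_elem * (in_tile + w_tile + out_tile)


def pick_tile(c_in, kh, kw, l1_bytes=32 * 1024, max_dim=64,
              co_block=8, bytes_per_elem=4):
    """Return the (tile_h, tile_w) with the largest area that fits in l1_bytes.

    Returns None when even a 1 x 1 tile does not fit.
    """
    best = None
    for th in range(1, max_dim + 1):
        for tw in range(1, max_dim + 1):
            fits = tile_footprint(c_in, kh, kw, th, tw,
                                  co_block, bytes_per_elem) <= l1_bytes
            if fits and (best is None or th * tw > best[0] * best[1]):
                best = (th, tw)
    return best


tile = pick_tile(c_in=16, kh=3, kw=3)
print(tile, tile_footprint(16, 3, 3, tile[0], tile[1]))
```

A production analysis would also account for cache associativity, the register tile of the microkernel, and the L2/L3 levels, which this sketch ignores.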
Loop/Blocking/Raster Strategies
Traditional direct convolution algorithms optimize via:
- Spatial blocking: Output tiles (height × width) that fit in L1/L2 (Zhang et al., 2018).
- Register blocking: Output-channel blocks matched to the SIMD width.
- Depth-first kernel application: Inner loops over kernel/channel are unrolled for maximum FMA utilization.
- Data layout: “Channels-last” and blocked layouts enable unit stride for memory-efficient vectorization (Georganas et al., 2018).
Implementation is accomplished either manually (hand-written SIMD microkernels) or via dynamic code generation (JIT), with inner-most loops engineered to reside in registers or L1 at all times (Georganas et al., 2018).
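The blocking strategies listed above can be combined in a single loop nest. The following pure-Python sketch assumes unit stride and no padding; the tile sizes, the name `tiled_direct_conv`, and the use of a small Python list as a stand-in for a register block are illustrative assumptions.

```python
def tiled_direct_conv(x, w, tile_h=2, tile_w=4, co_block=2):
    """Direct convolution with spatial and output-channel blocking.

    x: C_in x H x W, w: C_out x C_in x kh x kw, result: C_out x Ho x Wo.
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    ho, wo = h - kh + 1, wd - kw + 1
    out = [[[0.0] * wo for _ in range(ho)] for _ in range(c_out)]
    for co0 in range(0, c_out, co_block):          # output-channel blocks
        nb = min(co_block, c_out - co0)
        for y0 in range(0, ho, tile_h):            # spatial tiles
            for x0 in range(0, wo, tile_w):
                for y in range(y0, min(y0 + tile_h, ho)):
                    for xx in range(x0, min(x0 + tile_w, wo)):
                        # "Register block": one accumulator per channel in the block.
                        acc = [0.0] * nb
                        for ci in range(c_in):
                            for i in range(kh):
                                for j in range(kw):
                                    v = x[ci][y + i][xx + j]  # loaded once, reused nb times
                                    for b in range(nb):
                                        acc[b] += v * w[co0 + b][ci][i][j]
                        for b in range(nb):
                            out[co0 + b][y][xx] = acc[b]
    return out


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
w = [[[[1.0, 1.0], [1.0, 1.0]]], [[[1.0, 2.0], [3.0, 4.0]]]]
print(tiled_direct_conv(x, w))
```

In native code the innermost loop over `b` would map to one SIMD FMA across output channels, which is why blocked or channels-last layouts that make those weights contiguous matter.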
4. Memory Complexity, Arithmetic Intensity, and Roofline Analysis
Memory Footprint
The memory requirement is restricted to the input, kernel, and output tensors, plus (at most) cache-resident working tiles:
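- im2col+GEMM: $O(C_{in} K_h K_w H_{out} W_{out})$ extra elements for the lowered buffer, where $C_{in}$ is the input channel count, $K_h \times K_w$ the kernel size, and $H_{out} \times W_{out}$ the output size
- Zero-overhead direct: a single slice buffer (SMM-Conv) or $O$(small L1-sized buffers + outputs) (SConv)
Because the lowered buffer replicates each input element up to $K_h K_w$ times, it alone is roughly $K_h K_w$ times the size of the input (about 9 times for a 3×3 kernel at unit stride), so the zero-overhead direct approach yields substantial savings on modern convolutional layers (Ofir et al., 2024, Ferrari et al., 2023).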
Arithmetic Intensity
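Direct methods maximize reuse. Counting one multiply and one add per kernel tap, and only the compulsory traffic for the three tensors, the ideal intensity of a unit-stride layer is

$$\mathrm{AI} = \frac{2\, C_{out} C_{in} K_h K_w H_{out} W_{out}}{\mathrm{bytes}(\text{input}) + \mathrm{bytes}(\text{kernel}) + \mathrm{bytes}(\text{output})}$$

Direct methods approach roofline limits, being either compute- or bandwidth-bound depending on the machine's ratio of peak FLOP rate to memory bandwidth. Empirical intensities for, e.g., ResNet-50 3×3 convolutions are reported in (Georganas et al., 2018).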
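A back-of-the-envelope calculation shows how the lowered buffer affects intensity. The traffic model below is an assumption made for illustration: each tensor is moved exactly once, and im2col additionally writes and then reads its buffer once. The function names, the example layer shape, and the machine numbers are hypothetical.

```python
def conv_arithmetic_intensity(c_in, c_out, h, w, kh, kw, bytes_per_elem=4):
    """Return (direct, im2col) FLOP/byte under a compulsory-traffic model.

    Assumes unit stride and no padding.
    """
    ho, wo = h - kh + 1, w - kw + 1
    flops = 2 * c_out * c_in * kh * kw * ho * wo  # one multiply + one add per tap
    tensors = c_in * h * w + c_out * c_in * kh * kw + c_out * ho * wo
    direct_bytes = bytes_per_elem * tensors
    buffer_bytes = bytes_per_elem * c_in * kh * kw * ho * wo
    im2col_bytes = direct_bytes + 2 * buffer_bytes  # buffer written once, read once
    return flops / direct_bytes, flops / im2col_bytes


def roofline(intensity, peak_flops, bandwidth):
    """Attainable FLOP/s: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * bandwidth)


# Hypothetical layer: 64 -> 64 channels, 58 x 58 input, 3 x 3 kernel.
direct_ai, im2col_ai = conv_arithmetic_intensity(64, 64, 58, 58, 3, 3)
# Hypothetical machine: 1 TFLOP/s peak, 100 GB/s memory bandwidth.
print(direct_ai, roofline(direct_ai, 1e12, 1e11))
print(im2col_ai, roofline(im2col_ai, 1e12, 1e11))
```

Real kernels move more data than this model assumes whenever tiles do not fit in cache, so these figures are upper bounds on intensity rather than predictions.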
5. Experimental Results and Performance
The elimination of memory overhead translates into demonstrable performance gains:
| Model | im2col+GEMM | SMM-Conv | SConv (mean) | Direct SIMD JIT | Observed Speedup |
|---|---|---|---|---|---|
| AlexNet | 0.4608 s | 0.1348 s | N/A | N/A | ≈3.4× (Ofir et al., 2024) |
| VGG | 2.3670 s | 1.3535 s | N/A | N/A | ≈1.75× (Ofir et al., 2024) |
| YoloV3 | 0.4478 s | 0.2889 s | N/A | N/A | ≈1.55× (Ofir et al., 2024) |
| ONNXNet (x86 mean) | N/A | N/A | 7–8 faster | N/A | 9–0 packing (Ferrari et al., 2023) |
| Skylake-SP | 1.2 TF/s (im2col) | N/A | N/A | 3.1 TF/s (direct) | ≈2.6× (Georganas et al., 2018) |
Direct convolution consistently reduces both intermediate memory footprint and DRAM traffic, delivering 2–3× end-to-end speedups for inference tasks and maintaining high scaling even as thread count increases (Ofir et al., 2024, Zhang et al., 2018, Ferrari et al., 2023, Georganas et al., 2018). Removal of the im2col stage increases the relative benefit as FMA throughput grows (e.g., on POWER10 MMA) (Ferrari et al., 2023).
6. Practical Considerations and Hardware Aspects
Cache Optimizations
Zero-memory-overhead approaches structure loops and tile sizes to guarantee that transient working sets reside in L1 or L2. The compact buffers and register blocks are reused multiple times before each refill, maximizing locality (Ofir et al., 2024).
Multi-threading
Task parallelism is exposed in the output-channel, batch, or spatial tile dimensions. Thread-private buffers avoid false sharing, and no locks are required since output slices are disjoint. Scaling remains nearly linear until core-saturation (Ofir et al., 2024, Zhang et al., 2018).
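The lock-free partitioning described above can be sketched with the standard-library thread pool, splitting work over output channels. The helper names are illustrative, and the sketch assumes unit stride and no padding. Note that CPython's global interpreter lock prevents these threads from running the arithmetic in parallel, so the example shows only the decomposition, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def conv_one_output_channel(x, w_co):
    """Compute one output channel; x: C_in x H x W, w_co: C_in x kh x kw."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(w_co[0]), len(w_co[0][0])
    ho, wo = h - kh + 1, wd - kw + 1
    out = [[0.0] * wo for _ in range(ho)]  # private to this task
    for ci in range(c_in):
        for i in range(kh):
            for j in range(kw):
                wt = w_co[ci][i][j]
                for y in range(ho):
                    for xx in range(wo):
                        out[y][xx] += wt * x[ci][y + i][xx + j]
    return out


def parallel_conv(x, w, workers=4):
    """One task per output channel; outputs are disjoint, so no locks are needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, i.e. output-channel order.
        return list(pool.map(lambda w_co: conv_one_output_channel(x, w_co), w))


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
w = [[[[1.0, 1.0], [1.0, 1.0]]], [[[1.0, 2.0], [3.0, 4.0]]]]
print(parallel_conv(x, w))
```

The input `x` is shared read-only, and each task allocates its own output, which mirrors the thread-private buffers mentioned above.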
ISA Extensions and Vectorization
Advanced SIMD instructions (e.g., FMA, vector shifts) are leveraged for inner-loop fusion and input-tile packing, with special handling for stride-1 when hardware supports vector-register serial shifting (e.g., POWER10 VSX, x86 AVX-512). Code generation can elide all loop-boundary checks and branch overhead (Ferrari et al., 2023, Georganas et al., 2018).
7. Limitations, Extensions, and Open Problems
Zero-memory-overhead direct convolution is most effective under:
- Inference scenarios (filter repacking is trivial at compile time; training requires on-the-fly tiling) (Ferrari et al., 2023).
- Moderate to large output channel counts (for sufficient register and FMA utilization) (Ofir et al., 2024).
- Stride-1 or small strides (for optimal VBP use); stride > 1 requires scalar loads or microtiles (Ferrari et al., 2023).
Current SMM-Conv and SConv algorithms are written for “valid” convolution; extension to “same” padding involves lightweight zero-insertion during extraction. Hybridization with Winograd or FFT may be optimal for large kernels or highly compute-bound layers. Generalization to GPU architectures, backward-pass convolutions, and autotuning of blocking parameters remain active areas of investigation (Ofir et al., 2024, Ferrari et al., 2023, Zhang et al., 2018).
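One way to picture zero-insertion during extraction is to fill the slice buffer with zeros wherever the window leaves the input, then proceed exactly as in the unpadded case. The sketch below is a hypothetical single-channel illustration assuming unit stride and odd kernel sizes; the name `smm_conv_same` is not from the cited papers.

```python
def smm_conv_same(x_ch, k):
    """Single-channel "same" convolution via zero-insertion during slice extraction.

    x_ch: H x W, k: kh x kw with odd kh and kw; result: H x W.
    """
    h, wd = len(x_ch), len(x_ch[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [0.0] * (h * wd)
    for j in range(kw):
        # Slice buffer with (h + 2*ph) rows; out-of-range positions become zeros.
        buf = []
        for r in range(-ph, h + ph):
            for c in range(wd):
                col = c + j - pw
                inside = 0 <= r < h and 0 <= col < wd
                buf.append(x_ch[r][col] if inside else 0.0)
        for i in range(kh):
            wt = k[i][j]
            base = i * wd  # pointer advance by i buffer rows
            for t in range(h * wd):
                out[t] += wt * buf[base + t]
    return [out[y * wd:(y + 1) * wd] for y in range(h)]


x_ch = [[1.0, 2.0], [3.0, 4.0]]
ones = [[1.0] * 3 for _ in range(3)]
print(smm_conv_same(x_ch, ones))  # [[10.0, 10.0], [10.0, 10.0]]
```

The padded input is never materialized as a whole; only the per-offset slice buffer grows, by `kh - 1` rows.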
References
- "SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution" (Ofir et al., 2024)
- "Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions" (Ferrari et al., 2023)
- "High Performance Zero-Memory Overhead Direct Convolutions" (Zhang et al., 2018)
- "Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures" (Georganas et al., 2018)