Zero-Memory-Overhead Direct Convolutions

Updated 8 May 2026

Zero-memory-overhead direct convolutions are algorithms that compute outputs directly from input and filter tensors, eliminating the need for large intermediate buffers.
They use techniques like loop tiling, register blocking, and vectorized microkernels to optimize memory usage and enhance arithmetic intensity.
Empirical studies report speedups of roughly 1.5× to 3.4× over traditional im2col+GEMM methods in the measurements cited here, which makes these algorithms well suited to memory-constrained and high-performance environments.

Zero-memory-overhead direct convolution refers to a class of algorithms that evaluate convolution operations, which are central to deep neural networks, without the extra memory footprint characteristic of traditional “indirect” approaches such as im2col+GEMM. These direct strategies aim to maximize arithmetic intensity, exploit vectorization, and use the cache hierarchy efficiently while eliminating large intermediate packing buffers. The performance advantage is most pronounced in memory-constrained or bandwidth-limited settings such as embedded CPUs, and for large-scale inference on general-purpose architectures (Ofir et al., 2024, Ferrari et al., 2023, Zhang et al., 2018, Georganas et al., 2018).

1. Motivation and Limitations of Indirect Convolutions

Indirect convolution methods, most famously im2col+GEMM, transform the convolutional operation into a matrix multiplication by “lowering” patches of input tensors into columns, yielding a buffer of shape $(c_{in} \cdot k_h \cdot k_w) \times (h' \cdot w')$ for input of shape $(c_{in} \times h \times w)$ and kernel of shape $(c_{out} \times c_{in} \times k_h \times k_w)$, where $h' = h - k_h + 1$ and $w' = w - k_w + 1$. Because overlapping regions are duplicated, the buffer holds $c_{in} k_h k_w h' w'$ elements against $c_{in} h w$ for the input itself, an inflation of roughly $k_h k_w$ that often exceeds available L2/L3 cache capacity (Ofir et al., 2024).
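To make the duplication concrete, here is a minimal pure-Python sketch of the lowering step for a stride-1 "valid" convolution. The function name `im2col`, the nested-list tensor representation, and the toy sizes are assumptions chosen for illustration; they are not taken from any of the cited implementations.

```python
def im2col(x, k_h, k_w):
    """Lower a (c_in, h, w) nested-list tensor into the im2col matrix.

    Returns c_in*k_h*k_w rows, each with h'*w' entries, for a
    stride-1 "valid" convolution.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    h_out, w_out = h - k_h + 1, w - k_w + 1
    rows = []
    for c in range(c_in):
        for u in range(k_h):
            for v in range(k_w):
                # One row per (channel, kernel offset): the input shifted by (u, v).
                rows.append([x[c][i + u][j + v]
                             for i in range(h_out) for j in range(w_out)])
    return rows


# Hypothetical toy input: 2 channels of 5x5, lowered for a 3x3 kernel.
x = [[[float(c * 25 + i * 5 + j) for j in range(5)] for i in range(5)]
     for c in range(2)]
cols = im2col(x, 3, 3)
buffer_elems = len(cols) * len(cols[0])  # (2*3*3) rows times (3*3) columns
input_elems = 2 * 5 * 5
print(buffer_elems, input_elems, buffer_elems / input_elems)
```

On this tiny input the buffer is only about three times the input because $h'w'$ is much smaller than $hw$; for realistic image sizes the ratio approaches $k_h k_w$.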

Cache inefficiency arises from the irregular aspect ratio (“skinny” matrices) of the im2col buffer, which leads to under-utilization of cache lines and poor prefetcher performance. GEMM reads each matrix column only once, so the packed data sees little reuse, and output writes are scattered across the cache, which further reduces locality and prefetch efficiency. These factors make performance bandwidth-bound, particularly on CPU architectures (Ofir et al., 2024, Zhang et al., 2018, Ferrari et al., 2023).

2. Fundamentals of Zero-Memory-Overhead Direct Convolutions

Direct convolution approaches sidestep the im2col buffer by computing output elements directly from input and filter tensors, maintaining strict locality and avoiding any large intermediate allocations. Typical strategies include reordering loop nests, register and cache blocking, vectorized microkernels, and “on-demand” packing of small tiles, with working sets always restricted to what fits in cache (Ofir et al., 2024, Ferrari et al., 2023, Zhang et al., 2018).

Mathematical Formulation

For input tensor $X \in \mathbb{R}^{c_{in} \times h \times w}$ and kernel $W \in \mathbb{R}^{c_{out} \times c_{in} \times k_h \times k_w}$, the output at output channel $o$ and spatial position $(i, j)$ is:

$Y_{o,i,j} = \sum_{c=1}^{c_{in}} \sum_{u=1}^{k_h} \sum_{v=1}^{k_w} X_{c,\,i+u-1,\,j+v-1} \, W_{o,\,c,\,u,\,v}$

with $1 \le o \le c_{out}$, $1 \le i \le h - k_h + 1$, and $1 \le j \le w - k_w + 1$. A direct approach preserves this structure, relying on strategic buffering (registers, per-tile cache) rather than global packing (Ofir et al., 2024, Zhang et al., 2018).
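The formula translates directly into a six-deep loop nest. The following unoptimized reference is a sketch using nested Python lists and 0-based indices; the name `direct_conv` and the toy data are assumptions for illustration only.

```python
def direct_conv(x, wgt):
    """Stride-1 "valid" convolution computed straight from the definition.

    x   : nested list indexed [c][row][col]
    wgt : nested list indexed [o][c][u][v]
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out, k_h, k_w = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    h_out, w_out = h - k_h + 1, w - k_w + 1
    y = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                acc = 0.0
                for c in range(c_in):
                    for u in range(k_h):
                        for v in range(k_w):
                            acc += x[c][i + u][j + v] * wgt[o][c][u][v]
                y[o][i][j] = acc
    return y


# Toy example: one 3x3 channel and a single 2x2 all-ones filter.
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
wgt = [[[[1.0, 1.0], [1.0, 1.0]]]]
print(direct_conv(x, wgt))  # [[[12.0, 16.0], [24.0, 28.0]]]
```

No buffer other than the output is allocated; the techniques in the next section reorder and block these same loops.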

3. Algorithmic Techniques and Implementation

Several concrete algorithmic paradigms have emerged, with differences in how they block, vectorize, and tile the convolutional computation:

SMM-Conv: Scalar Matrix Multiplication with Zero Packing

In “SMM-Conv,” the key innovation is to decompose the convolution into a sequence of scalar–matrix multiplications over contiguous “slices” of the input tensor. Rather than packing all patches, a small buffer $B \in \mathbb{R}^{h \times w'}$ holds one sliced region at a time. For each vertical and horizontal kernel offset, a valid $h' \times w'$ submatrix is multiplied by a scalar weight and accumulated, which requires only pointer offsetting and lets the inner width loop use fast SIMD FMA instructions. Each fill of $B$ is reused $k_h \cdot c_{out}$ times (once per vertical offset and output channel) before refilling, so the memory overhead is very low (essentially the $h \times w'$ elements of $B$) (Ofir et al., 2024).

Key steps:

For each input channel $c$ and horizontal offset $v$, extract the $h \times w'$ column slice $X_{c,\,:,\,v:v+w'-1}$ into $B$.
For each vertical offset $u$, a pointer advance into $B$ yields the $h' \times w'$ slice $B_{u:u+h'-1,\,:}$.
Each slice is multiplied by the scalar kernel weight $W_{o,c,u,v}$ and accumulated into the output plane $Y_o$ for output channel $o$.
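The steps above can be sketched in pure Python as follows. This is an illustration of the loop structure only, not the authors' implementation: the name `smm_conv`, the nested-list tensors, and the use of list slicing in place of a reused buffer and pointer arithmetic are all assumptions.

```python
def smm_conv(x, wgt):
    """Stride-1 "valid" convolution as a sequence of scalar-times-slice updates."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out, k_h, k_w = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    h_out, w_out = h - k_h + 1, w - k_w + 1
    y = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for c in range(c_in):
        for v in range(k_w):
            # Step 1: extract columns v .. v+w_out-1 of channel c into the
            # small h x w_out buffer (a real kernel would refill one buffer).
            buf = [row[v:v + w_out] for row in x[c]]
            for u in range(k_h):
                # Step 2: the slice for vertical offset u is rows u .. u+h_out-1
                # of buf; in C this is only a pointer advance, no copy.
                for o in range(c_out):
                    s = wgt[o][c][u][v]  # Step 3: one scalar weight
                    for i in range(h_out):
                        src, dst = buf[u + i], y[o][i]
                        for j in range(w_out):  # SIMD-friendly inner loop
                            dst[j] += s * src[j]
    return y


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
wgt = [[[[1.0, 1.0], [1.0, 1.0]]]]
print(smm_conv(x, wgt))  # [[[12.0, 16.0], [24.0, 28.0]]]
```

The only allocation besides the output is `buf`, which has $h \times w'$ elements regardless of the number of channels or the kernel size.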

CSA/CSO (SConv) and Vector-Based Packing

“SConv” introduces a code-generation and tiling analysis pipeline: Convolution Slicing Analysis (CSA) systematically determines the maximal tile sizes that fit in L1 (and recursively, L2/L3) cache. Its Convolution Slicing Optimization (CSO) stage emits a deeply nested loop macro-kernel with explicit packing for only those input/filter tiles about to be used, always under the L1 capacity (typically a few tens of KB). Vector-Based Packing (VBP) uses hardware shift/permutation instructions (e.g. AVX-512, POWER10 VSX) to exploit the overlap between neighboring windows in stride-1 convolutions, further reducing redundant data movement and eliminating the need for large packed matrices (Ferrari et al., 2023).
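As a rough illustration of the kind of capacity check such an analysis performs, the sketch below searches for the largest output tile whose input, filter, and output working set fits a byte budget. It is a toy stand-in, not the CSA algorithm from the paper: the function name `pick_output_tile`, the cost model, the 32 KB default, and the search bound are all hypothetical.

```python
def pick_output_tile(c_in_blk, c_out_blk, k_h, k_w,
                     l1_bytes=32 * 1024, elem_bytes=4, max_tile=64):
    """Return the (tile_h, tile_w) of largest area that fits the budget, or None.

    Assumed cost model (hypothetical): one input tile with halo, one filter
    block, and one output tile must all be resident at the same time.
    """
    best = None
    filt = c_out_blk * c_in_blk * k_h * k_w
    for th in range(1, max_tile + 1):
        for tw in range(1, max_tile + 1):
            inp = c_in_blk * (th + k_h - 1) * (tw + k_w - 1)
            out = c_out_blk * th * tw
            if (inp + filt + out) * elem_bytes > l1_bytes:
                continue
            if best is None or th * tw > best[0] * best[1]:
                best = (th, tw)
    return best


# Hypothetical block sizes: 8 input channels, 8 output channels, 3x3 kernel.
print(pick_output_tile(8, 8, 3, 3))
```

A production analysis would also account for cache associativity, the register microkernel shape, and the L2/L3 levels, none of which this sketch models.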

Loop/Blocking/Raster Strategies

Traditional direct convolution algorithms optimize via:

Spatial blocking: Output tiles (height $\times$ width) that fit in L1/L2 (Zhang et al., 2018).
Register blocking: Output-channel blocks sized to match the SIMD width.
Depth-first kernel application: Inner loops over kernel/channel are unrolled for maximum FMA utilization.
Data layout: “Channels-last” and blocked layouts enable unit stride for memory-efficient vectorization (Georganas et al., 2018).

Implementation is accomplished either manually (hand-written SIMD microkernels) or via dynamic code generation (JIT), with inner-most loops engineered to reside in registers or L1 at all times (Georganas et al., 2018).
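The blocking strategies listed above amount to reordering and tiling the basic loop nest. The sketch below shows one possible ordering in pure Python, with output-channel blocks outermost and spatial tiles inside them; the name `tiled_conv`, the default tile sizes, and the particular loop order are assumptions for illustration rather than the scheme of any one cited paper.

```python
def tiled_conv(x, wgt, tile_h=4, tile_w=4, oc_blk=2):
    """Stride-1 "valid" convolution with spatial and output-channel blocking."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out, k_h, k_w = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    h_out, w_out = h - k_h + 1, w - k_w + 1
    y = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for o0 in range(0, c_out, oc_blk):            # output-channel block
        o1 = min(o0 + oc_blk, c_out)
        for i0 in range(0, h_out, tile_h):        # spatial tile rows
            i1 = min(i0 + tile_h, h_out)
            for j0 in range(0, w_out, tile_w):    # spatial tile columns
                j1 = min(j0 + tile_w, w_out)
                # Everything below touches only one small output tile.
                for c in range(c_in):
                    for u in range(k_h):
                        for v in range(k_w):
                            for o in range(o0, o1):
                                s = wgt[o][c][u][v]
                                for i in range(i0, i1):
                                    xr, yr = x[c][i + u], y[o][i]
                                    for j in range(j0, j1):
                                        yr[j] += s * xr[j + v]
    return y


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
wgt = [[[[1.0, 1.0], [1.0, 1.0]]]]
print(tiled_conv(x, wgt))  # [[[12.0, 16.0], [24.0, 28.0]]]
```

In a native implementation the innermost `j` loop would map to SIMD lanes and the `o` block to a group of vector registers; the Python version only demonstrates that the tiling leaves the result unchanged.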

4. Memory Complexity, Arithmetic Intensity, and Roofline Analysis

Memory Footprint

The memory requirement is restricted to the input, kernel, and output tensors, plus (at most) cache-resident working tiles:

im2col+GEMM: $O(c_{in} k_h k_w h' w')$
Zero-overhead direct: $O(h w')$ (SMM-Conv) or $O$(small L1-sized buffers + outputs) (SConv)

Comparing the two extra buffers, the im2col matrix ($c_{in} k_h k_w h' w'$ elements) is larger than the SMM-Conv slice buffer ($h w'$ elements) by a factor of $c_{in} k_h k_w h' / h$, which approaches $c_{in} k_h k_w$ when $h \gg k_h$. The savings are therefore large on modern convolutional layers (Ofir et al., 2024, Ferrari et al., 2023).
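A short calculation makes the gap tangible. The helper below simply evaluates the two buffer sizes, the im2col matrix of $c_{in} k_h k_w h' w'$ elements and the SMM-Conv slice buffer $B$ of $h \times w'$ elements; the function name and the example layer dimensions are hypothetical and are not taken from the cited benchmarks.

```python
def extra_buffer_elems(c_in, h, w, k_h, k_w):
    """Return (im2col_elems, smm_conv_elems) for a stride-1 "valid" convolution."""
    h_out, w_out = h - k_h + 1, w - k_w + 1
    im2col_elems = c_in * k_h * k_w * h_out * w_out  # full lowered matrix
    smm_elems = h * w_out                            # one h x w' slice buffer
    return im2col_elems, smm_elems


# Hypothetical layer: 64 input channels, 58x58 input, 3x3 kernel.
im2col_elems, smm_elems = extra_buffer_elems(64, 58, 58, 3, 3)
print(im2col_elems, smm_elems, im2col_elems // smm_elems)
```

For this layer the ratio is in the hundreds, consistent with the $c_{in} k_h k_w h'/h$ estimate that follows from the two sizes.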

Arithmetic Intensity

Direct methods maximize reuse of data already held in registers and cache, so they approach roofline limits and are either compute- or bandwidth-bound depending on the machine's ratio of peak FLOP rate to memory bandwidth. Measured arithmetic intensity for, e.g., ResNet-50 $3 \times 3$ convolutions is reported in (Georganas et al., 2018).
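As a point of reference, an upper bound on arithmetic intensity can be computed by assuming that each tensor crosses the memory interface exactly once: $I = 2\, c_{out} c_{in} k_h k_w h' w' \,/\, \big(s\,(c_{in} h w + c_{out} c_{in} k_h k_w + c_{out} h' w')\big)$, with $s$ bytes per element. This idealized model, the function name, and the example layer below are assumptions for illustration; they are not the measured figures from the cited work.

```python
def ideal_intensity(c_in, c_out, h, w, k_h, k_w, elem_bytes=4):
    """FLOP per byte if input, weights, and output each move exactly once."""
    h_out, w_out = h - k_h + 1, w - k_w + 1
    flops = 2 * c_out * c_in * k_h * k_w * h_out * w_out  # multiply + add
    elems = c_in * h * w + c_out * c_in * k_h * k_w + c_out * h_out * w_out
    return flops / (elem_bytes * elems)


# Hypothetical 3x3 layer: 64 -> 64 channels, 58x58 input, fp32.
print(round(ideal_intensity(64, 64, 58, 58, 3, 3), 1))  # 127.6
```

Real kernels fall below this bound whenever tiles are reloaded, which is why the blocking choices discussed earlier matter for where a layer lands on the roofline.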

5. Experimental Results and Performance

The elimination of memory overhead translates into demonstrable performance gains:

Model / platform	im2col+GEMM	SMM-Conv	SConv (mean)	Direct SIMD JIT	Observed Speedup
AlexNet	0.4608s	0.1348s	N/A	N/A	$\approx 3.4\times$ (Ofir et al., 2024)
VGG	2.3670s	1.3535s	N/A	N/A	$\approx 1.75\times$ (Ofir et al., 2024)
YoloV3	0.4478s	0.2889s	N/A	N/A	$\approx 1.55\times$ (Ofir et al., 2024)
ONNXNet (x86 mean)	N/A	N/A	faster than im2col+GEMM	N/A	reduced packing cost (Ferrari et al., 2023)
Skylake-SP	1.2 TF/s (im2col)	N/A	N/A	3.1 TF/s (direct)	$\approx 2.6\times$ (Georganas et al., 2018)

Direct convolution consistently reduces both intermediate memory footprint and DRAM traffic. In the timings above it delivers speedups of roughly $1.5\times$ to $3.4\times$ over im2col+GEMM for inference tasks, and it continues to scale well as thread count increases (Ofir et al., 2024, Zhang et al., 2018, Ferrari et al., 2023, Georganas et al., 2018). Removing the im2col stage yields a larger relative benefit as FMA throughput grows (e.g., on POWER10 MMA) (Ferrari et al., 2023).

6. Practical Considerations and Hardware Aspects

Cache Optimizations

Zero-memory-overhead approaches structure loops and tile sizes so that transient working sets reside in L1 or L2. The compact buffer or register allocations are reused many times before refill (in SMM-Conv, once per vertical kernel offset and output channel), maximizing locality (Ofir et al., 2024).

Multi-threading

Task parallelism is exposed in the output-channel, batch, or spatial tile dimensions. Thread-private buffers avoid false sharing, and no locks are required since output slices are disjoint. Scaling remains nearly linear until core-saturation (Ofir et al., 2024, Zhang et al., 2018).
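A minimal sketch of the output-channel partitioning described here, using Python's standard `concurrent.futures` thread pool. Each task computes one output plane and writes nowhere else, so no locking is needed. The function names and the one-channel-per-task granularity are assumptions for illustration; in CPython the global interpreter lock serializes this pure-Python arithmetic, so the example shows the partitioning rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def conv_one_channel(x, w_o):
    """Compute one output plane from filter w_o, indexed [c][u][v]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k_h, k_w = len(w_o[0]), len(w_o[0][0])
    h_out, w_out = h - k_h + 1, w - k_w + 1
    plane = [[0.0] * w_out for _ in range(h_out)]  # private to this task
    for c in range(c_in):
        for u in range(k_h):
            for v in range(k_w):
                s = w_o[c][u][v]
                for i in range(h_out):
                    for j in range(w_out):
                        plane[i][j] += s * x[c][i + u][j + v]
    return plane


def parallel_conv(x, wgt, workers=4):
    """Distribute output channels over a thread pool; outputs are disjoint."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves order, so plane o lands at index o of the result.
        return list(pool.map(lambda w_o: conv_one_channel(x, w_o), wgt))


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
wgt = [[[[1.0, 1.0], [1.0, 1.0]]], [[[2.0, 0.0], [0.0, 0.0]]]]
print(parallel_conv(x, wgt))
```

The input tensor is shared read-only across tasks, which mirrors the description above: only the output slices need to be disjoint.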

ISA Extensions and Vectorization

Advanced SIMD instructions (e.g., FMA, vector shifts) are leveraged for inner-loop fusion and input-tile packing, with special handling for stride-1 when hardware supports vector-register serial shifting (e.g., POWER10 VSX, x86 AVX-512). Code generation can elide all loop-boundary checks and branch overhead (Ferrari et al., 2023, Georganas et al., 2018).

7. Limitations, Extensions, and Open Problems

Zero-memory-overhead direct convolution is most effective under:

Inference scenarios (filter repacking is trivial at compile time; training requires on-the-fly tiling) (Ferrari et al., 2023).
Moderate to large output channel counts (for sufficient register and FMA utilization) (Ofir et al., 2024).
Stride-1 or small strides (for optimal VBP use); stride $> 1$ requires scalar loads or microtiles (Ferrari et al., 2023).

Current SMM-Conv and SConv algorithms are written for “valid” convolution; extension to “same” padding involves lightweight zero-insertion during extraction. Hybridization with Winograd or FFT may be optimal for large kernels or highly compute-bound layers. Generalization to GPU architectures, backward-pass convolutions, and autotuning of blocking parameters remain active areas of investigation (Ofir et al., 2024, Ferrari et al., 2023, Zhang et al., 2018).
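One way the “same” padding extension could look is sketched below: zeros are inserted while each slice is extracted, so no padded copy of the whole input is ever built. This is an assumption about how such an extension might be written, not code from the cited papers; the name `smm_conv_same` is hypothetical, and the sketch assumes stride 1 and odd kernel sizes.

```python
def smm_conv_same(x, wgt):
    """Stride-1 "same" convolution with zero-insertion during slice extraction."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out, k_h, k_w = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    pad_h, pad_w = (k_h - 1) // 2, (k_w - 1) // 2  # assumes odd k_h, k_w
    y = [[[0.0] * w for _ in range(h)] for _ in range(c_out)]
    zero_row = [0.0] * w  # shared read-only padding row
    for c in range(c_in):
        for v in range(k_w):
            shift = v - pad_w
            # Extract the horizontally shifted slice, writing zeros where the
            # shifted column falls outside the input.
            buf = [zero_row] * pad_h
            for row in x[c]:
                buf.append([row[j + shift] if 0 <= j + shift < w else 0.0
                            for j in range(w)])
            buf += [zero_row] * pad_h
            for u in range(k_h):
                for o in range(c_out):
                    s = wgt[o][c][u][v]
                    for i in range(h):
                        src, dst = buf[u + i], y[o][i]
                        for j in range(w):
                            dst[j] += s * src[j]
    return y


ones = [[[1.0] * 3 for _ in range(3)]]
box = [[[[1.0] * 3 for _ in range(3)]]]
print(smm_conv_same(ones, box))  # [[[4.0, 6.0, 4.0], [6.0, 9.0, 6.0], [4.0, 6.0, 4.0]]]
```

The buffer grows only by the $2 \cdot \lfloor (k_h - 1)/2 \rfloor$ padding rows, so the memory overhead stays on the order of a single input plane.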

References

"SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution" (Ofir et al., 2024)
"Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions" (Ferrari et al., 2023)
"High Performance Zero-Memory Overhead Direct Convolutions" (Zhang et al., 2018)
"Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures" (Georganas et al., 2018)
