Convolution Slicing Analysis Techniques
- Convolution slicing analysis is a method for partitioning and optimizing convolution operations to improve memory access, parallelism, and computational efficiency.
- It employs techniques such as Tensor Slicing Optimization, MLIR transform dialects, and cache blocking to achieve speed-ups and near-peak hardware utilization.
- The approach spans practical applications in neural inference and geometry processing as well as theoretical advances in FFT, bit-sliced arithmetic, and image distribution metrics.
Convolution Slicing Analysis encompasses methodologies for partitioning convolution operations or data associated with convolutions, with the goal of optimizing memory access, throughput, parallelism, or mathematical properties across diverse domains including neural processor architectures, machine learning compilation, streaming geometry computation, signal processing, and mathematical analysis of transforms. It operates at the intersection of computational cost modeling, parallel workload division, domain-specific mathematical slicing, and efficient algorithmic transformation. The implementations and theoretical underpinnings span compiler optimization for neural accelerators (Sousa et al., 2023), MLIR-based convolution codegen (Ferrari et al., 22 Nov 2025, Ferrari et al., 2023), streaming convolution in geometry (Liu et al., 2021), bit-slicing in FPGA arithmetic (N et al., 25 Apr 2024), and mathematical slicing for generalized Fourier analysis (Cnudde et al., 2015), among others.
1. Convolution-Slicing for Multicore Neural Processor Units
In highly-constrained multicore NPUs, slicing analysis is central to mapping CNN workloads onto on-chip memory-limited hardware while maximizing data parallelism and minimizing both DRAM transactions and host-device transfers. Tensor Slicing Optimization (TSO) formalizes the convolution-slicing problem as the selection of slice dimensions that fit input, weight, and output tiles into scratchpads of fixed size, ensuring balance across cores and minimal overall execution time (Sousa et al., 2023).
A two-tier memory cost model distinguishes raw bandwidth and burst-mode DRAM access:
- TSO-burst: The tile's memory cost is computed per DRAM row and burst length, so it favors wide row-major tiles that fill entire bursts.
- TSO-noburst: The cost counts raw element transfers, omitting burst effects.
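The two cost models can be contrasted with a minimal sketch; the element size, burst length, and cost formulas here are illustrative assumptions, not TSO's actual model:

```python
import math

def tso_noburst_cost(tile_rows, tile_cols, elem_bytes=4):
    """Raw-bandwidth cost: every tile element is transferred exactly once."""
    return tile_rows * tile_cols * elem_bytes

def tso_burst_cost(tile_rows, tile_cols, elem_bytes=4, burst_bytes=64):
    """Burst-mode cost: each tile row is fetched in whole bursts, so a
    narrow row wastes most of its last burst; wide row-major tiles win."""
    bursts_per_row = math.ceil(tile_cols * elem_bytes / burst_bytes)
    return tile_rows * bursts_per_row * burst_bytes
```

Under this model, a wide 4x256 tile and a tall 256x4 tile move the same number of elements, yet the tall tile's burst-mode cost is higher because every 16-byte row still occupies a full 64-byte burst.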
The search algorithm exhaustively sweeps slicing/tile/scheduling parameter space (partitioning kernels/output-channels/input features; input/weight/output stationarity), pruned by memory fit constraints, and selects tilings that approach peak MAC unit utilization through alignment of tile sizes with hardware SIMD width.
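The exhaustive sweep with memory-fit pruning can be sketched as follows; the shape parameters, the simplified per-tile footprints, and the scoring rule are assumptions for illustration, not TSO's actual search:

```python
import itertools

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def pick_tiling(H, W, C, K, spad_bytes, elem=4, simd=16):
    """Sweep candidate tile shapes (output height/width, input channels,
    output channels), prune tilings whose input+weight+output tiles
    overflow the scratchpad, and prefer tiles that fill SIMD lanes."""
    best, best_score = None, -1.0
    for th, tw, tc, tk in itertools.product(
            divisors(H), divisors(W), divisors(C), divisors(K)):
        in_tile  = th * tw * tc * elem   # input slice (simplified: 1x1 kernel)
        w_tile   = tc * tk * elem        # weight slice
        out_tile = th * tw * tk * elem   # output slice
        total = in_tile + w_tile + out_tile
        if total > spad_bytes:
            continue                     # memory-fit pruning
        lanes_used = (tk / simd) / -(-tk // simd)   # fraction of SIMD lanes busy
        score = lanes_used * total       # favor full lanes, then larger tiles
        if score > best_score:
            best, best_score = (th, tw, tc, tk), score
    return best
```

The pruning step is what keeps the exhaustive sweep tractable: only divisor-aligned tile shapes that fit the scratchpad are ever scored.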
Experimental results (NMP, TF-XLA, Glow) demonstrate:
- Up to 21.7% speed-up for TSO-burst vs. no-burst slicing on InceptionV3, and 15% average.
- Roofline analysis: TSO-generated code achieves near-peak effective bandwidth and MAC throughput.
- Portability: The cost model translates unmodified to Glow IR, validating its backend-agnostic applicability.
2. Convolution Slicing in MLIR Transform Dialects
SConvTransform extends MLIR's Transform dialect to optimize 2D convolutions through declarative convolution slicing analysis (CSA) (Ferrari et al., 22 Nov 2025). The CSA pass analyzes convolution operator attributes (input/filter/output shapes, cache hierarchy), determines tile sizes for each level of the L1/L2/L3 cache hierarchy, and emits parametric affine maps for packing and tiling.
Key principles include:
- Static partitioning into tiles that maximize register and cache reuse.
- Recursive edge-case handling: when dimensions do not divide evenly by tile size, CSA triggers recursive splitting, yielding two MLIR GenericOps per irregular dimension (main tile + remainder) with minimal affine fixup for correctness.
- Pipeline stages: convolution normalization, slicing analysis, edge-case split, two-level loop tiling, parametric packing via affine maps, lowering to microkernel/BLAS, and finally to LLVM IR.
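The recursive edge-case handling above can be sketched as a one-dimensional splitter; the tuple layout is a hypothetical representation of the two GenericOps (main tile plus remainder), not CSA's actual IR:

```python
def split_dim(size, tile):
    """Split a dimension into a main region of full tiles plus a remainder,
    mirroring CSA's edge-case handling: an irregular dimension yields two
    ops, one over the evenly tiled extent and one over the leftover."""
    main = (size // tile) * tile
    parts = []
    if main:
        parts.append((0, main, tile))                   # (offset, extent, tile)
    if size - main:
        parts.append((main, size - main, size - main))  # remainder op
    return parts
```

For example, a dimension of 10 with tile size 4 splits into a main op covering [0, 8) with tile 4 and a remainder op covering [8, 10) with tile 2; an evenly divisible dimension produces a single op.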
Performance validation reaches 60% (Arm SME) to 67% (Intel AVX512) of theoretical peak, confirming CSA's effectiveness for code generation. The CSA-guided transform is modular and integrates with other MLIR extensions. Overhead observed in epilogue tiles and the lack of automated padding suggest directions for future enhancement.
3. Cache Blocking and Direct Convolution Macro-Kernel Slicing
Convolution Slicing Analysis for direct, matrix-free convolution code generation is formalized as a multidimensional cache blocking problem (Ferrari et al., 2023). Each ConvOp is partitioned with block parameters corresponding to output windows, output filters, and input channels, subject to hierarchical cache-size constraints.
A cost model quantifies cold/hot memory transfers across DRAM/L3/L2, with reuse factors for input and filter tiles. The CSA heuristic "halves" the largest block dimension until the aggregate tile size fits the L1 cache, then solves for secondary levels (L2, L3) and compares scheduling strategies (Input-Stationary vs. Weight-Stationary). The selected blocking and scheduling metadata feed a downstream codegen pass (CSO) that unfolds the original convolution into a tiled, register-packed macro-kernel without Im2Col or GEMM, yielding substantial inference time speedup relative to BLAS-based approaches.
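The "halving" heuristic can be sketched as below; the block names and the simplified tile-footprint formula are assumptions for illustration, not the paper's exact model:

```python
def csa_block(dims, elem_bytes, l1_bytes):
    """CSA-style heuristic sketch: repeatedly ceil-halve the largest block
    dimension until the aggregate tile footprint fits in L1."""
    blocks = dict(dims)  # e.g. {"win": output windows, "ofm": filters, "ifm": channels}

    def tile_bytes(b):
        # simplified footprint: input + filter + output tiles
        return elem_bytes * (b["win"] * b["ifm"]
                             + b["ifm"] * b["ofm"]
                             + b["win"] * b["ofm"])

    while tile_bytes(blocks) > l1_bytes:
        largest = max(blocks, key=blocks.get)
        if blocks[largest] == 1:
            break                                   # cannot shrink further
        blocks[largest] = -(-blocks[largest] // 2)  # ceil-halve
    return blocks
```

After the L1-level blocks are fixed, the same procedure would be repeated against the L2 and L3 capacities, and the resulting candidates compared under Input-Stationary vs. Weight-Stationary schedules.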
4. Slicing Analysis in FFT and Bit-Sliced Convolution
Bit-slicing analysis for FFT and linear-convolution implementation transforms large multiplies into parallelized LUT lookups using the Bit-Slicing Multiplier (BSM) architecture (N et al., 25 Apr 2024). Inputs and kernels are partitioned into parallel fixed-width bit slices, and the convolution sum is rewritten as a two-level accumulation over slice products. The BSM hardware arranges LUTs across the slices, shifting and accumulating the partial products.
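The slice decomposition and shift-accumulate structure can be modeled in software; the per-slice products here stand in for the hardware LUT lookups, and the slice width and count are illustrative:

```python
def bit_slices(x, slice_bits, n_slices):
    """Decompose a non-negative integer into n_slices slices of slice_bits each."""
    mask = (1 << slice_bits) - 1
    return [(x >> (i * slice_bits)) & mask for i in range(n_slices)]

def sliced_mul(a, b, slice_bits=4, n_slices=3):
    """Multiply via per-slice partial products (stand-ins for LUT lookups),
    shifted by the combined slice position and accumulated, as in a BSM."""
    total = 0
    for i, sa in enumerate(bit_slices(a, slice_bits, n_slices)):
        for j, sb in enumerate(bit_slices(b, slice_bits, n_slices)):
            total += (sa * sb) << ((i + j) * slice_bits)  # LUT[sa][sb], shifted
    return total
```

Each partial product involves only `slice_bits`-wide operands, which is what lets the hardware replace a wide multiplier with small lookup tables at the cost of more accumulation logic.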
When mapped onto FPGAs, BSM-based convolution achieves real-time throughput with reduced dependency on DSP blocks (0 vs. 15 DSP blocks for a 15-tap convolution), at the cost of triple the LUT consumption and 32-cycle latency for 12-bit data (vs. 15 cycles conventionally). LUT resource usage scales steeply with bit-width and slice count, delimiting practical design points for DSP-constrained architectures.
5. Streaming Convolution Slicing in Computational Geometry
Convolution-surface slicing for adaptive lattice structures replaces explicit mesh representation with local evaluation of implicit convolution fields defined on strut graphs (Liu et al., 2021). Each slicing plane maintains only the current set of struts whose swept spheres intersect the slice, enabling evaluation of the implicit field via a closed-form convolution per active edge. The streaming slicing algorithm grows and prunes the active edge set as the slicing plane advances, so per-slice memory scales only with the active edge count and the slice's pixel count, even at 100M struts.
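The grow/prune maintenance of the active edge set can be sketched as a sweep over z-intervals; representing each strut as a radius-padded z-interval with a payload is an assumption of this sketch, not the paper's data structure:

```python
def stream_slices(struts, z_min, z_max, dz):
    """Stream slicing planes in +z order, growing/pruning the active strut
    set. Each strut is ((z_lo, z_hi), payload), where the interval already
    includes the swept-sphere radius padding."""
    struts = sorted(struts, key=lambda s: s[0][0])    # by lower z bound
    active, nxt = [], 0
    z = z_min
    while z <= z_max:
        while nxt < len(struts) and struts[nxt][0][0] <= z:
            active.append(struts[nxt]); nxt += 1      # grow: strut now reachable
        active = [s for s in active if s[0][1] >= z]  # prune: strut passed
        yield z, [s[1] for s in active]               # field evaluated on these only
        z += dz
```

Because a strut enters the working set only while its padded interval overlaps the plane, memory never depends on the total strut count, only on how many struts any single slice touches.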
The convolution kernel's local, compact support ensures blending at intersections without stress concentration, outperforming mesh, distance-field, and Gaussian blending in both geometric accuracy and finite-element stress mitigation.
| Model (struts) | MC mesh (MB) | Convolution slice (MB) | Max \|E_act\| | Slicing time/slice (s) |
|----------------|--------------|------------------------|---------------|------------------------|
| Bunny (1.2e5)  | 12,551       | 3.4                    | 26,069        | 26.5                   |
| Kitten-HR (1e8)| 83,225       | 475.6                  | 1,061,866     | 347.2                  |
6. Mathematical Slicing in Functional Analysis
Slice-based convolution analysis appears in the construction of the slice Fourier transform for slice monogenic functions (Cnudde et al., 2015). The kernel is derived via a Mehler series, leading to two non-commutative forms: the Mustard convolution ($*_s$) and convolution via generalized translation operators. The Mustard convolution ensures multiplicativity under the slice Fourier transform $\mathcal{F}_s$: $\mathcal{F}_s[f *_s g] = \mathcal{F}_s[f]\,\mathcal{F}_s[g]$, generalized via explicit formulae in Clifford algebra. The two forms coincide under integration against translated kernels, yielding a unified convolutional theory in higher-dimensional function spaces.
7. Slicing in Image Distribution Metrics
Convolutional slicing operators underlie the definition of convolution sliced Wasserstein distances (CSW), generalizing classical sliced Wasserstein by projecting images through convolutional kernels, possibly strided, dilated, or non-linear (Nguyen et al., 2022). CSW operators map images to scalars through stacked convolution layers, and discrepancies between distributions are computed as Monte Carlo averages over such random convolutional projections.
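A Monte Carlo estimate of such a sliced discrepancy can be sketched as below; for simplicity this uses a single full-image convolution per projection rather than the authors' multi-layer strided/dilated operators, so it is an illustrative stand-in, not their exact estimator:

```python
import numpy as np

def csw_distance(X, Y, n_proj=64, p=2, seed=0):
    """Sliced-discrepancy sketch: project each image to a scalar with a
    random unit-norm convolution kernel (single full-image convolution),
    then average closed-form 1-D Wasserstein distances over projections."""
    rng = np.random.default_rng(seed)
    _, h, w = X.shape
    total = 0.0
    for _ in range(n_proj):
        k = rng.standard_normal((h, w))
        k /= np.linalg.norm(k)                   # random projection direction
        px = np.sort((X * k).sum(axis=(1, 2)))   # one scalar per image
        py = np.sort((Y * k).sum(axis=(1, 2)))
        total += np.mean(np.abs(px - py) ** p)   # 1-D W_p^p for equal-size samples
    return (total / n_proj) ** (1 / p)
```

The projection kernel here has only h*w parameters shared across the whole image, which is the structural source of CSW's memory savings over flattening each image into one long projection vector.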
CSW matches classical SW in sample complexity and, for the strided/dilated variants, in computational cost, while substantially reducing the memory required to store projection directions, with improved performance in image-based generative modeling tasks.
Convolution Slicing Analysis thus constitutes a foundational paradigm for efficient and scalable implementation, transformation, and theoretical analysis of convolution operations under hardware, algorithmic, and mathematical constraints, with cross-disciplinary impact in neural inference, geometry processing, arithmetic design, and distribution-metric theory.