SConvOp: Specialized Convolution Operators
- SConvOp is a suite of specialized convolution operators that optimize dense tensor computations by replacing traditional Im2Col+GEMM with macro-kernel techniques leveraging static slicing, tiling, and hardware-specific microkernels.
- For non-Euclidean domains such as graphs and point clouds, SConvOp employs structured convolution strategies that enable exact CNN weight reuse and efficient sparse connectivity handling.
- The spherical sifting convolution variant uses harmonic transforms for anisotropic filtering on spherical signals, achieving scalable computation and favorable performance on complex geometries.
SConvOp is a term that denotes several distinct convolution operators specialized for different domains and computational targets. Within the academic literature, SConvOp refers to: (1) direct-convolution macro-kernel generation and optimization within MLIR/LLVM compiler toolchains for dense tensors; (2) structured convolutions on irregular graphs or non-Euclidean domains (including graphs, superpixels, surfaces, and point clouds); and (3) harmonically-defined sifting convolution for spherical signals. Each SConvOp formulation exhibits domain-specific mathematical definitions, algorithmic strategies, and hardware implementation considerations.
1. MLIR/LLVM SConvOp for 2D Dense Convolution
SConvOp, as implemented in MLIR Transform-dialect extensions such as SConvTransform, provides a declarative lowering mechanism for 2D convolution in compiler pipelines. The primary motivation is to replace Im2Col+GEMM patterns with an architecture-aware, macro-kernel–driven, direct convolution, leveraging static slicing, tiling, and packing (Ferrari et al., 22 Nov 2025, Ferrari et al., 2023).
Formal Transformation Pipeline
SConvOp operates on Linalg's convolution payloads in NCHW/FCHW form and is parameterized by both convolutional shape data and architecture-specific microkernel hints (e.g., tile and cache sizes, vector unit widths). The operator rewrites the original convolution into a sequence of optimized generic operations using the following pipeline:
- Normalization: Collapse spatial dimensions to enable block processing.
- Convolution Slicing Analysis (CSA): Statically determine a scheduling strategy (input-stationary or weight-stationary) and select tile sizes such that the input, filter, and output tiles nest efficiently within the L1/L2/L3 cache hierarchy.
- Edge Case Splitting: Apply splitting where dimensions are not divisible by tile sizes, yielding epilogue kernels with affine index corrections but without mainline packing/tiling.
- Hierarchical Tiling: Construct outer tiling loops for macro blocks and inner tiling for micro-kernel granularity, expressed via SCF and affine maps.
- Packing and Multipacking: Compute parametric affine transformations for input and filter packing into hardware-preferred layouts. Multipacking handles multiple tiles when beneficial for reuse.
- Microkernel Lowering and Bufferization: Lower the innermost tiles to calls to highly optimized sgemm-like microkernels (e.g., OpenBLAS, MMA, AVX-512 FMA), bufferize all intermediary tensors, and emit final LLVM IR.
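The pipeline above can be sketched in NumPy. This is an illustrative model, not the SConvTransform implementation: `microkernel` stands in for an architecture-specific sgemm microkernel, channel tiling stands in for the full hierarchical tiling, and the per-tile packing loop plays the role of the packing step (done per tile so the operands stay cache-resident, unlike a monolithic Im2Col buffer).

```python
import numpy as np

def microkernel(a, b, c):
    # Stand-in for a highly optimized sgemm microkernel (e.g. AVX-512 FMA).
    c += a @ b

def direct_conv_tiled(x, w, tile_c=8):
    # x: (C, H, W) input; w: (F, C, KH, KW) filter; stride 1, no padding.
    C, H, W = x.shape
    F, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    y = np.zeros((F, OH, OW), dtype=x.dtype)
    # Outer tiling over input channels; each tile is packed, then handed
    # to the microkernel as one GEMM.
    for c0 in range(0, C, tile_c):
        c1 = min(c0 + tile_c, C)          # epilogue when C % tile_c != 0
        # Pack the filter tile into a (F, tile*KH*KW) GEMM operand.
        wp = w[:, c0:c1].reshape(F, -1)
        # Pack input windows for this channel tile only.
        cols = np.empty(((c1 - c0) * KH * KW, OH * OW), dtype=x.dtype)
        i = 0
        for c in range(c0, c1):
            for kh in range(KH):
                for kw in range(KW):
                    cols[i] = x[c, kh:kh + OH, kw:kw + OW].ravel()
                    i += 1
        microkernel(wp, cols, y.reshape(F, -1))
    return y
```

The channel-tile loop is where the real pipeline would interleave its SCF tiling loops and affine packing maps.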
The transformation is fully declarative and composable within the MLIR Transform dialect, enabling static analysis and modular extension (Ferrari et al., 22 Nov 2025).
2. Algorithmic Details: Slicing, Packing, and Affine Modeling
Tiling and Packing Equations
Tile sizes are selected by maximizing the tile extent subject to the constraint that the combined footprint of the packed input, filter, and output tiles fits within L1 (with analogous capacity conditions at L2 and L3). Parametric affine maps carry all index calculations for direct window-to-memory mappings. Notably:
- Filter multipacking: an affine index map gathers several filter tiles into a single packed buffer when the reuse pattern makes this profitable.
- Input packing: the input tile is rewritten from its original spatial layout into a packed, microkernel-preferred layout.
- The flat-to-spatial index calculation includes stride and dilation, supporting arbitrary convolution parameterizations (Ferrari et al., 22 Nov 2025).
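The cache-capacity constraint behind tile selection can be sketched as a one-dimensional search. This is a simplified cost model, not the paper's CSA: it only models L1 and a channel tile, whereas CSA also models L2/L3 and chooses between input- and weight-stationary schedules.

```python
def pick_tile(c_in, kh, kw, f_out, l1_bytes=32 * 1024, dtype_bytes=4):
    # Largest channel tile tc such that one packed input column, the packed
    # filter tile, and an output micro-tile together fit in L1.
    best = 1
    for tc in range(1, c_in + 1):
        footprint = (tc * kh * kw              # packed input tile column
                     + f_out * tc * kh * kw    # packed filter tile
                     + f_out) * dtype_bytes    # output micro-tile
        if footprint <= l1_bytes:
            best = tc
    return best
```

For a 3x3 kernel with 64 input and 32 output channels and a 32 KiB L1, this model settles on a channel tile of 27.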
Vector-Based Packing (VBP)
For stride-1 convolutions, SConvOp exploits SIMD “shift” instructions for efficient, on-demand packing, reducing repeated loads of overlapping windows in registers (Ferrari et al., 2023). This is especially beneficial on architectures with wide vector units and supports nearly full utilization of compute peak (reported up to 67% on AVX-512 and 59.6% on ARM SME for aligned cases).
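The data reuse that VBP exploits can be seen with NumPy's zero-copy window view; the register-shift instructions themselves have no NumPy analogue, so this only illustrates why overlapping stride-1 windows need not be re-materialized.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# A stride-1 row of length 8 filtered with a width-3 kernel touches 6
# overlapping windows; adjacent windows share 2 of their 3 elements.
row = np.arange(8, dtype=np.float32)
wins = sliding_window_view(row, 3)   # shape (6, 3): a zero-copy view
# VBP keeps the shared elements resident in vector registers and shifts
# them into place, instead of reloading them from memory (or copying
# them, as an Im2Col-style packing of `wins` would).
```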
Handling Irregularities
Edge cases (epilogue tiles) are split using transform.split and index adjustments via affine maps. While the main pipeline is fully tiled and packed, remainder regions are handled by simpler index fixups to maintain correctness in unaligned fragments of the input or output domain (Ferrari et al., 22 Nov 2025).
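The main/epilogue split can be sketched as a partition of the iteration space; `tiled_loop` is a hypothetical helper mirroring what `transform.split` does to the loop structure, not MLIR code.

```python
def tiled_loop(n, tile):
    # Split an iteration space of size n into a fully tiled main region
    # and an epilogue remainder for the unaligned tail.
    main_end = (n // tile) * tile
    work = []
    for i in range(0, main_end, tile):
        work.append(("main", i, i + tile))       # packed + tiled fast path
    if main_end < n:
        work.append(("epilogue", main_end, n))   # affine index-fixup path
    return work
```

For example, `tiled_loop(10, 4)` yields two main tiles and one two-element epilogue.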
3. Domain-Generalized SConvOp: Graphs, Surfaces, and Point Clouds
SConvOp has also been established as a unified convolutional operator for irregular, non-grid domains (Hart et al., 2022, Jack et al., 2020):
SelectionConv for Positional Graphs
SelectionConv (also denoted SConvOp) formalizes convolution on graphs by partitioning edges via a spatial "selection" function that assigns each edge one of the discrete directions of a $3 \times 3$ image kernel. For each label $k$, a sparse adjacency matrix $A_k$ is built, and feature propagation is realized via

$$X' = \sum_k A_k X W_k,$$

where the $W_k$ are direction-specific weight matrices copied directly from discrete CNN kernels, supporting exact weight transfer and permutation invariance. This allows standard 2D CNNs to operate natively on superpixels, spherical discretizations, or arbitrary mesh graphs without retraining (Hart et al., 2022).
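The per-direction aggregation reduces to a handful of sparse-dense matmuls; a minimal sketch with SciPy (the adjacency construction from the selection function is assumed done upstream):

```python
import numpy as np
import scipy.sparse as sp

def selection_conv(x, adj_by_label, weights):
    # x: (N, C_in) node features.
    # adj_by_label: one sparse (N, N) adjacency matrix A_k per selection
    #   label (e.g. the 9 directions of a 3x3 kernel: 8 neighbors + self).
    # weights: matching (C_in, C_out) matrices W_k, copied directly from
    #   a pretrained CNN kernel's spatial taps.
    out = np.zeros((x.shape[0], weights[0].shape[1]))
    for A_k, W_k in zip(adj_by_label, weights):
        out += A_k @ (x @ W_k)    # one sparse-dense matmul per direction
    return out
```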
Sparse Convolutions on Continuous Domains
For point clouds or event streams, SConvOp is implemented as a sparse-matrix product of the form

$$X' = \sum_p N_p X \Theta_p,$$

where each $N_p$ encodes ball-search neighborhoods with polynomial basis weighting, and the $\Theta_p$ are learnable kernel weights. All local geometry and neighborhoods are encoded in very sparse matrices, supporting efficient sparse-dense matmuls and block-diagonal batching. This paradigm is extensible to continuous-time event streams with per-event exponential decays or similar kernels (Jack et al., 2020).
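A minimal SciPy sketch of this scheme, under simplifying assumptions: a single basis term, and a plain $(1 - d/r)$ falloff standing in for the paper's learned polynomial basis.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial import cKDTree

def build_neighborhood(points, radius):
    # Sparse (N, N) matrix weighting each ball-search neighbor; a simple
    # (1 - d/r) falloff stands in for the learned polynomial basis.
    tree = cKDTree(points)
    rows, cols, vals = [], [], []
    for i, nbrs in enumerate(tree.query_ball_point(points, radius)):
        for j in nbrs:
            d = np.linalg.norm(points[i] - points[j])
            rows.append(i); cols.append(j); vals.append(1.0 - d / radius)
    return sp.csr_matrix((vals, (rows, cols)), shape=(len(points),) * 2)

def sparse_conv(points, feats, theta, radius=0.75):
    # feats: (N, C_in) point features; theta: (C_in, C_out) kernel weights.
    N = build_neighborhood(points, radius)
    return N @ (feats @ theta)   # one sparse-dense matmul
```

Once `N` is built it can be reused across layers and, batched block-diagonally, across examples.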
4. Sifting Convolution Operator on the Sphere
An alternative SConvOp formulation is the sifting convolution for spherical domains (Roddy et al., 2020). The key components are:
- Translation Operator: $\mathcal{T}_\omega$, an analogue of Euclidean translation, defined by its action on the spherical harmonic basis.
- Sifting Convolution: Defined as the inner product with the translated signal, $(f \circledast g)(\omega) = \langle \mathcal{T}_\omega f, g \rangle$, which in harmonic space reduces to a coefficient-wise product of the signals' harmonic coefficients.
- Computational Complexity: Dominated by two spherical harmonic transforms and a coefficient-wise multiplication, with total cost scaling as $\mathcal{O}(L^3)$ for band-limit $L$.
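The harmonic-space core of the operator is trivially small; the sketch below assumes coefficient arrays already produced by a spherical harmonic transform library (not shown), and leaves the normalization convention unfixed.

```python
import numpy as np

def lm_index(ell, m):
    # Flat index into a length-L^2 coefficient array for degree ell, order m.
    return ell * ell + ell + m

def sifting_convolve(flm, glm):
    # In harmonic space the sifting convolution reduces to a coefficient-wise
    # product (up to a normalization convention not fixed here).  The two
    # spherical harmonic transforms bracketing this step dominate the
    # O(L^3) cost; the product itself is only O(L^2).
    return flm * glm
```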
This operator uniquely supports anisotropic (directional) kernels, ensures outputs reside on the sphere (not SO(3)), exhibits (conjugate) commutativity, and achieves favorable computational scaling relative to other spherical convolution paradigms (Roddy et al., 2020).
5. Performance Metrics and Empirical Results
Empirical evaluations across the SConvOp family demonstrate robust gains when dense Im2Col+GEMM is replaced with tiling- and packing-optimized SConvOp variants:
- MLIR SConvOp achieves up to 67% peak on Intel AVX-512 and 60% on ARM SME for aligned cases; median utilization is 28.7% (AVX-512) and 15.5% (SME). Model inference speedups range 9–42%, with direct convolution kernels accelerated by 12–46% (Ferrari et al., 22 Nov 2025, Ferrari et al., 2023).
- Packing time is reduced by 2–7× over Im2Col + BLAS on x86 and POWER10.
- SelectionConv/Sparse SConvOp enables weight transfer to non-grid domains, efficient handling of millions of nodes for depth and style transfer, seam-free mesh texture transfer, and sub-second event stream updates (Hart et al., 2022, Jack et al., 2020).
6. Comparative Properties Across SConvOp Variants
| Operator Variant | Domain | Key Technical Features |
|---|---|---|
| MLIR SConvOp | Dense 2D tensors | Static slicing, tiling, packing; SIMD; up to 67% peak; microkernel lowering |
| SelectionConv SConvOp | Positional graphs, surfaces | Directionally partitioned edge aggregation; exact CNN weight reuse; sparse per-edge propagation |
| Sparse SConvOp | Point clouds, event streams | Ball-neighborhood, polynomial basis kernel, sparse-matrix batching, event-time extensions |
| Spherical Sifting SConvOp | Sphere $S^2$, harmonic space | Anisotropic convolution; outputs on sphere; harmonic diagonalization; $\mathcal{O}(L^3)$ scaling |
The design space addressed by SConvOp encompasses both high-performance convolution macro-kernels for accelerated inference on CPU/GPU and algebraically-principled extensions to non-Euclidean and irregular data structures. This breadth demonstrates the flexibility of the "SConvOp" paradigm for both compiler-grade and modeling-theoretic advances.
7. Integration, Extensibility, and Future Directions
The MLIR-based SConvOp is implemented as a Transform-dialect extension that composes with existing MLIR Transform ops and supports integration with further compiler and accelerator targets (IREE, TVM, Torch-MLIR). All tiling and packing logic is encoded declaratively and is compatible with automated static analysis. Directions for extension include automatic epilogue absorption, deeper integration with vector/matrix IR dialects, and support for grouped and strided convolutions (Ferrari et al., 22 Nov 2025).
This suggests a widening adoption of statically scheduled, hardware-aware convolution lowering pipelines and the extension of convolutional operators to manifold, graph, and event domains. The SConvOp framework provides a unifying abstraction for these disparate requirements, encoding mathematical, algorithmic, and hardware-oriented principles in domain-appropriate forms.