Non-GEMM Vector Primitives
- Non-GEMM vector primitives are defined as specialized operations on vectors and small matrices that bypass the traditional GEMM approach to optimize memory access and dataflow.
- They include a range of functions from Level-1 BLAS (DOT, AXPY, SCAL) to sliding-window convolutions and Winograd transforms, making them integral to HPC and deep learning applications.
- Their performance benefits stem from tailored hardware support and ISA optimizations that accelerate vector reductions, segmented scans, and high-dimensional binding operations.
Non-GEMM vector primitives are algorithmic and architectural operations on vectors or small matrices that do not reduce to general dense matrix-matrix multiplication (GEMM). They encompass a broad range of fundamental computational patterns in scientific computing, data analytics, machine learning, and high-performance architectures. Unlike GEMM, which is the principal Level-3 BLAS primitive for dense linear algebra, non-GEMM primitives include vector and matrix-vector BLAS (DOT, AXPY, SCAL, GEMV), high-dimensional binding operations (circular convolution/correlation), sliding-window convolutions, Winograd transforms, segmented/parallel scan, vector outer product, compression, differentiation, reductions, and vectorized associative operations. These primitives are directly accelerated in modern CPUs, NPUs, and accelerators, often exposing distinct memory, arithmetic, and pipelining characteristics relative to GEMM.
1. Core Classes of Non-GEMM Vector Primitives
Non-GEMM primitives span several operational categories, each underpinned by specific algebraic or algorithmic structures:
- Level-1 BLAS vector operations: dot product (DOT; $x^{T}y$), scalar-vector multiply (SCAL; $x \leftarrow \alpha x$), scaled vector addition (AXPY; $y \leftarrow \alpha x + y$) (Progsch et al., 2011, Singh et al., 2021).
- Level-2 BLAS matrix-vector operations: general matrix-vector product (GEMV; $y \leftarrow \alpha A x + \beta y$), rank-1 update/outer product (GER; $A \leftarrow A + \alpha x y^{T}$) (Singh et al., 2021).
- Circular convolution/correlation: Binding and unbinding operators in hyperdimensional computing, critical for associative memory and key-value binding (Alpay et al., 28 Jan 2026).
- Stencil and outer-product-based patterns: Application of vector outer products for scatter-mode stencil computation maps directly onto new ISA primitives (Zhao et al., 2023).
- Sliding-window and scan/sum: Vector sliding, scan, and segmented-scan/segmented-sum primitives with hierarchical or speculative block structure (Sobczyk et al., 30 Jun 2025, Snytsar, 2023).
- Nonlinear and reduction ops: argmax, sum, exponentiation, reciprocal, masked/select, and top-k; direct hardware support for reductions and masked updates in NPU ISAs (Lou et al., 28 Jan 2026).
The critical distinction is that these primitives do not leverage the full matrix algebra, instead exposing spatial or dataflow locality, sparse accesses, or control flow not naturally handled by GEMM accelerators.
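As a concrete reference point, the Level-1/Level-2 patterns above can be sketched in plain Python, with scalar loops standing in for the SIMD lanes a real kernel would unroll over (function names are illustrative, not a real BLAS binding):

```python
# Illustrative scalar-loop versions of Level-1/2 BLAS primitives.
# Production kernels unroll these loops over SIMD lanes/registers.

def dot(x, y):
    """DOT: the reduction x^T y."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi
    return acc

def axpy(alpha, x, y):
    """AXPY: y <- alpha*x + y (elementwise, O(N))."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def scal(alpha, x):
    """SCAL: x <- alpha*x."""
    return [alpha * xi for xi in x]

def gemv(A, x):
    """GEMV: y <- A x, one dot product per row (x is reused across rows)."""
    return [dot(row, x) for row in A]
```

The key structural point is visible even in this sketch: DOT is a reduction (a single accumulator carries a loop dependence), while AXPY and SCAL are fully independent elementwise streams, which is why hardware treats reductions and elementwise operations as distinct primitives.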
2. Algorithmic Construction and Memory/Compute Patterns
Each non-GEMM primitive is characterized by the structure and flow of computation and memory:
- Vector primitives (DOT, AXPY, SCAL) are realized via unrolled SIMD loops, leveraging alignment, register blocking, and instruction package grouping. For SALT, unrolled U-lane blocks exploit SIMD alignment, initiate reductions in registers, and minimize memory stalls (Progsch et al., 2011).
- Matrix-vector (GEMV) and outer product (GER) operate with memory-aware blocking. GER is optimized for row-major stride-1 writes and thread-parallelization on columns, while GEMV uses register-packed micro-kernels and cache blocking for optimal data reuse (Singh et al., 2021).
- Sliding-window convolution maintains the original tensor layout; it slides a vector window, loads, multiplies, slides, and stores, eliminating im2col expansion and maximizing hardware prefetch locality. For nontrivial filter widths, this fuses sequential loads and FMAs with register shuffles (Snytsar, 2023).
- Winograd transforms: The input, filter, and output transforms are separable. Each small tile is mapped to vector lanes or registers. Transforms are performed with hand-unrolled vector kernels (using SVE/RISC-V intrinsics) that fuse multiply-adds, permutations, and transposes (Gupta et al., 2022).
- Outer-product driven stencils: Scatter-mode stencils map input vector lines and coefficient vectors into an n×n outer product, leveraging hardware instructions that accumulate vector ⊗ vector into a matrix register. Memory patterns are optimized to maximize contiguous accesses and in-register data reuse (Zhao et al., 2023).
- Segmented prefix scan/sum and related primitives: On MMV-RAM, speculative block scan is performed via block-wise matrix-vector products (multiplication with an upper-triangular matrix), local correction, and recursive composition. Compress and differentiate primitives are chained using matrix multiplications and vector masks for deep, parallel scans (Sobczyk et al., 30 Jun 2025).
- Vector reductions/top-k/elementwise ops: In NPUs (e.g., d-PLENA ISA), primitives such as V_RED_MAX_IDX, V_RED_SUM, V_SUB, V_EXP, V_MUL, V_TOPK_MASK, and masked selection (V_SELECT_INT) are natively supported, with reductions implemented as log-depth reduction trees and all elementwise, mask, and move ops exploiting lane-wise parallelization (Lou et al., 28 Jan 2026).
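The sliding-window pattern described above is easiest to see in one dimension: the window slides over the tensor in its original layout, fusing sequential loads with multiply-adds, and no im2col buffer is ever materialized (a minimal sketch, not the vectorized kernel):

```python
def sliding_window_conv1d(x, w):
    """Valid 1-D convolution (cross-correlation form) by sliding a window.

    The input x keeps its original layout: each output fuses k sequential
    loads with FMAs, so hardware prefetchers see purely contiguous traffic
    and no im2col expansion of the input is ever materialized.
    """
    k = len(w)
    out = []
    for i in range(len(x) - k + 1):   # slide the window one element at a time
        acc = 0.0
        for j in range(k):            # in a real kernel: unrolled FMAs
            acc += x[i + j] * w[j]
        out.append(acc)
    return out
```

By contrast, an im2col approach would first copy every window into the row of a (len(x)-k+1) x k matrix and then call GEMV/GEMM on it, expanding the working set by roughly a factor of k.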
Table: Representative Non-GEMM Vector Primitives
| Primitive | Core Operation | Typical Application Domain |
|---|---|---|
| DOT, AXPY, SCAL | O(N) vector/elementwise | Linear algebra, ML |
| SlidingWindow | Vector slide + FMA | DNN convolution, signal proc. |
| Winograd | Separable transforms | CNN inference |
| Outer Product | u ⊗ v, accumulate in reg | Stencils, scatter, HPC |
| Segmented Scan | Recursive block-scan+fixup | Parallel prefix-sum, analytics |
| V_RED_SUM etc. | Tree reductions | LLM sampling, softmax |
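The block-scan row of the table can be made concrete with a two-phase prefix sum: independent local scans per block, followed by a fix-up pass that adds each block's carry into later blocks. This is a simplified software stand-in for the speculative MMV-RAM scheme, not its matrix-multiply formulation:

```python
def blocked_inclusive_scan(x, block=4):
    """Inclusive prefix sum via block-local scans plus a carry fix-up.

    Phase 1 scans each block independently (parallelizable across blocks);
    phase 2 propagates the running total of preceding blocks as a carry.
    """
    # Phase 1: local inclusive scan inside every block.
    blocks = [x[i:i + block] for i in range(0, len(x), block)]
    local = []
    for b in blocks:
        acc, scanned = 0, []
        for v in b:
            acc += v
            scanned.append(acc)
        local.append(scanned)
    # Phase 2: add the carry (sum of all earlier blocks) to each block.
    carry, out = 0, []
    for scanned in local:
        out.extend(v + carry for v in scanned)
        carry += scanned[-1]
    return out
```

In the MMV-RAM formulation, phase 1 is expressed as a block-wise matrix-vector product with an upper-triangular matrix, which is what lets the matrix unit collapse the scan's depth.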
3. Vector Primitives in Modern Architectures and ISAs
Hardware vector units, NPUs, AI accelerators, and general-purpose CPUs expose non-GEMM primitives as first-class ISA operations or optimized kernels:
- SIMD and vector ISAs (e.g., SSE/AVX/AVX512, ARM-SVE, RISC-V V) map DOT/AXPY/SCAL and sliding-window convolution to fused, aligned memory accesses, and wide-lane FMAs (Progsch et al., 2011, Snytsar, 2023, Gupta et al., 2022).
- Matrix-matrix and outer-product cores: ARM SME and IBM MMA directly expose vector outer products that accumulate two vector registers into one n×n matrix register, making natively-accelerated stencil matrixization feasible (Zhao et al., 2023).
- Custom NPUs (e.g., d-PLENA): The d-PLENA ISA specifies specialized vector reduction, elementwise, and mask primitives for LLM diffusion sampling (max+index, sum, exp, reciprocal, top-k mask, masked select) plus SRAM-move operations. These primitives provide in-place memory reuse, log-depth reduction trees, and hardware-optimized data movement (Lou et al., 28 Jan 2026).
- MMV-RAM: The MMV-RAM model combines an AC[0] vector unit with a small-dimension matrix-multiplication unit (MMU). Matrix multiplication accelerates block-wise scan, compress, and diff, achieving lower kernel depth than vector-only (AC[0]) circuits can attain (Sobczyk et al., 30 Jun 2025).
Hardware units supporting non-GEMM vector primitives often decouple data/control paths, use register-level tiling/unrolling, and rely on banked on-chip SRAM to support the irregular access patterns typical of reductions and scan operations.
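The log-depth reduction trees these units implement in hardware can be sketched in software. The pairwise tree below mimics a V_RED_MAX_IDX-style max-with-index reduction; it is illustrative of the tree structure, not the actual d-PLENA semantics:

```python
def tree_reduce_max_idx(v):
    """Max + argmax via a log-depth pairwise reduction tree.

    Each level halves the number of surviving (index, value) candidates,
    mirroring the lane pairing a hardware reduction tree performs: a
    vector of N lanes is reduced in ceil(log2(N)) levels, not N steps.
    """
    pairs = list(enumerate(v))            # one (index, value) per lane
    while len(pairs) > 1:
        nxt = []
        for i in range(0, len(pairs) - 1, 2):
            a, b = pairs[i], pairs[i + 1]
            nxt.append(a if a[1] >= b[1] else b)   # winner of each pair
        if len(pairs) % 2:                # odd leftover lane passes through
            nxt.append(pairs[-1])
        pairs = nxt
    return pairs[0]                       # (argmax, max)
```

Ties resolve toward the lower index here; a real ISA would fix this convention in its specification.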
4. Theoretical Properties and Lower Bounds
Several research studies provide formal complexity and circuit-theoretic distinctions between GEMM and non-GEMM primitives:
- Circuit complexity separation: Parity and segmented scan cannot be realized in uniform AC[0], so vector-only (AC[0]-circuit) implementations require super-constant depth. MMV-RAM's use of matrix-multiply units achieves shallow-depth scans (Sobczyk et al., 30 Jun 2025).
- Probabilistic guarantees: For high-dimensional associative memories (e.g., HBF), circular convolution/correlation enables high-dimensional key-pointer binding with superposition. One-shot decoding attains a failure probability exponentially small in the vector dimension under both memory and query noise, provided the dimension scales sufficiently with the number of stored items (Alpay et al., 28 Jan 2026).
- Optimality in memory traffic: Sliding-window convolution achieves strictly sequential accesses, reduces memory pressure, and, unlike im2col+GEMM, eliminates working-set expansion, leading to optimal use of DRAM bandwidth for large filters (Snytsar, 2023).
- Asymptotic performance: Fused O(N) vector loops for DOT, SCAL, and AXPY with aggressive register blocking, minimal temporaries, and alignment-aware unrolled loops attain the throughput bounds of hand-written microbenchmarks and outperform generic template libraries (Progsch et al., 2011, Singh et al., 2021).
5. Applications and Performance Results
Non-GEMM vector primitives are essential in diverse application domains:
- Deep Learning and CNN Inference: Sliding-window convolution and Winograd transform kernels outperform GEMM-based alternatives, particularly for large 2D filters and wide-vector ISAs, yielding substantial end-to-end speedups and significant GFLOPS improvements (Snytsar, 2023, Gupta et al., 2022).
- LLM Sampling and Diffusion Models: Diffusion LLM sampling is governed by vocabulary-wide reductions (max, sum, top-k) and masked/scatter updates, latency-bound by memory bandwidth and reduction trees. The d-PLENA vector ISA achieves large speedups over GEMM-centric NPUs (Lou et al., 28 Jan 2026).
- Molecular and High-Dimensional Storage: HBF primitives facilitate content-addressable lookup in DNA archives using circular correlation, outperforming pointer-chasing in reliability and access latency under molecular noise (Alpay et al., 28 Jan 2026).
- Stencil Computations and Scientific HPC: Outer-product-based stencil matrixization on hardware outer-product units yields significant speedups over auto-vectorized and temporal-vectorization approaches, maximizing in-register coefficient and input vector reuse (Zhao et al., 2023).
- Segmented Operations and Analytics: Segmented scan, sum, compress, and diff realize efficient parallel prefix and reduction patterns for analytics workloads. MMU-augmented MMV-RAM implementations approach the theoretical lower-bound depth, closing the performance gap with vector-only implementations (Sobczyk et al., 30 Jun 2025).
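The separable Winograd transforms mentioned for CNN inference can be made concrete with the classic F(2,3) case, which produces two outputs of a 3-tap filter using 4 elementwise multiplies instead of 6 (the matrices follow the standard Lavin-Gray construction, hand-unrolled here the way a vector kernel would be):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap 1-D convolution from a
    4-element input tile, using 4 multiplies instead of the direct 6.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Input transform B^T d and filter transform G g, hand-unrolled.
    bt_d = (d0 - d2, d1 + d2, d2 - d1, d1 - d3)
    g_g = (g0, (g0 + g1 + g2) / 2.0, (g0 - g1 + g2) / 2.0, g2)
    # Elementwise (Hadamard) product: the 4 multiplies.
    m = [a * b for a, b in zip(bt_d, g_g)]
    # Output transform A^T m recovers the two convolution outputs.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]
```

In a 2-D CNN kernel the same three transforms are applied separably along rows and columns of each tile, with the filter transform `G g` hoisted out of the inner loop since it depends only on the weights.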
6. Limitations, Trade-offs, and Future Directions
While non-GEMM primitives deliver substantial benefits, several architectural and algorithmic constraints apply:
- Register and hardware constraints: Tuning unroll factors, register blocking, and tile sizes is required to avoid pipeline stalls and maximize register file utilization. Excessively large vector lengths or insufficient matrix/register units can bottleneck outer-product and scan-based kernels (Progsch et al., 2011, Zhao et al., 2023).
- Memory-bound regimes: For small filter sizes, sliding-window convolution can be memory-bandwidth limited unless fully unrolled custom kernels are used (Snytsar, 2023).
- ISA and data movement constraints: The separation of index, scalar, and vector data in on-chip SRAM (as in d-PLENA) minimizes interference but may complicate programming. Decoupled mixed-precision data paths require explicit mapping for deep learning workloads (Lou et al., 28 Jan 2026).
- Numeric stability: Winograd transforms for large tile sizes may incur numerical instability; empirical tuning of evaluation points is recommended (Gupta et al., 2022).
- Portability and auto-tuning: Performance often depends on careful selection of blocking, tiling, and vectorization parameters. Auto-tuning and hierarchical parallelism are suggested for robust performance across architectures (Singh et al., 2021).
Ongoing and future directions focus on expanding non-GEMM primitive coverage in ISAs, compiler auto-tuning for register-level patterns, integration with model compression/quantization, and further circuit-theoretic analysis of depth/area tradeoffs in emerging accelerator models.
Key References:
- (Progsch et al., 2011)
- (Singh et al., 2021)
- (Snytsar, 2023)
- (Zhao et al., 2023)
- (Sobczyk et al., 30 Jun 2025)
- (Lou et al., 28 Jan 2026)
- (Gupta et al., 2022)
- (Alpay et al., 28 Jan 2026)