Winograd Convolution in CNNs

Updated 13 February 2026
  • Winograd convolution is an algorithmic framework that reduces arithmetic complexity in CNNs by transforming spatial multiplications into more efficient element-wise operations.
  • It employs specialized transforms based on Vandermonde and polynomial interpolation methods to achieve significant reduction in multiplication counts, e.g., 36 to 16 for F(2×2, 3×3).
  • Extensions of the approach address numerical stability, quantization, sparsity, and hardware acceleration, making it essential for efficient deep learning inference.

Winograd convolution is an algorithmic framework for reducing the arithmetic complexity of small-size convolutions, especially those appearing in deep convolutional neural networks (CNNs). By transforming both input tiles and convolution kernels into a special algebraic domain—often through polynomials and their evaluations at cleverly chosen points—Winograd convolution replaces most spatial-domain multiplications with less costly element-wise operations and structured additions. Its optimality, widespread use in high-performance deep learning hardware and libraries, and ongoing extensions for quantization, sparsity, and hardware efficiency make it a central technique in efficient AI inference.

1. Mathematical Formulation and Principles

Winograd's minimal filtering algorithm is based on the observation that for a 1D convolution producing m outputs with an r-tap filter, the minimum number of scalar multiplications required is m + r − 1, compared with the m × r multiplies of direct convolution. The core procedure generalizes to 2D via tensor and matrix constructions. For a 2D CNN convolution, an r × r kernel is convolved over tiles of size m × m (the "output tile") using three fixed linear transforms B, G, and A:

  • Data transform: d̃ = Bᵀ d B
  • Weight transform: g̃ = G g Gᵀ
  • Element-wise multiplication: M = g̃ ⊙ d̃
  • Inverse transform: Y = Aᵀ M A

For F(m×m, r×r), the number of core multiplies per tile is (m + r − 1)², yielding an arithmetic reduction factor of r²m² / (m + r − 1)² over direct convolution. In the canonical F(2×2, 3×3) case, this is 36 → 16 multiplies per output tile, a 2.25× reduction (Tong et al., 2021, Xue et al., 2022).

The transform matrices A, B, G are usually specialized Vandermonde or Lagrange interpolation forms, derived via the Chinese Remainder Theorem for polynomials so that interpolation and de-interpolation preserve exactness of the computed output coefficients (Barabasz et al., 2019, Tong et al., 2021).
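The transform pipeline above can be checked numerically. The sketch below uses the standard F(2, 3) / F(2×2, 3×3) transform matrices (the Lavin–Gray constants); the tile and filter values are arbitrary illustrative data:

```python
import numpy as np

# Standard F(2, 3) transform matrices: 2 outputs, 3-tap filter, 4-point tile.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

d = np.array([1.0, 2.0, 3.0, 4.0])   # 1D input tile (4 points)
g = np.array([0.5, -1.0, 2.0])       # 3-tap filter

# Winograd path: 4 element-wise multiplies instead of 6.
y_wino = AT @ ((G @ g) * (BT @ d))

# Direct path: valid correlation, y[i] = sum_j g[j] * d[i + j].
y_direct = np.array([g @ d[i:i + 3] for i in range(2)])
assert np.allclose(y_wino, y_direct)   # both give [4.5, 6.0]

# The 2D F(2x2, 3x3) case nests the same transforms on both sides,
# exactly as in the bullet list above: Y = A^T [(G g G^T) . (B^T d B)] A.
g2, d2 = np.outer(g, g), np.outer(d, d)   # example 3x3 kernel, 4x4 tile
Y = AT @ ((G @ g2 @ G.T) * (BT @ d2 @ BT.T)) @ AT.T
Y_direct = np.array([[np.sum(g2 * d2[i:i + 3, j:j + 3]) for j in range(2)]
                     for i in range(2)])
assert np.allclose(Y, Y_direct)
```

The 2D case uses 16 element-wise multiplies per 2×2 output tile versus 36 for direct convolution, matching the 2.25× reduction quoted above.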

2. Algorithmic Extensions and Numerical Stability

Winograd convolution for larger tile or kernel sizes (m, r ≫ 2, 3) is challenged by numerical instability, as the condition numbers of the Vandermonde-based transforms grow rapidly. This leads to catastrophic amplification of floating-point errors, especially in FP16 or INT8 datatypes (Lohia, 20 Dec 2025). Empirically, the standard integer-point transforms for F(6,3) and F(8,3) become numerically unusable in low precision, destroying inference accuracy (Lohia, 20 Dec 2025).

Recent advances recast Winograd point selection as a continuous numerical optimization problem (NOVA), searching for well-conditioned sets of interpolation points, often simple rationals, that drastically reduce the spectral condition numbers of the transforms. For example, F(8,3) configurations discovered via numerical search and symbolic verification achieve up to 415× better conditioning in 1D (1.7×10⁵× in 2D), restoring full FP16/INT8 accuracy with no retraining (Lohia, 20 Dec 2025). Similar benefits are obtained using symmetric reciprocal points, e.g., {±c, ±1/c} for continuous c, reducing practical errors and yielding up to 59% lower L1 error per layer (Alam et al., 2022).
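The link between interpolation-point choice and stability can be seen by comparing condition numbers of the Vandermonde matrices from which the transforms are built. The point sets below are illustrative, not the exact configurations from the cited papers:

```python
import numpy as np

def vandermonde_cond(points):
    """Spectral condition number of the square Vandermonde matrix
    V[i, j] = p_i ** j. Winograd transform matrices are constructed from
    such matrices and their inverses, so ill-conditioned points translate
    directly into unstable transforms."""
    V = np.vander(np.asarray(points, dtype=np.float64), increasing=True)
    return np.linalg.cond(V)

# Naive consecutive integer points vs. a symmetric set near the unit
# circle containing the reciprocal pair {2, 1/2} (in the spirit of the
# {±c, ±1/c} construction; both sets are illustrative examples).
cond_int = vandermonde_cond([0, 1, 2, 3, 4, 5])
cond_rec = vandermonde_cond([0, 1, -1, 2, 0.5, -0.5])
# Points clustered near the unit circle keep the system far better
# conditioned, which is what NOVA-style searches optimize for.
```

Under this measure, the consecutive-integer set is orders of magnitude worse conditioned than the symmetric reciprocal set, mirroring why integer-point transforms for large tiles fail in FP16/INT8.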

Error analysis has also motivated mixed-precision schemes in which transforms are computed in higher precision (FP64), followed by lower-precision elementwise multiplies and accumulation, along with canonical summation orderings (e.g., Huffman trees) to minimize rounding error (Barabasz et al., 2018).

3. Sparsity, Pruning, and Structured Acceleration

Standard Winograd transformation destroys spatial sparsity introduced by pruning, since the linear transform fills in zeros. To restore the multiplicative benefits of both, two primary strategies are pursued:

  • Pruning in the Winograd domain: Directly masking and retraining the sparse parameters after the transform, managing the redundancy and overparameterization (the Winograd domain has more degrees of freedom than the spatial filter). Native Winograd-layer pruning achieves up to 90% Winograd-domain sparsity with negligible (0.1%) accuracy loss and a 5.4× speed-up over direct convolution (Li et al., 2017).
  • Spatial-Winograd pruning: First prunes spatial-domain weights in "Winograd-aware" groups, then performs fine-grained, importance-factor-weighted pruning and retraining in the Winograd domain. This achieves 63%, 50%, and 74% Winograd-domain sparsity (CIFAR-10, CIFAR-100, ImageNet) with no loss in accuracy and no architectural change, outperforming both baseline Winograd-ReLU and native sparse Winograd (Yu et al., 2019).

Introducing ReLU directly in the Winograd domain has been shown to induce substantial activation sparsity, facilitating further multiplier reductions. Winograd-ReLU approaches can realize total arithmetic reductions (counting both pruned weights and sparse activations) of 6.8× to 13× for typical CNN/ResNet architectures (Liu et al., 2018, Yu et al., 2019).
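A minimal sketch of combining Winograd-domain pruning with Winograd-ReLU on a single F(2×2, 3×3) tile; the 50% pruning ratio, random data, and standard transform matrices are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

# Standard F(2x2, 3x3) transforms (illustrative single-tile sketch).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

rng = np.random.default_rng(0)
g = rng.standard_normal((3, 3))   # spatial 3x3 kernel
d = rng.standard_normal((4, 4))   # 4x4 input tile

gw = G @ g @ G.T                  # Winograd-domain weights: 16 free parameters
dw = BT @ d @ BT.T                # transformed input tile

# Winograd-domain pruning: zero the smallest-magnitude transformed weights
# (here an illustrative 50% of the 16 taps).
thresh = np.quantile(np.abs(gw), 0.5)
gw_pruned = np.where(np.abs(gw) >= thresh, gw, 0.0)

# Winograd-ReLU: the nonlinearity acts on the *transformed* activations, so
# zeros in either operand let hardware skip element-wise multiplies.
dw_relu = np.maximum(dw, 0.0)

y = AT @ (gw_pruned * dw_relu) @ AT.T   # 2x2 output tile
```

Because both the pruning mask and the ReLU zeros live in the same 4×4 transformed tile, their sparsity patterns compose multiplicatively in the element-wise product, which is the source of the combined arithmetic reductions quoted above.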

4. Quantization and Efficient Integer Implementations

Pure Winograd transforms challenge low-precision hardware due to wide dynamic ranges in transformed activations and weights, particularly for larger tiles. Several key strategies have enabled efficient quantized Winograd inference:

  • Integer-friendly transform construction: Extensions to the field of complex numbers yield transforms with smaller denominators, thereby reducing intermediate operand widths (e.g., from 18 bits to 12 bits). Conjugate symmetry and per-tile scaling further reduce the required precision, achieving a 3.13× arithmetic reduction over direct convolution and hardware throughput gains of up to 17.4% over baseline rational transforms (Meng et al., 2019).
  • LANCE algorithm: Linear per-tensor quantization plus low-precision Winograd-domain operations (with zero-point adjustment) supports 8-bit integer inference with only 0.2% loss in ImageNet accuracy and a 2.4× convolution speedup (Li et al., 2020).
  • Tap-wise/group-wise quantization: Assigns independent scaling factors to each tap or small group in the transformed tile, balancing dynamic range and enabling near-lossless quantization, especially for larger tiles like F(4,3) or F(6,3). Tap-wise quantization with learned scaling can recover >99.5% of FP32 accuracy for ResNet-34/ImageNet (Andri et al., 2022). Data-free learning of group scales for Winograd enables calibration-free, fully quantized (INT8) inference for large text-to-image diffusion models and classification CNNs, outperforming standard post-training quantization by 1.6%–2.6% Top-1 (Pan et al., 2024).
  • Residue Number System (RNS) Winograd: Implements large-tile Winograd exactly in k parallel residue channels (with CRT reconstruction), achieving up to 7.03× complexity reduction for 5×5 kernels and a 1.5×–2.3× measured speedup in INT8/INT16 inference with no accuracy loss (Liu et al., 2020).
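The benefit of tap-wise scales can be sketched directly: when the transform spreads dynamic range unevenly across tap positions, per-tap maxima yield a finer quantization grid than one per-tensor scale. The dynamic-range model, shapes, and seed below are illustrative assumptions, not the cited papers' setup:

```python
import numpy as np

def quantize_sym(x, scale):
    """Symmetric int8 quantize/dequantize with the given scale(s)."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(1)
# Emulate a 6x6 transformed tile (F(4, 3)-sized) whose taps have very
# different magnitudes, as the Winograd transform induces in practice.
tap_mag = np.exp(rng.uniform(-2, 2, size=(6, 6)))
W = tap_mag[None] * rng.standard_normal((64, 6, 6))  # 64 filters

per_tensor_scale = np.abs(W).max() / 127.0           # one scale for everything
per_tap_scale = np.abs(W).max(axis=0) / 127.0        # one scale per tap position

err_tensor = np.abs(quantize_sym(W, per_tensor_scale) - W).mean()
err_tap = np.abs(quantize_sym(W, per_tap_scale) - W).mean()
```

With this spread of tap magnitudes, `err_tap` comes out well below `err_tensor`: small-magnitude taps are no longer forced onto the coarse grid dictated by the largest tap in the whole tensor.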

5. Large-Kernel, Strided, and Generalized Convolutions

The arithmetic savings of Winograd escalate with kernel and tile size, but direct extensions are numerically or computationally prohibitive. Novel algorithmic generalizations provide practical remedies:

  • Nested Winograd: Recursively decomposes a large-kernel convolution into a hierarchy of small Winograd kernels (akin to the Cooley–Tukey FFT). This markedly reduces multiplication counts compared to linear splitting; e.g., for 31×31 kernels it yields up to a 10.5× reduction over naively applied linear Winograd (Jiang et al., 2021).
  • Decomposable Winograd Method (DWM): Breaks large or strided kernels into small stride-1 Winograd tiles, enabling universal acceleration for arbitrary kernel size and stride, with ~2× FLOP savings and negligible accuracy loss compared to direct methods (Huang et al., 2020).
  • Beyond Toom–Cook (higher-degree moduli): The polynomial modulus degree is allowed to exceed 1, with nested application of the Chinese Remainder Theorem, to build transforms with improved numerical stability and lower operation count, especially for low precision (e.g., FP16, BF16) (Barabasz et al., 2019).
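The decomposition idea behind DWM is easy to verify in 1D: a stride-2 convolution splits exactly into stride-1 convolutions over the even and odd phases of the input and kernel, each of which is then Winograd-friendly. This is a sketch of the identity only; the cited paper generalizes it to 2D and arbitrary strides:

```python
import numpy as np

def corr_valid(x, k, stride=1):
    """Direct 'valid' correlation: y[i] = sum_j k[j] * x[stride*i + j]."""
    n = (len(x) - len(k)) // stride + 1
    return np.array([x[stride * i : stride * i + len(k)] @ k for i in range(n)])

rng = np.random.default_rng(2)
x = rng.standard_normal(12)   # illustrative 1D input
k = rng.standard_normal(5)    # illustrative 5-tap kernel

y_strided = corr_valid(x, k, stride=2)

# Split input and kernel into even/odd phases; the stride-2 result is the
# sum of two stride-1 correlations (one per phase).
xe, xo = x[0::2], x[1::2]
ke, ko = k[0::2], k[1::2]
n = len(y_strided)
y_dwm = corr_valid(xe, ke)[:n] + corr_valid(xo, ko)[:n]

assert np.allclose(y_strided, y_dwm)
```

Each phase correlation is an ordinary stride-1 small convolution, so it can be accelerated with the standard F(m, r) transforms above without any stride-aware modification.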

6. Hardware and Systems Implementations

Winograd convolution is a core primitive in modern high-performance DNN libraries (TensorRT, cuDNN, TVM, NNPACK) and is implemented in both CPUs and accelerators (FPGAs, ASICs, DSAs):

  • Fused transforms and register-level blocking: Integrating the input, filter, and output transforms directly into GEMM (via z-shaped layouts or systolic arrays) minimizes transformation overhead, maximizes SIMD/FMA utilization, and yields speedups of up to 4.7× over standard libraries such as NCNN and NNPACK on ARMv8 CPUs (Gui et al., 2024).
  • Systolic Winograd accelerators (WinoCNN): Kernel-sharing PE architectures support multiple (m, k) pairs (e.g., F4, F6) with optimized memory pipelining, achieving up to 1.33 GOPS/DSP and substantial throughput gains over prior FPGA designs (Liu et al., 2021).
  • DSA integration and custom transform engines: Row-by-row and tap-by-tap Winograd engines within industrial DSAs allow tile-integer-only processing, shift-only rescaling, and double-buffered memory moves, delivering up to 1.85× energy savings and 1.83× throughput gains for SOTA models (Andri et al., 2022).

7. Non-Standard Variants and Extensions

Beyond standard multiplicative convolution, Winograd's framework has been adapted to alternative operations:

  • AdderNets (ℓ1-norm convolutions): Standard distributivity does not hold for the ℓ1-norm. Modified transform triples Ã, B̃, G̃ with balanced sign structure are constructed so that the element-wise operator becomes absolute difference and accumulation. This reduces additions by ~50% versus naive AdderNets and yields a 2× energy reduction on FPGA (Li et al., 2021).
  • Fault tolerance and resilience: The reduction of multiplications (most sensitive to bit-flips) and the increased prevalence of additions in Winograd convolution confer inherent resilience to transient faults. Selective triple modular redundancy (TMR) on Winograd accelerators achieves 27.5%–70% overhead reductions versus spatial TMR at near-zero accuracy loss, and voltage scaling under Winograd gives up to 7.19% incremental energy savings (Xue et al., 2022).
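The distributivity failure that motivates the AdderNet-specific transforms can be demonstrated directly: naively swapping the element-wise multiply for a negative absolute difference inside the standard F(2, 3) transforms does not reproduce the direct ℓ1 outputs. A minimal 1D check with arbitrary values:

```python
import numpy as np

# Standard F(2, 3) transforms; valid for multiplicative convolution only.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

d = np.array([1.0, 2.0, -1.0, 0.5])
g = np.array([0.5, -1.0, 2.0])

# Direct AdderNet-style l1 outputs: y[i] = -sum_j |d[i + j] - g[j]|.
y_direct = np.array([-np.abs(d[i:i + 3] - g).sum() for i in range(2)])

# Naive substitution of absolute difference for the element-wise multiply:
# the transforms only distribute over multiplication, so this is wrong.
y_naive = AT @ (-np.abs((BT @ d) - (G @ g)))

assert not np.allclose(y_direct, y_naive)   # the identity breaks
```

This is why the cited work must derive new transform triples with balanced sign structure rather than reuse the multiplicative A, B, G.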

In summary, Winograd convolution delivers provably minimal, and in practical cases, transformative reductions in convolution arithmetic. Its continuing algorithmic, numeric, and hardware advances have extended its utility to very large tiles, ultra-low-precision inference, structured and unstructured sparsity, large-kernel and strided convolutions, and even alternative nonlinear operations, with robust solutions for numerical stability, quantization, and efficient realization across diverse computing substrates (Lohia, 20 Dec 2025, Tong et al., 2021, Yu et al., 2019, Pan et al., 2024, Li et al., 2017).
