
Large-Kernel Depthwise Convolution

Updated 8 February 2026
  • Large-Kernel Depthwise Convolution is a convolution operator that applies a separate large spatial filter per channel, effectively increasing the receptive field while preserving parameter efficiency.
  • It employs advanced techniques such as structural re-parameterization, kernel factorization, and dilated convolutions to mitigate optimization challenges and reduce computation costs.
  • Practical applications include improved image segmentation, object detection, speech enhancement, and 3D medical imaging, demonstrating significant gains in accuracy and robustness.

Large-kernel depthwise convolution is a class of convolutional operator in which a separate large spatial filter (kernel size $K \times K$, with $K \gg 3$) is applied independently to each channel of an input tensor. This operator simultaneously expands the effective receptive field of the network and preserves parameter and compute efficiency compared to dense large-kernel convolutions. In modern convolutional neural networks (CNNs), large-kernel depthwise convolutions are critical for capturing global spatial context, mimicking long-range dependencies found in vision transformers, and enabling high accuracy in dense prediction tasks, mobile inference, and specialized domains such as speech or 3D medical imaging.

1. Mathematical Formulation and Scaling Properties

Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$ and a per-channel kernel $W \in \mathbb{R}^{C \times K \times K}$, the depthwise convolution produces output $Y \in \mathbb{R}^{C \times H \times W}$ via

$$Y^{(c)}_{i,j} = \sum_{u=0}^{K-1} \sum_{v=0}^{K-1} X^{(c)}_{i+u,\,j+v}\, W^{(c)}_{u,v}$$

The parameter count is $C\,K^2$, and FLOPs for stride-1 inference are $C\,K^2\,H\,W$. Dilation $d$ enlarges the effective kernel support to $(K-1)d + 1$ pixels in each spatial dimension but does not change the arithmetic count:

$$Y^{(c)}_{i,j} = \sum_{u=0}^{K-1} \sum_{v=0}^{K-1} X^{(c)}_{i+ud,\,j+vd}\, W^{(c)}_{u,v}$$

For 3D data, as in volumetric segmentation, the natural extension

$$Y_{b,c,i,j,k} = \sum_{u=0}^{K_d-1} \sum_{v=0}^{K_h-1} \sum_{w=0}^{K_w-1} X_{b,c,i+u,\,j+v,\,k+w}\, W_{c,0,u,v,w}$$

scales parameter count as $C\,K_d K_h K_w$ and FLOPs accordingly (Lee et al., 26 May 2025).
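As a concrete (unoptimized) reference, the 2D formula above can be sketched directly in NumPy; the function name and explicit loop over kernel taps are illustrative, not taken from any of the cited implementations.

```python
import numpy as np

def depthwise_conv2d(x, w, dilation=1):
    """Naive stride-1, no-padding depthwise convolution (cross-correlation form).

    x: (C, H, W) input; w: (C, K, K), one K x K filter per channel.
    Dilation d widens the support to (K - 1) * d + 1 pixels per axis
    without adding taps, matching the formulas above.
    """
    C, H, W = x.shape
    _, K, _ = w.shape
    span = (K - 1) * dilation + 1
    Ho, Wo = H - span + 1, W - span + 1
    y = np.zeros((C, Ho, Wo))
    for u in range(K):                  # K^2 taps -> C * K^2 * Ho * Wo MACs total
        for v in range(K):
            du, dv = u * dilation, v * dilation
            y += w[:, u, v][:, None, None] * x[:, du:du + Ho, dv:dv + Wo]
    return y

x = np.ones((2, 8, 8))
w = np.ones((2, 3, 3))
assert depthwise_conv2d(x, w).shape == (2, 6, 6)
assert depthwise_conv2d(x, w, dilation=2).shape == (2, 4, 4)  # span 5, still 9 taps
```

Each channel is filtered independently, so the parameter count is $C\,K^2$ regardless of how many channels the layer has relative to a dense convolution's $C_{in} C_{out} K^2$.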

2. Motivations and Effects of Large Spatial Kernels

Large kernels directly expand the receptive field, which is otherwise only slowly enlarged through deep stacking of small (e.g., $3\times3$) convolutions. Theoretical analyses show that the effective receptive field (ERF) radius in a stack of $L$ $k \times k$ layers grows as $\mathcal{O}(k\sqrt{L})$, so increasing kernel size yields linear ERF gains while adding depth offers only a sublinear effect (Ding et al., 2022). Empirical studies demonstrate that, for moderate $K$ (up to $31 \times 31$ in RepLKNet), large-kernel depthwise convolutions confer:

  • Substantially larger ERFs and increased shape bias (reducing reliance on local texture cues),
  • Improved accuracy in dense prediction, segmentation, and classification,
  • Robustness to small-object detection and image corruptions,
  • Significant computational efficiency when compared to full (regular) convolutions with the same receptive field.

However, while very large single kernels ($K \gtrsim 11$) can initially improve accuracy, beyond a threshold model performance may degrade due to excessive parameterization and optimization instability unless mitigated by appropriate architectural choices (e.g., kernel mixing, separability, re-parameterization) (Tan et al., 2019, Lau et al., 2023).
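The $\mathcal{O}(k\sqrt{L})$ scaling makes the depth-versus-kernel-size trade concrete; a back-of-envelope check (the helper name is ours, not from the cited papers):

```python
import math

# ERF radius of L stacked k x k layers grows ~ k * sqrt(L) (Ding et al., 2022),
# so matching the ERF of one K x K kernel needs roughly L = (K / k)^2 layers.
def layers_to_match_erf(K, k):
    return math.ceil((K / k) ** 2)

assert layers_to_match_erf(7, 3) == 6     # one 7x7 kernel ~ six 3x3 layers
assert layers_to_match_erf(31, 3) == 107  # one 31x31 kernel ~ a hundred 3x3 layers
```

This is why widening the kernel, rather than deepening the stack, is the efficient route to transformer-scale receptive fields.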

3. Architectural Strategies and Variants

Multiple high-impact strategies are employed to amortize the quadratic cost and optimize the representational capacity of large-kernel depthwise convolutions:

a. Direct Large-Kernel Depthwise Convolutions

Directly instantiate $K \times K$ depthwise filters for each channel. This approach is feasible for moderate $K$ ($5 \leq K \leq 31$) on modern hardware, particularly if batch normalization and structural re-parameterization are employed to ease training and inference (Ding et al., 2022, Chen et al., 2021).

b. Structural Re-parameterization

Auxiliary branches (e.g., $3\times3$ or identity) are added during training to ease optimization, and all branches are merged into a single $K\times K$ kernel at inference. This preserves the expressivity and stability of large filters with no run-time penalty (Ding et al., 2022, Lee et al., 26 May 2025).
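The merge step can be verified numerically: zero-padding the small training-time kernel to $K\times K$ and adding it into the large kernel reproduces the two-branch output exactly. A minimal sketch with 'same' padding (in practice, batch-norm statistics are folded into each branch's kernel before merging):

```python
import numpy as np

def dwconv_same(x, w):
    """'Same'-padded naive depthwise conv; x: (C, H, W), w: (C, K, K), K odd."""
    C, H, W = x.shape
    _, K, _ = w.shape
    p = K // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    y = np.zeros_like(x)
    for u in range(K):
        for v in range(K):
            y += w[:, u, v][:, None, None] * xp[:, u:u + H, v:v + W]
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 16))
w_big = rng.normal(size=(4, 13, 13))    # large-kernel branch
w_small = rng.normal(size=(4, 3, 3))    # auxiliary 3x3 training branch

# Training-time output: sum of the two parallel branches.
y_train = dwconv_same(x, w_big) + dwconv_same(x, w_small)

# Inference: zero-pad the 3x3 kernel to 13x13 (center-aligned) and add it in.
pad = (13 - 3) // 2
w_merged = w_big + np.pad(w_small, ((0, 0), (pad, pad), (pad, pad)))
y_infer = dwconv_same(x, w_merged)

assert np.allclose(y_train, y_infer)    # identical outputs, one branch at run time
```

Because convolution is linear in the kernel, the merge is exact, not an approximation.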

c. Kernel Factorization and Separable Approximations

To further reduce parameter and compute cost, large 2D kernels are decomposed into cascaded or parallel 1D filters (e.g., $K\times1$ and $1\times K$) or into spatially separable convolutional blocks. Approaches such as XSepConv (Chen et al., 2020) and LSKA (Lau et al., 2023) replace a $K\times K$ kernel with $(K\times1)+(1\times K)$ convolutions, augmented with $2\times2$ or $3\times3$ local convolutions to capture diagonals and local interactions, achieving an overall parameter count of $O(KC)$ instead of $O(K^2 C)$.

Strategy      | Parameter count        | FLOPs
------------- | ---------------------- | --------------------------
Direct DWConv | $C K^2$                | $C K^2 H W$
XSepConv (2D) | $(2K+4)C$              | $(2K+4)C H W$
LSKA          | $4(K/2)C$              | $4(K/2)C H W$
MixConv       | $\sum_t c_t k_t^2$     | $\sum_t c_t k_t^2 H W$
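The separable factorizations above are exact whenever the 2D kernel is rank-1 per channel (an outer product of a column and a row filter); this can be checked directly with a naive helper (illustrative only):

```python
import numpy as np

def conv_valid(x, w):
    """Naive 'valid' depthwise conv; w: (C, Kh, Kw) also supports strip kernels."""
    C, H, W = x.shape
    _, Kh, Kw = w.shape
    y = np.zeros((C, H - Kh + 1, W - Kw + 1))
    for u in range(Kh):
        for v in range(Kw):
            y += w[:, u, v][:, None, None] * x[:, u:u + y.shape[1], v:v + y.shape[2]]
    return y

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 12, 12))
col = rng.normal(size=(3, 7))           # per-channel K x 1 filters
row = rng.normal(size=(3, 7))           # per-channel 1 x K filters

# Cascade (K x 1) then (1 x K): 2KC = 42 weights here...
y_cascade = conv_valid(conv_valid(x, col[:, :, None]), row[:, None, :])
# ...equals one K x K conv with the rank-1 outer-product kernel: K^2 C = 147.
y_full = conv_valid(x, col[:, :, None] * row[:, None, :])
assert np.allclose(y_cascade, y_full)
```

Real learned kernels are not exactly rank-1, which is why XSepConv and LSKA add small 2D convolutions alongside the strip filters.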

d. Kernel Mixing and Pyramid Modules

PydMobileNet and MixConv (Hoang et al., 2018, Tan et al., 2019) employ multiple parallel depthwise filters of varying sizes ($3\times3$, $5\times5$, $7\times7$, $9\times9$) across the channel dimension, followed by fusion (addition or concatenation) and a $1\times1$ pointwise convolution. This explicitly aggregates multi-scale spatial features within a block and increases ERF diversity while maintaining efficiency.
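A minimal MixConv-style sketch (our own simplification with 'same' padding and concatenation fusion, not the reference implementation): channels are partitioned into groups, and each group is filtered at its own kernel size.

```python
import numpy as np

def mixconv(x, kernels):
    """x: (C, H, W); kernels: list of (c_t, K_t, K_t) per-group filter banks,
    with sum(c_t) == C. Each channel group gets its own depthwise kernel size."""
    outs, start = [], 0
    for w in kernels:
        c_t, K, _ = w.shape
        p = K // 2
        xg = np.pad(x[start:start + c_t], ((0, 0), (p, p), (p, p)))
        y = np.zeros((c_t,) + x.shape[1:])
        for u in range(K):
            for v in range(K):
                y += w[:, u, v][:, None, None] * xg[:, u:u + x.shape[1],
                                                    v:v + x.shape[2]]
        outs.append(y)
        start += c_t
    return np.concatenate(outs, axis=0)   # (C, H, W): multi-scale, channels kept

x = np.ones((8, 10, 10))
kernels = [np.ones((4, 3, 3)), np.ones((2, 5, 5)), np.ones((2, 7, 7))]
assert mixconv(x, kernels).shape == (8, 10, 10)
assert sum(w.size for w in kernels) == 4 * 9 + 2 * 25 + 2 * 49  # sum_t c_t k_t^2
```

The final assertion matches the MixConv row of the table above: the parameter count is the channel-weighted sum of the per-group kernel areas.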

e. Decomposed and Dilated Depthwise Kernels

DLKCB (Luo et al., 2022) and similar modules split a large $K\times K$ filter into cascaded smaller depthwise convolutions, one of which is dilated to maintain an equivalent ERF, dramatically reducing parameters and compute while retaining large spatial context.
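The receptive-field bookkeeping behind such decompositions is easy to check; the 5/7/dilation-3 split below is a hypothetical DLKCB-style configuration chosen for illustration, not the exact one from the paper.

```python
# Effective span of a cascade of stride-1 convolutions:
#   span = 1 + sum_i (k_i - 1) * d_i   for kernel sizes k_i and dilations d_i.
def cascade_span(layers):
    """layers: list of (kernel_size, dilation) tuples, applied in sequence."""
    return 1 + sum((k - 1) * d for k, d in layers)

# A 5x5 conv followed by a 7x7 conv with dilation 3 spans 23 pixels,
# covering what a single 23x23 kernel would...
assert cascade_span([(5, 1), (7, 3)]) == 23
# ...with 5*5 + 7*7 = 74 weights per channel instead of 23*23 = 529.
assert 5 * 5 + 7 * 7 == 74
```

The dilated member supplies the long reach while the dense member fills in the local support the dilation skips over.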

f. Inception-Style and Anisotropic Designs

Inception Depthwise Convolution (IDConv) (Tang et al., 18 Nov 2025) dissects a $K\times K$ filter into parallel $3\times3$, $1\times K$, $K\times1$, and identity subbranches, allocating capacity to both local and highly anisotropic patterns (e.g., in speech spectrograms).
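Per-channel parameter budgets make the appeal of such splits evident; the count below assumes, for illustration, that every branch spans all channels (channel-splitting variants cut this further).

```python
# Weights per channel for a direct K x K filter vs. an inception-style split
# into parallel 3x3, 1xK, Kx1, and (parameter-free) identity branches.
def idconv_params_per_channel(K):
    return 3 * 3 + K + K      # local 3x3 plus two anisotropic strips

K = 11
assert K * K == 121                        # direct 11x11 depthwise filter
assert idconv_params_per_channel(K) == 31  # ~4x fewer weights, strips retained
```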

4. Application Domains and Empirical Impact

Large-kernel depthwise convolution is deployed in image classification, detection, segmentation, dehazing, speech enhancement, and volumetric (3D) medical imaging.

  • Vision tasks: RepLKNet (Ding et al., 2022), using depthwise kernels up to $31\times31$, matches or surpasses Swin Transformer on ImageNet (84.8% Top-1) and achieves higher segmentation mIoU on ADE20K (RepLKNet-XL: 56.0%) with better inference speed and lower resource usage than ViTs.
  • Mobile & lightweight models: MixNet models using MixConv blocks outperform MobileNetV2 and other AutoML-discovered models on ImageNet (+4.2% top-1; up to 78.9% at 565M FLOPs) and on COCO detection, with parameter-efficient multi-scale receptive fields (Tan et al., 2019).
  • Object detection: DSLK-Block in YOLO-Ant (Tang et al., 2024) (kernel sizes up to $27\times27$ via depthwise separable convolutions) yields a ~11% improvement in small-object detection mAP on COCO and valuable power savings in a lightweight detector.
  • Speech enhancement: IMSE (Tang et al., 18 Nov 2025), built on IDConv, achieves state-of-the-art PESQ scores with a 16.8% parameter reduction vs. MUSE, exploiting large anisotropic receptive fields matched to spectrotemporal structure.
  • 3D medical imaging: Rep3D (Lee et al., 26 May 2025) leverages $21\times21\times21$ depthwise kernels with spatially adaptive optimization, achieving top Dice scores on KiTS19, MSD Pancreas, and AMOS22 and surpassing transformer-based 3D segmenters.
  • Image dehazing: LKD-Net (Luo et al., 2022) uses decomposed large depthwise kernels (DLKCB with dilation) to outperform convolutional and transformer networks on SOTS while requiring only 1.79% of the parameters and 48.9% of the FLOPs of Dehamer.
  • Robustness and shape bias: Large kernels in attention or convolution increase a model's invariance to local corruption, improve retention of object shape features, and reduce overfitting to local textures (Lau et al., 2023, Ding et al., 2022).

5. Hardware and Implementation Considerations

Efficient hardware support is imperative due to the high bandwidth, memory footprint, and computational intensity of naïve large-kernel convolutions. Practical accelerators decouple kernel-size scalability from internal parallelism; e.g., the architecture in (Chen et al., 2021) achieves 100% MAC utilization up to $K=7$ in depthwise mode by mapping each channel to a dedicated PE, scaling compute with $K^2$, and supporting arbitrary dilation without additional hardware.

Empirical comparisons show that specialized dataflow (address generators, on-chip SRAMs for features and weights, programmable PE utilization) enables real-time inference for real-world backbones such as RetinaFace at VGA resolution, raising throughput by 20% and reducing model size by 20% via DDC-type layers (Chen et al., 2021). Such architectures also outperform prior designs (e.g., Light-OPU, Su et al.) in MAC utilization when $K>3$.

6. Design Trade-offs and Best Practices

Key considerations when deploying large-kernel depthwise convolutions are:

  • Parameter and compute trade-off: Direct $K\times K$ depthwise convolutions scale quadratically in $K$, but decomposed, separable, or inception-style architectures can achieve comparable or better representational power with parameter and FLOP reductions of 5×–20× or more (Chen et al., 2020, Tang et al., 18 Nov 2025, Luo et al., 2022, Lau et al., 2023).
  • Optimization: Structural re-parameterization and identity shortcuts mitigate vanishing gradients and improve convergence with large kernels (Ding et al., 2022, Lee et al., 26 May 2025).
  • Channel splitting and mixing: MixConv and PydDWConv partition channels over kernels of different sizes, which can further balance computational efficiency and ERF diversity (Hoang et al., 2018, Tan et al., 2019).
  • Dilation vs. true kernel size: Dilation provides a partial substitute for large $K$, but direct large kernels generally offer superior empirical results for the same parameter budget (Tan et al., 2019, Luo et al., 2022).
  • Task specificity: For highly anisotropic data (e.g., speech), strip-wise or inception decompositions best exploit parameter budgets (Tang et al., 18 Nov 2025).
  • Hardware alignment: Choosing a scheme (e.g., XSepConv, LSKA, decomposed/dilated depthwise) that aligns with accelerator dataflow and memory constraints is crucial for on-device deployment.
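The claimed 5×–20× reductions can be sanity-checked against the per-channel formulas in Section 3 (reading the $4(K/2)C$ LSKA entry with integer division for odd $K$ is our assumption):

```python
# Per-channel parameter counts at K = 31, using the Section 3 formulas.
K = 31
direct = K * K          # 961 weights: direct K x K depthwise filter
xsep = 2 * K + 4        # 66 weights:  XSepConv (K x 1) + (1 x K) + small 2D conv
lska = 4 * (K // 2)     # 60 weights:  LSKA separable large-kernel attention

assert round(direct / xsep, 1) == 14.6   # within the quoted 5x-20x band
assert direct / lska > 15
```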

7. Representative Architectures and Empirical Outcomes

Architecture / Study  | Max kernel size / type                  | Peak Top-1 / key metric     | Parameter reduction vs. naive   | Domain(s)               | Reference
--------------------- | --------------------------------------- | --------------------------- | ------------------------------- | ----------------------- | -------------------------
RepLKNet              | $31\times31$, direct + reparam          | 84.8% (ImageNet), 56.0% mIoU | 10.4% increase over $3\times3$ | Vision/general          | (Ding et al., 2022)
MixNet (MixConv)      | $9\times9$, channel splitting           | 78.9% (ImageNet)            | 50% vs. naive $9\times9$        | Image, detection        | (Tan et al., 2019)
XSepConv              | 5–7, spatially separable                | +0.27% Top-1 (CIFAR-10)     | 44% vs. $5\times5$              | Mobile vision           | (Chen et al., 2020)
DSLK-Block (YOLO-Ant) | 5–27, DW-sep, skip branch               | +11.3% mAP (small obj.)     | ~190× vs. $27\times27$          | Detection               | (Tang et al., 2024)
LKD-Net (DLKCB)       | $21\times21$, decomposed/dilated        | Best mIoU (SOTS)            | 78% vs. $21\times21$            | Dehazing                | (Luo et al., 2022)
Rep3D                 | $21\times21\times21$, 3D DW + mask      | Top Dice (0.910 AMOS-CT)    | Efficient per-ERF               | Volumetric segmentation | (Lee et al., 26 May 2025)
IMSE (IDConv)         | $11\times11$, inception/strip-style     | 3.373 PESQ (speech)         | 16.8% vs. MUSE                  | Speech enhancement      | (Tang et al., 18 Nov 2025)
VAN-LSKA              | up to $65\times65$, separable LKA       | 75–83% Top-1                | 50% in kernel params            | Vision, robust det.     | (Lau et al., 2023)

These empirical gains reinforce large-kernel depthwise convolution's role as a central primitive in modern efficient CNN and hybrid CNN-ViT architectures, enabling large-ERF modeling at modest cost when designed with separability, re-parameterization, and context-specific decomposition.


References:

(Chen et al., 2021, Ding et al., 2022, Tan et al., 2019, Hoang et al., 2018, Chen et al., 2020, Lau et al., 2023, Luo et al., 2022, Lee et al., 26 May 2025, Tang et al., 18 Nov 2025, Tang et al., 2024)
