Depth-Wise Convolution in FFN
- Depth-wise convolution in FFNs applies a separate spatial filter to each channel, reducing parameter costs and enhancing local feature extraction.
- It integrates into feed-forward networks as a parallel or in-line module, effectively combining local context with global representations.
- Empirical results demonstrate notable improvements in accuracy and training speed while significantly reducing computational complexity.
Depth-wise convolution, originally devised to reduce the parameter and computational cost of convolutional neural networks, has become a central architectural component not only in compact CNNs but also in contemporary feed-forward networks (FFNs), including modern Transformer and vision models. The integration of depth-wise convolution into FFN structures exploits the spatially local inductive bias of convolutions while retaining or enhancing computational efficiency, training speed, and sometimes parameter efficiency. Depth-wise mechanisms in FFNs serve as efficient local spatial filters, as dynamic receptive-field augmentation, and as plug-in modules that fuse local and global context.
1. Fundamental Concepts: Depth-Wise Convolution in FFNs
Standard convolutional layers apply a set of filters across the full channel dimension of their input, resulting in significant parameter and computation requirements. In contrast, depth-wise convolution decouples the channel-mixing and spatial processing: it applies a separate spatial filter to each input channel independently, with an optional subsequent pointwise (1×1) convolution to mix channels. Within FFN contexts—including MLP-like blocks in vision transformers, hybrid CNN/FFN models, or spatiotemporal architectures—depth-wise convolutions serve to encode spatial context with negligible cross-channel overhead.
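As a concrete illustration, the minimal PyTorch sketch below (module name and sizes are illustrative, not drawn from any cited work) contrasts the two stages of a depth-wise separable convolution: the `groups=channels` argument restricts each spatial filter to a single channel, and a 1×1 convolution performs the optional channel mixing.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise 3x3 filter per channel, followed by a 1x1 point-wise channel mixer."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # groups=channels => one spatial filter per input channel, no cross-channel mixing
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # 1x1 convolution mixes channels at negligible spatial cost
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 32, 32)        # (batch, channels, H, W)
y = DepthwiseSeparableConv(64)(x)     # output has the same shape as x
```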
In feed-forward network (FFN) settings, especially those assembling MLP or linear projections in high-dimensional latent spaces, depth-wise convolutional modules can be introduced at various locations: directly inside the FFN, as a parallel (bypass) operation, or as a precursor/post-processor for channel-mixing components. This diversity enables models to fuse global representations (from attention or dense FFN layers) with localized spatial or temporal details, effectively bridging the inductive bias gap between globalized transformers and spatially structured CNNs (Zhang et al., 28 Jul 2024).
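A minimal sketch of the "inside the FFN" placement follows; the square token grid, GELU activation, and expansion ratio of 4 are illustrative assumptions rather than details fixed by the cited papers.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """FFN with a depth-wise convolution inserted between the two linear projections."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) token sequence with N == H * W
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # tokens -> 2D feature map
        x = self.dwconv(x)                          # local spatial mixing, one filter per channel
        x = x.flatten(2).transpose(1, 2)            # back to (B, N, C)
        return self.fc2(self.act(x))
```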
2. Architectures and Integration Strategies
Table 1: Common Integration Strategies of Depth-Wise Convolution in FFNs
| Strategy/Location | Description | Reference |
|---|---|---|
| Parallel Shortcut/Bias Injection | Depth-wise conv as a parallel path to FFN output | (Zhang et al., 28 Jul 2024) |
| Inside FFN (MLP/Projection) | Depth-wise conv placed between linear layers | (Luo et al., 2022) |
| Pre-Attention/Local Pre-Encoder | Depth-wise conv enriches input before MHSA | (Zhang et al., 28 Jul 2024) |
| Temporal-Spatial Split (1D/2D/3D) | Depth-wise conv for frame/spatial encoding in 3D/1D | (Heo et al., 2023, Nguy et al., 2023) |
The most common recent approach is to insert a lightweight depth-wise convolution module as a shortcut or parallel branch to each FFN or Transformer block. For instance, in (Zhang et al., 28 Jul 2024), the output of an FFN is reshaped from 1D tokens into a 2D grid, passed through activation and normalization, then through a depth-wise convolution. The convolved output is reshaped back to 1D and summed with the FFN output, fusing local and global representations. Alternative approaches (e.g., (Luo et al., 2022)) integrate a 3×3 depth-wise convolution inside the FFN structure, jointly with channel attention, yielding a channel-enhanced FFN (CEFN).
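A plausible realization of this parallel-shortcut fusion is sketched below; the choice of BatchNorm and GELU and the exact ordering are assumptions for illustration, not the reference implementation of (Zhang et al., 28 Jul 2024).

```python
import torch
import torch.nn as nn

class DWConvShortcut(nn.Module):
    """Parallel depth-wise branch summed with the FFN output (illustrative ordering)."""
    def __init__(self, dim: int):
        super().__init__()
        self.act = nn.GELU()
        self.norm = nn.BatchNorm2d(dim)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, ffn_out: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = ffn_out.shape                              # 1D token sequence, N == H * W
        grid = ffn_out.transpose(1, 2).reshape(B, C, H, W)   # tokens -> 2D grid
        local = self.dwconv(self.norm(self.act(grid)))       # activation, norm, then depth-wise conv
        local = local.flatten(2).transpose(1, 2)             # back to 1D tokens
        return ffn_out + local                               # fuse local and global paths
```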
Some frameworks use depth-wise separable convolutions for both efficiency and rotational or translation invariance, especially in contexts where data variety is high or hardware constraints are significant (Fuhl et al., 2020, Kumari et al., 2022, Nguy et al., 2023, Ye et al., 2018).
3. Computational Efficiency and Parameter Trade-Offs
Depth-wise convolutions drastically reduce the complexity and number of parameters relative to standard convolutions. The baseline parameter cost for a standard convolution layer with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, and filters of shape $K \times K$ is:

$$P_{\text{std}} = K \cdot K \cdot C_{\text{in}} \cdot C_{\text{out}}$$

For a depth-wise separable convolution, the parameter cost splits as:

$$P_{\text{dws}} = \underbrace{K \cdot K \cdot C_{\text{in}}}_{\text{depth-wise}} + \underbrace{C_{\text{in}} \cdot C_{\text{out}}}_{\text{pointwise}}$$

yielding a reduction factor of roughly $1/C_{\text{out}} + 1/K^{2}$. This reduction becomes particularly significant when the number of channels is large and the kernel size $K$ is moderate. In the 3D context (Ye et al., 2018), comparable reductions (often over 10×) are achieved on volumetric data. Applied within FFN structures (or blocks with large expansion factors), such reductions can make hybrid transformer–CNN architectures or deep vision models feasible on limited resources (Luo et al., 2022, Kumari et al., 2022, Nguy et al., 2023).
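To make the trade-off concrete, the short check below (with illustrative values C = 256, K = 3) compares parameter counts; the numbers follow directly from the formulas above and agree with PyTorch's own accounting.

```python
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

C, K = 256, 3
std = nn.Conv2d(C, C, K, padding=1, bias=False)
dws = nn.Sequential(
    nn.Conv2d(C, C, K, padding=1, groups=C, bias=False),  # depth-wise: K*K*C params
    nn.Conv2d(C, C, 1, bias=False),                       # point-wise: C*C params
)

print(n_params(std))                 # 589824 = K*K*C*C
print(n_params(dws))                 # 67840  = K*K*C + C*C
print(n_params(std) / n_params(dws)) # ~8.7x fewer parameters
```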
From a computational standpoint, the diagonalwise refactorization technique enables the execution of all depth-wise convolutions via a single standard convolution backed by highly optimized libraries (e.g., cuDNN), realizing between 1.4× and 15.4× training speedup across various frameworks (Qin et al., 2018).
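The underlying idea can be illustrated as follows: each per-channel kernel is placed on the diagonal of a dense (C, C, K, K) weight tensor, so a single standard convolution reproduces the depth-wise result and can run through optimized dense-convolution kernels. This toy sketch only demonstrates the numerical equivalence, not the optimized implementation of (Qin et al., 2018).

```python
import torch
import torch.nn.functional as F

C, K, H, W = 8, 3, 16, 16
x = torch.randn(1, C, H, W)
dw_weight = torch.randn(C, 1, K, K)            # one K x K filter per channel

# Grouped (depth-wise) convolution: each channel sees only its own filter.
y_dw = F.conv2d(x, dw_weight, padding=K // 2, groups=C)

# Diagonal-wise refactorization: embed each per-channel filter on the diagonal
# of a dense (C, C, K, K) weight; off-diagonal blocks stay zero.
diag_weight = torch.zeros(C, C, K, K)
for c in range(C):
    diag_weight[c, c] = dw_weight[c, 0]
y_dense = F.conv2d(x, diag_weight, padding=K // 2)

print(torch.allclose(y_dw, y_dense, atol=1e-5))  # True: both paths agree
```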
4. Performance, Inductive Bias, and Training Dynamics
Vanilla FFN and Transformer models lack the spatial inductive bias inherent to convolutions, resulting in comparatively slow convergence and lower accuracy on small datasets. Augmenting FFNs with depth-wise convolutions injects a strong locality bias, improving representation capacity for local details (edges, textures, local dependencies) without sacrificing the long-range modeling enabled by self-attention or channel-mixing.
Empirical results show substantial gains. In (Zhang et al., 28 Jul 2024), integrating a depth-wise convolution module as a shortcut in ViT models boosts CIFAR-10/100 accuracy (ViT-Tiny improves from 94.01% to 96.41%) and provides a roughly +4% margin on CIFAR-100, with minimal parameter overhead (0.023M extra). Similar improvements are seen on ImageNet and COCO (detection/segmentation). In low-level vision, adding depth-wise convolutions to FFNs (as in the channel-enhanced FFN, CEFN) improves dehazing performance (higher PSNR/SSIM) while reducing model size by an order of magnitude (Luo et al., 2022).
The parallel or bypass design (where the depth-wise block is summed with the FFN output) accelerates convergence, as observed in training on small datasets (Zhang et al., 28 Jul 2024). In tasks requiring rotational invariance or rapid adaptation to spatial variation, depth-wise convolution as a separable component (possibly with radial parameterization) boosts invariance while preserving efficiency (Fuhl et al., 2020).
5. Variants and Specialized Contexts
Several variants exist for FFN integration:
- Grouped or Parallel Depth-Wise Convolutions: Input split into groups, each processed by a depth-wise conv (with possibly different kernel sizes), then recombined (Zhang et al., 28 Jul 2024, Luo et al., 2022).
- Depth-Wise plus Dilated Convolution: A large effective kernel is constructed by composing a regular 3×3 depth-wise convolution with a depth-wise dilated convolution, mimicking large receptive fields at low cost (Luo et al., 2022); see the sketch after this list.
- Frequency-Domain Pruning: In depth-wise separable architectures where pointwise layers dominate, further efficiency is achieved by transforming activations to the frequency domain, selectively pruning higher-frequency coefficients per channel using a learned threshold and regularization (Buckler et al., 2021).
- Radial Weight Sharing: For tasks with inherent rotational symmetry, depth-wise convolutions with radially indexed weights promote invariance by enforcing rotationally symmetric filters (Fuhl et al., 2020).
- Temporal-Spatial Split: For spatiotemporal tasks, depth-wise convolution is used for temporal (multi-scale 1D) or spatial (3D) encoding, followed by a frame-wise (channel) FFN with channel normalization, enabling efficient and selective feature propagation (Heo et al., 2023, Nguy et al., 2023).
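As referenced in the depth-wise plus dilated item above, the composition can be sketched as follows; the 3×3 kernels and dilation rate of 3 are illustrative choices that yield an effective receptive field of roughly 9×9 at depth-wise cost.

```python
import torch
import torch.nn as nn

class LargeKernelDW(nn.Module):
    """Compose a plain 3x3 depth-wise conv with a dilated 3x3 depth-wise conv.

    Together the two stages cover roughly a 9x9 neighbourhood while keeping
    the per-channel (depth-wise) parameter cost of two 3x3 kernels.
    """
    def __init__(self, channels: int, dilation: int = 3):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dilated = nn.Conv2d(channels, channels, 3, padding=dilation,
                                 dilation=dilation, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dilated(self.local(x))

x = torch.randn(1, 64, 32, 32)
print(LargeKernelDW(64)(x).shape)  # torch.Size([1, 64, 32, 32]) -- spatial size preserved
```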
6. Applications Beyond Vision: Text, Speech, and Spatiotemporal Processing
While image classification, detection, and segmentation are the dominant applications, depth-wise convolutional FFN blocks have been deployed in handwritten text recognition (Kumari et al., 2022), speaker verification (Heo et al., 2023), and eye blinking detection (Nguy et al., 2023). In all cases, the common goal is to shrink model size and inference time while maintaining discriminative power, especially under domain-specific spatiotemporal constraints.
Integration with recurrent or gated convolutional components (e.g., Bi-GRUs, gated CNNs) leverages the local representations produced by depth-wise convolutions as primitives for sequential modeling. In dynamic, multi-scale, or multi-branch pyramidal architectures, depth-wise operators further enable fine-grained feature extraction at drastically reduced computational burden (Nguy et al., 2023).
7. Limitations, Challenges, and Future Directions
- Depth-wise convolutions, while efficient, reduce spatial parameterization flexibility, potentially limiting expressivity if overused or lacking complementary channel-mixing (Fuhl et al., 2020, Luo et al., 2022).
- The success of network decoupling—approximating regular convolutions by a sum of depth-wise separable convolutions—critically depends on redundancy in the original network. Aggressive decoupling may require fine-tuning to recover accuracy (Guo et al., 2018).
- Integration into transformer-based FFNs as a parallel path is lightweight and effective for plug-and-play usage; however, optimal design (e.g., placement, kernel size, frequency, parallelization) remains an open research question (Zhang et al., 28 Jul 2024).
- In transfer learning or large-scale pre-training, introducing local inductive bias through depth-wise convolution modules may alter the downstream performance dynamics, warranting further investigation.
Summary Table: Depth-Wise Convolution in FFN Context
| Aspect | Depth-Wise Convolution Approach | Representative Source |
|---|---|---|
| Efficiency | Dramatic reduction in params/FLOPs; GPU-optimized | (Qin et al., 2018, Ye et al., 2018) |
| Local Bias | Enhances capture of fine spatial/temporal details | (Zhang et al., 28 Jul 2024, Luo et al., 2022) |
| FFN Integration | Plug-and-play shortcut, inside-FFN, or hybrid | (Zhang et al., 28 Jul 2024, Luo et al., 2022, Heo et al., 2023) |
| Expressivity | May need pointwise/channel-mixing for full power | (Guo et al., 2018, Fuhl et al., 2020) |
| Performance Gains | Improved accuracy, faster convergence, small ΔParams | (Zhang et al., 28 Jul 2024, Kumari et al., 2022) |
| Specialized Tasks | 3D, time series, invariance-need tasks | (Ye et al., 2018, Heo et al., 2023, Fuhl et al., 2020) |
Depth-wise convolution in FFNs unifies convolutional and transformer paradigms, providing a tractable mechanism for fusing spatial and global features across a variety of architectures and domains. Its precise role, optimal placement, and broader impact remain active areas of exploration in efficient deep model design.