Fused Convolutional Modules (FCMs)
- Fused Convolutional Modules (FCMs) are innovative structures that fuse features through adaptive convolutional operations, enhancing model expressivity and efficiency.
- They are applied in multi-scale, multi-branch, and attention-enhanced architectures, achieving 1–3 percentage point accuracy improvements with minimal parameter overhead.
- FCMs enable hardware-efficient inference by merging operations at the kernel level, reducing DRAM traffic by up to 83% and significantly saving energy.
Fused Convolutional Modules (FCMs) are architectural and computational entities within deep neural networks designed to efficiently integrate feature maps or convolutional operations through adaptive weighting, compositional fusion, or operator-level merging. FCMs are motivated by two principal objectives: (1) enhancing model expressivity through better feature or pathway aggregation, and (2) improving system-level efficiency, particularly in hardware-accelerated inference regimes and bandwidth-constrained environments. The precise instantiation of an FCM varies across system, vision, and hardware contexts but typically involves lightweight convolutional operators (e.g., 1×1 convolutions), attention mechanisms, or direct kernel-level fusion, often with minimal parameter or memory cost.
1. Multi-Scale and Multi-Branch Feature Fusion Using FCMs
In vision architectures, FCMs often operate by fusing multi-scale or multi-branch features extracted at different stages or from disparate modalities. The Convolutional Fusion Network (CFN) paradigm applies FCMs as a set of side branches that capture activations at pooling or down-sampling layers. Each side branch consists of a 1×1 convolution followed by global average pooling (GAP), which yields a compact, fixed-dimensional GAP vector per branch.
These branch vectors are concatenated and fused using a locally-connected (per-channel, unshared) fusion layer, followed by a nonlinear activation such as ReLU. This architecture allows per-channel adaptive weighting of sources and demonstrates parameter economy: on CIFAR, the side branches add little parameter overhead, and the LC fusion layer adds only one weight per channel per fused source (e.g., 640 weights in one reported configuration). Empirical results consistently show 1–2% absolute accuracy improvement over parameter-matched CNNs, both on CIFAR-10/100 and on ImageNet (Liu et al., 2016).
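The side-branch pipeline described above (1×1 convolution, GAP, locally connected fusion, ReLU) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the CFN authors' implementation: the function names, the toy shapes (`K = 4` channels, `S = 3` sources), and the constant weights are all hypothetical.

```python
def conv1x1(fmap, weights):
    """1x1 convolution. fmap: [C_in][H][W]; weights: [C_out][C_in]."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[i][y][x] for i, w in enumerate(row)) for x in range(width)]
         for y in range(height)]
        for row in weights
    ]


def global_avg_pool(fmap):
    """Average each channel over its spatial extent, giving one value per channel."""
    return [sum(map(sum, channel)) / (len(channel) * len(channel[0])) for channel in fmap]


def lc_fusion(branch_vectors, lc_weights, bias):
    """Locally connected fusion: channel k mixes only the k-th entry of each
    branch vector, using its own (unshared) weight per branch, then ReLU."""
    fused = []
    for k, b in enumerate(bias):
        pre = b + sum(lc_weights[k][s] * vec[k] for s, vec in enumerate(branch_vectors))
        fused.append(max(0.0, pre))
    return fused


# Hypothetical toy setup: S = 3 sources at different resolutions, K = 4 channels.
K, S = 4, 3
sources = [
    [[[1.0] * 8 for _ in range(8)] for _ in range(2)],  # 2 channels, 8x8
    [[[1.0] * 4 for _ in range(4)] for _ in range(2)],  # 2 channels, 4x4
    [[[1.0] * 2 for _ in range(2)] for _ in range(2)],  # 2 channels, 2x2
]
proj = [[1.0, 1.0] for _ in range(K)]        # 1x1 conv weights, 2 -> K channels
branch_vectors = [global_avg_pool(conv1x1(f, proj)) for f in sources]

lc_weights = [[0.5] * S for _ in range(K)]   # K * S unshared fusion weights
fused = lc_fusion(branch_vectors, lc_weights, bias=[0.0] * K)
num_lc_weights = sum(len(row) for row in lc_weights)
```

Because GAP removes the spatial dimensions, branches tapped at different resolutions produce vectors of the same length, and the fusion layer needs only `K * S` weights in this sketch.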
In multi-modal or multi-task segmentation networks, FCMs are constructed as chains of 1×1 convolutions acting as shared and private fusion nodes. For example, in multi-branch DeepUNet fusion, FCMs concatenate deep features across modalities (e.g., IRRG, IRGB, DSM), apply a 1×1 convolution to obtain a shared representation, concatenate this with shallow branch-specific features, and apply branch-specific 1×1 convolutions to obtain private features. A final fusion block combines the private features into per-pixel class predictions (Sun et al., 2018). This block structure provides both shared and modality-specialized fusion, yielding consistent percentage-point gains in segmentation accuracy on large remote sensing benchmarks.
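Since a 1×1 convolution acts on each pixel independently, the shared/private chain described above can be illustrated at a single pixel. The sketch below is an assumption-laden toy, not the published network: the function names, channel counts, and weights are hypothetical, and batch normalization is omitted for brevity.

```python
def pointwise(vec, weights):
    """1x1 convolution at one pixel. weights: [C_out][C_in]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


def shared_private_fusion(deep, shallow, w_shared, w_private, w_final):
    """deep / shallow: dict mapping modality name -> per-pixel feature vector."""
    names = sorted(deep)
    # 1) Concatenate deep features across modalities and fuse into a shared vector.
    concat_deep = [v for n in names for v in deep[n]]
    shared = relu(pointwise(concat_deep, w_shared))
    # 2) Per modality: concatenate shared + shallow features, apply a private 1x1 conv.
    private = {n: relu(pointwise(shared + shallow[n], w_private[n])) for n in names}
    # 3) Final fusion of all private features into per-class scores.
    concat_private = [v for n in names for v in private[n]]
    return pointwise(concat_private, w_final)


modalities = ["IRRG", "IRGB", "DSM"]
deep = {m: [1.0, 2.0] for m in modalities}        # 2 deep channels per modality
shallow = {m: [1.0] for m in modalities}          # 1 shallow channel per modality
w_shared = [[0.5] * 6 for _ in range(2)]          # 6 -> 2 shared channels
w_private = {m: [[1.0] * 3 for _ in range(2)] for m in modalities}  # 3 -> 2 each
w_final = [[float(c)] * 6 for c in range(3)]      # 6 -> 3 class scores
scores = shared_private_fusion(deep, shallow, w_shared, w_private, w_final)
```

The shared node sees every modality, while each private node sees only the shared vector plus its own shallow features, which is what keeps the branches modality-specialized.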
2. Attention-Enhanced FCMs for Feature Aggregation
Hybrid attention mechanisms are introduced in FCMs to enable dynamic weighting at both the channel and spatial levels. In the Attention-Fused Network (AFNet), two principal FCMs are employed:
- Multipath Attention-Fused Block (MAFB): Operates on concatenated feature maps from main and auxiliary branches (e.g., IRRG and NDVI/DSM). MAFB computes channel attention (CA) using Squeeze-and-Excitation operations and a separate spatial attention (SA) map. The two attention-weighted features are concatenated and projected via 1×1 convolution to the output channel space.
- Refinement Attention-Fused Block (RAFB): Operates in the decoder, fusing high-level abstract features and low-level spatial features. Using the same CA and SA mechanisms, RAFB applies cross-gating followed by element-wise summation.
These modules enhance semantic discrimination and boundary accuracy, with AFNet demonstrating gains in both mean F1 and overall accuracy without a significant increase in computational cost compared to conventional fusion schemes (Yang et al., 2021).
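A rough sketch of the MAFB-style flow (channel attention, spatial attention, concatenation, 1×1 projection) is given below. The channel gate follows the standard Squeeze-and-Excitation pattern; the spatial gate is an assumed form (a 1×1 convolution to a single map followed by a sigmoid), because the exact AFNet formulation is not reproduced in this text. All function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(fmap, w_reduce, w_expand):
    """SE-style gate: squeeze (GAP) -> reduce -> ReLU -> expand -> sigmoid -> rescale."""
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    return [[[gates[c] * v for v in row] for row in ch] for c, ch in enumerate(fmap)]


def spatial_attention(fmap, w_spatial):
    """ASSUMED form: a 1x1 conv collapses channels to one map; its sigmoid is a
    per-pixel mask that rescales every channel."""
    height, width = len(fmap[0]), len(fmap[0][0])
    mask = [[sigmoid(sum(w * ch[y][x] for w, ch in zip(w_spatial, fmap)))
             for x in range(width)] for y in range(height)]
    return [[[mask[y][x] * ch[y][x] for x in range(width)] for y in range(height)]
            for ch in fmap]


def attention_fused_block(fmap, w_reduce, w_expand, w_spatial, w_project):
    """Concatenate CA- and SA-weighted features, then project with a 1x1 conv."""
    stacked = channel_attention(fmap, w_reduce, w_expand) + spatial_attention(fmap, w_spatial)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[[sum(w * ch[y][x] for w, ch in zip(row, stacked)) for x in range(width)]
             for y in range(height)] for row in w_project]


# Toy input: 2 channels, 2x2. Zero attention weights make every gate sigmoid(0) = 0.5.
x = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
out = attention_fused_block(
    x,
    w_reduce=[[0.0, 0.0]],              # 2 -> 1 channels (squeeze bottleneck)
    w_expand=[[0.0], [0.0]],            # 1 -> 2 channels
    w_spatial=[0.0, 0.0],
    w_project=[[1.0, 0.0, 1.0, 0.0]],   # adds the CA and SA versions of channel 0
)
```

With all gates at 0.5, the projection of the two half-weighted copies of channel 0 recovers the original channel, which makes the data path easy to check by hand.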
3. Kernel-Level FCMs for Hardware-Efficient Inference
On computational platforms such as GPUs, FCMs refer to operator fusion at the kernel level. Specifically, back-to-back depthwise (DW) and pointwise (PW) convolutions are merged into a single GPU kernel, eliminating intermediate global memory accesses. Each FCM kernel manages up to five types of shared-memory tiles (input, two filter tiles, communication buffer, output) and utilizes output-stationary/local weight-stationary (OS-LWS) dataflow. Architectural constraints are captured by tiling parameters and shared-memory budgets.
The memory-access reduction model expresses the DRAM traffic of a fused block under the condition that neither the output of the first convolution nor the input of the second is committed to DRAM within the fused execution block; the intermediate feature map stays on chip. Experimental results show up to 83% DRAM-access reduction, together with speedups over cuDNN kernels and end-to-end inference speedup relative to TVM (Qararyah et al., 2024).
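To make the source of the traffic savings concrete, the following sketch counts DRAM element transfers for a depthwise convolution followed by a pointwise convolution, with and without fusion. It is an idealized model, not the one used by the cited work: it assumes stride 1, same padding, and that every tensor crosses the DRAM boundary exactly once (no halo re-reads or weight reloads). The function name and layer sizes are hypothetical.

```python
def dw_pw_dram_traffic(height, width, channels_in, channels_out, kernel=3):
    """Element counts moved through DRAM for a DW conv followed by a PW conv,
    assuming each tensor is transferred exactly once per read or write."""
    ifm = height * width * channels_in            # input feature map
    intermediate = height * width * channels_in   # DW output == PW input
    ofm = height * width * channels_out           # output feature map
    dw_weights = kernel * kernel * channels_in
    pw_weights = channels_in * channels_out

    # Unfused: DW writes the intermediate to DRAM, PW reads it back.
    unfused = (ifm + dw_weights + intermediate) + (intermediate + pw_weights + ofm)
    # Fused: the intermediate lives only in on-chip memory.
    fused = ifm + dw_weights + pw_weights + ofm
    return unfused, fused, intermediate


unfused, fused, intermediate = dw_pw_dram_traffic(56, 56, 128, 128)
reduction = 1.0 - fused / unfused
```

In this idealized setting the saving is exactly two passes over the intermediate tensor, so the relative reduction grows as the intermediate becomes large compared with the block's input, output, and weights.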
A dedicated tool, FusePlanner, uses analytic cost models to decide whether fusion is feasible under hardware shared-memory constraints and, if so, to select the fusion plan and tiling configuration. Energy savings are also reported as a consequence of the DRAM-access reductions.
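The feasibility question can be illustrated with a simple shared-memory budget check over the five tile types mentioned earlier (input, two filter tiles, communication buffer, output). This is a hypothetical sketch and does not reproduce FusePlanner's cost model: the tile-size formulas, the 48 KB budget, and all names are assumptions chosen for illustration.

```python
def fused_tile_bytes(tile_h, tile_w, c_tile, f_tile, kernel=3, bytes_per_elem=4):
    """Bytes per shared-memory tile for one fused DW+PW kernel (assumed layout)."""
    halo = kernel - 1  # extra input rows/columns needed by a stride-1 DW filter
    elems = {
        "input": (tile_h + halo) * (tile_w + halo) * c_tile,
        "dw_filter": kernel * kernel * c_tile,
        "pw_filter": c_tile * f_tile,
        "comm_buffer": tile_h * tile_w * c_tile,   # DW output handed to PW
        "output": tile_h * tile_w * f_tile,
    }
    return {name: n * bytes_per_elem for name, n in elems.items()}


def fusion_feasible(tile_bytes, smem_budget_bytes):
    """Fusion is feasible only if all tiles fit in shared memory together."""
    return sum(tile_bytes.values()) <= smem_budget_bytes


SMEM_BUDGET = 48 * 1024  # assumed per-block shared-memory budget, in bytes
small = fused_tile_bytes(8, 8, c_tile=32, f_tile=32)
large = fused_tile_bytes(16, 16, c_tile=32, f_tile=32)
small_ok = fusion_feasible(small, SMEM_BUDGET)
large_ok = fusion_feasible(large, SMEM_BUDGET)
```

A planner in this spirit would enumerate candidate tilings, discard those that fail the budget check, and rank the rest with a traffic or latency estimate.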
4. Fused-Layer Dataflow in Near-Memory Architectures
In memory-centric architectures such as near-bank DRAM-PIM, FCMs refer to the execution of multiple consecutive convolution layers “fused” into a single spatial tile computation—fused-layer dataflow. Under the PIMfused model, compiler/runtime systems partition the network into “fused-kernels”, which are sequences of CONV±BN±ReLU±POOL that execute atomically per tile across bank-local PIMcores.
The decoupling of inter-bank dependencies is captured by a transfer saving ratio that depends on the number of banks: the larger the bank-group tile, the greater the suppression of cross-bank traffic. Empirical measurements on ResNet18 demonstrate that, at optimized buffer sizes (GBUF=32 KB, LBUF=256 B, 4-bank PIMcores), memory cycles, system energy, and die area are each reduced to a fraction of those of a GDDR6-AiM-like baseline (Yang et al., 11 Nov 2025).
This hardware-software FCM instantiation underscores the strategic advantage of spatially local computation on intermediate features, trading small redundancy against massive off-chip transfer savings.
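The redundancy side of this trade-off can be estimated with the usual receptive-field recurrence: walking backward through the fused layers, each layer enlarges the input tile that must be available locally. The sketch below is a generic illustration of fused-layer tiling, not the PIMfused cost model; the function names and the example layer chain are hypothetical.

```python
def required_input_tile(out_tile, layers):
    """Side length of the input tile needed to produce an out_tile x out_tile
    output after a chain of (kernel, stride) layers, without padding."""
    size = out_tile
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


def input_overlap_factor(out_tile, layers):
    """Input area per tile relative to the output tile area (stride-1 chains).
    Values above 1.0 indicate halo data shared with neighbouring tiles."""
    side = required_input_tile(out_tile, layers)
    return (side * side) / (out_tile * out_tile)


two_convs = [(3, 1), (3, 1)]  # two fused 3x3, stride-1 convolutions
small_side = required_input_tile(8, two_convs)
large_side = required_input_tile(28, two_convs)
small_factor = input_overlap_factor(8, two_convs)
large_factor = input_overlap_factor(28, two_convs)
```

Larger tiles amortize the halo over more output pixels, which is one reason buffer sizing matters for fused-layer execution.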
5. Comparative Summary and Performance Impact
A comparative view of key FCM realizations is provided below:
| FCM Context | Core Principle | Gains/Benefits |
|---|---|---|
| Multi-scale CFN (Liu et al., 2016) | Side-branch fusion + adaptable weighting | 1–2% accuracy lift, small parameter increase, strong transfer |
| Multimodal UNet (Sun et al., 2018) | Shared/private 1×1 conv fusion | Consistent accuracy gains, sharper boundaries |
| Attention-augmented (Yang et al., 2021) | Channel/spatial attention in FCMs | Higher mean F1 and overall accuracy, low overhead |
| GPU DW/PW kernel fusion (Qararyah et al., 2024) | Operator fusion, shared-mem dataflow | Speedup over cuDNN and TVM, up to 83% DRAM traffic cut, energy savings |
| DRAM-PIM fused-kernel (Yang et al., 11 Nov 2025) | Bank-local fused-layer execution | Memory-cycle and energy savings |
A general pattern emerges in which FCMs, regardless of their domain, provide substantial improvements in statistical accuracy, energy efficiency, or computational latency with modest increases in model or hardware complexity.
6. Design Principles, Practical Recommendations, and Limitations
Across FCM variants, several design patterns recur:
- Modular low-dimensional convolutional fusion (1×1 kernels) suffices for strong representational gains while incurring low overhead.
- Placement of FCMs at all but the earliest pooling/down-sampling layers is recommended (CFN).
- For multi-branch and multi-modal fusion, explicit construction of shared and private feature paths (with batchnorm/ReLU) is critical.
- When hardware-level fusion is targeted, analytic models—such as those in FusePlanner or buffer sizing in PIMfused—are essential to maximize feasibility and benefit.
Typical hyperparameters, initialization strategies, and regularization schemes directly transfer from baseline CNN pipelines (Liu et al., 2016, Sun et al., 2018). For operator-level fusion, aggressive use of shared on-chip memory, careful bank-aligned layout, and adaptive tiling produce robust cross-architecture benefits.
Limitations include diminishing returns on extremely deep networks (CFN), overhead in additional buffer resources (PIMfused), and hardware-imposed fusion feasibility (GPU FCMs). Adopting FCMs at design points with excessive layerwise redundancy or in small-scale regimes may yield marginal gains.
7. Application Domains and Transferability
While FCMs originated in multi-scale vision recognition and segmentation (CIFAR, ImageNet, ISPRS), their principles have been successfully extended to:
- Scene recognition, fine-grained categorization, and image retrieval (CFN: consistent 1–6% accuracy/mAP lift when the fused feature is used off-the-shelf).
- Remote sensing, notably where multimodal fusion or attention-based fine discrimination is critical (DeepUNet multitask, AFNet).
- Compact models (MobileNet, CeiT, CMT) in resource-constrained inference deployments, with direct hardware-level translation to energy and latency reductions.
A plausible implication is that the central paradigm of FCMs—hierarchical, explicit, and hardware-aware fusion—remains viable, adaptable, and beneficial wherever convolutional feature aggregation or layered operator execution creates performance bottlenecks (Liu et al., 2016, Sun et al., 2018, Yang et al., 2021, Qararyah et al., 2024, Yang et al., 11 Nov 2025).