
Fused Convolutional Modules (FCMs)

Updated 17 April 2026
  • Fused Convolutional Modules (FCMs) fuse features through adaptive convolutional operations to improve model expressivity and efficiency.
  • They are applied in multi-scale, multi-branch, and attention-enhanced architectures, achieving 1–3 percentage point accuracy improvements with minimal parameter overhead.
  • FCMs enable hardware-efficient inference by merging operations at the kernel level, reducing DRAM traffic by up to 83% and significantly saving energy.

Fused Convolutional Modules (FCMs) are architectural and computational units within deep neural networks that integrate feature maps or convolutional operations through adaptive weighting, compositional fusion, or operator-level merging. FCMs are motivated by two principal objectives: (1) enhancing model expressivity through better feature or pathway aggregation, and (2) improving system-level efficiency, particularly in hardware-accelerated inference and bandwidth-constrained environments. The precise instantiation of an FCM varies across vision, systems, and hardware contexts but typically involves lightweight convolutional operators (e.g., 1×1 convolutions), attention mechanisms, or direct kernel-level fusion, often at minimal parameter or memory cost.

1. Multi-Scale and Multi-Branch Feature Fusion Using FCMs

In vision architectures, FCMs often fuse multi-scale or multi-branch features extracted at different stages or from disparate modalities. The Convolutional Fusion Network (CFN) paradigm applies FCMs as a set of side branches that capture activations at pooling or down-sampling layers. Each side branch consists of a 1×1 convolution followed by global average pooling (GAP). This processing yields a compact $K$-dimensional GAP vector per branch:

$$g^{(s)}_k = \frac{1}{H^{(s)}W^{(s)}} \sum_{i=1}^{H^{(s)}} \sum_{j=1}^{W^{(s)}} f^{(s)}_{i,j,k}$$

These $S$ branch vectors are concatenated and fused using a locally connected (per-channel, unshared) fusion layer:

$$g^{(f)}_i = \sigma\left(\sum_{s=1}^S W_{i,s}^{(f)} \cdot g^{(s)}_i + b^{(f)}_i\right)$$

where $i$ here indexes the channel and $\sigma$ is a nonlinear activation such as ReLU. This architecture allows per-channel adaptive weighting of sources and is parameter-efficient: on CIFAR, total side-branch overhead is $\approx 0.074$ M parameters, and the LC fusion adds $K(S+1)$ weights (e.g., 640 for $K=128$, $S=4$). Empirical results consistently show 1–2% absolute accuracy improvement over parameter-matched CNNs, both on CIFAR-10/100 and on ImageNet (Liu et al., 2016).
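The two formulas above can be traced on toy data. The following is a minimal sketch in plain Python, assuming a nested-list layout `fmap[i][j][k]` for each branch's feature map; the function names, toy values, and weights are illustrative assumptions and are not taken from the CFN paper.

```python
# Sketch of CFN-style fusion: GAP per side branch, then a locally
# connected (per-channel, unshared) fusion layer with ReLU.
# All names and toy values are hypothetical, for illustration only.

def global_average_pool(fmap):
    """fmap[i][j][k] with shape H x W x K -> list of K channel means."""
    h, w, k = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        sum(fmap[i][j][c] for i in range(h) for j in range(w)) / (h * w)
        for c in range(k)
    ]

def lc_fusion(gap_vectors, weights, biases):
    """Fuse S GAP vectors channel by channel.

    gap_vectors[s][i]: GAP vector of branch s, channel i
    weights[i][s]:     unshared weight for channel i, branch s
    biases[i]:         bias for channel i
    """
    num_branches = len(gap_vectors)
    fused = []
    for i, bias in enumerate(biases):
        pre = sum(weights[i][s] * gap_vectors[s][i] for s in range(num_branches))
        fused.append(max(0.0, pre + bias))  # ReLU as the activation sigma
    return fused

def lc_param_count(k, s):
    """K*S weights plus K biases, i.e. K(S+1)."""
    return k * s + k

# Two branches with K = 2 channels and different spatial sizes.
fmap_a = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # 2 x 2 x 2
fmap_b = [[[2.0, -10.0]]]                                      # 1 x 1 x 2
gaps = [global_average_pool(fmap_a), global_average_pool(fmap_b)]
fused = lc_fusion(gaps, weights=[[0.5, 0.5], [1.0, 1.0]], biases=[0.0, 1.0])
print(gaps)                    # [[4.0, 5.0], [2.0, -10.0]]
print(fused)                   # [3.0, 0.0]
print(lc_param_count(128, 4))  # 640
```

Because the weights are indexed by both channel and branch, each channel learns its own mix of sources, which is what distinguishes this layer from a shared scalar weight per branch.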

In multi-modal or multi-task segmentation networks, FCMs are constructed as chains of 1×1 convolutions acting as shared and private fusion nodes. For example, in multi-branch DeepUNet fusion, FCMs concatenate deep features across modalities (e.g., IRRG, IRGB, DSM), apply a 1×1 convolution to obtain a shared representation, concatenate this with shallow branch-specific features, and apply branch-specific 1×1 convolutions to obtain private features. A final fusion block combines the private features into per-pixel class predictions (Sun et al., 2018). This block structure provides both shared and modality-specialized fusion, yielding consistent gains of at least +2 percentage points in segmentation accuracy on large remote sensing benchmarks.
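The shared/private chain described above can be sketched at a single pixel, where a 1×1 convolution reduces to a matrix-vector product over channels. This is a simplified illustration under stated assumptions: batch normalization, ReLU, and the final fusion block are omitted, and the function names and weights are hypothetical rather than taken from Sun et al.

```python
# Sketch of shared/private 1x1 fusion at one pixel location.
# A 1x1 convolution acts independently per pixel, so a channel vector
# and a weight matrix are enough to show the data flow.

def conv1x1(pixel, weight):
    """weight[out][in] applied to pixel[in]; returns a list of out values."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weight]

def shared_private_fusion(deep_feats, shallow_feats, w_shared, w_private):
    """deep_feats[m], shallow_feats[m]: channel vectors for modality m."""
    # 1) Concatenate deep features across modalities.
    concat_deep = [x for feat in deep_feats for x in feat]
    # 2) One 1x1 convolution produces the shared representation.
    shared = conv1x1(concat_deep, w_shared)
    # 3) Each branch concatenates shared + its own shallow features and
    #    applies its own 1x1 convolution to get a private feature.
    private = [
        conv1x1(shared + shallow, w)
        for shallow, w in zip(shallow_feats, w_private)
    ]
    return shared, private

shared, private = shared_private_fusion(
    deep_feats=[[1.0, 2.0], [3.0, 4.0]],   # two modalities, 2 channels each
    shallow_feats=[[1.0], [2.0]],          # one shallow channel per branch
    w_shared=[[1.0, 1.0, 1.0, 1.0]],       # 4 -> 1 channels
    w_private=[[[1.0, 1.0]], [[1.0, -1.0]]],
)
print(shared)   # [10.0]
print(private)  # [[11.0], [8.0]]
```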

2. Attention-Enhanced FCMs for Feature Aggregation

Hybrid attention mechanisms are introduced in FCMs to enable dynamic weighting at both the channel and spatial levels. In the Attention-Fused Network (AFNet), two principal FCMs are employed:

  • Multipath Attention-Fused Block (MAFB): Operates on concatenated feature maps from the main and auxiliary branches (e.g., IRRG and NDVI/DSM). MAFB computes channel attention (CA) using Squeeze-and-Excitation operations and spatial attention (SA) over the same features. The two attention-weighted features are concatenated and projected via a 1×1 convolution to the output channel space.

  • Refinement Attention-Fused Block (RAFB): Operates in the decoder, fusing high-level abstract features and low-level spatial features. Using the same CA and SA mechanisms, RAFB applies cross-gating followed by element-wise summation.

These modules enhance semantic discrimination and boundary accuracy, with AFNet demonstrating gains in both mean F1 and overall accuracy without a significant increase in computational cost compared to conventional fusion schemes (Yang et al., 2021).
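For readers unfamiliar with Squeeze-and-Excitation, the channel-attention step can be sketched as follows. This is the generic SE formulation (global average pooling, a two-layer bottleneck with ReLU and sigmoid, then channel-wise rescaling), shown as an assumption about the kind of operation involved; it is not claimed to reproduce AFNet's exact equations, and the function name and toy weights are hypothetical.

```python
import math

# Generic Squeeze-and-Excitation channel attention on a C x H x W
# feature map stored as nested lists fmap[c][i][j].

def se_channel_attention(fmap, w1, w2):
    """w1: (C/r) x C reduction weights, w2: C x (C/r) expansion weights."""
    # Squeeze: one scalar per channel via global average pooling.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap
    ]
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value of channel c by its gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, fmap)]
    return gates, scaled

fmap = [[[1.0, 3.0]], [[2.0, 2.0]]]  # C = 2, H = 1, W = 2
gates, scaled = se_channel_attention(fmap, w1=[[0.0, 0.0]], w2=[[1.0], [1.0]])
print(gates)   # [0.5, 0.5]
print(scaled)  # [[[0.5, 1.5]], [[1.0, 1.0]]]
```

With the all-zero reduction weights used here, the gates sit at the sigmoid midpoint of 0.5; trained weights would instead emphasize some channels over others.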

3. Kernel-Level FCMs for Hardware-Efficient Inference

On computational platforms such as GPUs, FCMs refer to operator fusion at the kernel level. Specifically, back-to-back depthwise (DW) and pointwise (PW) convolutions are merged into a single GPU kernel, eliminating intermediate global memory accesses. Each FCM kernel manages up to five types of shared-memory tiles (input, two filter tiles, communication buffer, output) and utilizes output-stationary/local weight-stationary (OS-LWS) dataflow. Architectural constraints are captured by tiling parameters and shared-memory budgets.

The memory-access reduction model accounts for DRAM traffic under the condition that neither the output of the first convolution nor the input of the second is committed to DRAM within the fused execution block. Experimental results show up to 83% DRAM-access reduction and kernel-level speedups versus cuDNN kernels, along with end-to-end inference speedup relative to TVM (Qararyah et al., 2024).
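The source of the saving can be illustrated with a simple counting model. The sketch below is an assumption-laden simplification, not the cost model from Qararyah et al.: it assumes stride 1 with "same" padding, counts each tensor as moved to or from DRAM exactly once, and ignores halo re-reads and per-tile weight reloads. The function name and parameters are hypothetical.

```python
# Idealized DRAM traffic for a depthwise (DW) + pointwise (PW) pair,
# executed as two kernels versus one fused kernel.

def dw_pw_dram_traffic(h, w, c_in, c_out, k=3, bytes_per_elem=1):
    ifm = h * w * c_in          # input feature map read by DW
    mid = h * w * c_in          # DW output, which is also the PW input
    ofm = h * w * c_out         # PW output
    dw_weights = k * k * c_in   # one k x k filter per input channel
    pw_weights = c_in * c_out   # 1x1 filters
    # Unfused: the intermediate is written by DW and read back by PW.
    unfused = ifm + dw_weights + mid + mid + pw_weights + ofm
    # Fused: the intermediate stays on chip and never touches DRAM.
    fused = ifm + dw_weights + pw_weights + ofm
    return unfused * bytes_per_elem, fused * bytes_per_elem

unfused, fused = dw_pw_dram_traffic(h=4, w=4, c_in=2, c_out=4)
print(unfused, fused)                  # 186 122
print(round(1 - fused / unfused, 3))   # 0.344
```

In this model the saving is always twice the size of the intermediate tensor, so the relative reduction is largest when the intermediate dominates the inputs, weights, and outputs.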

A dedicated tool, FusePlanner, leverages analytic cost models to decide whether fusion is feasible under hardware shared-memory constraints and, if so, to select the fusion plan and tiling configuration. Energy savings are also reported as a result of the DRAM-access reductions.
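A feasibility check of this kind can be sketched as a shared-memory budget test over the five tile types listed earlier. Everything in the sketch is a hypothetical illustration: the tile-size formulas, parameter names, and the 48 KB budget are assumptions and do not describe FusePlanner's actual interface or cost model.

```python
# Hypothetical shared-memory feasibility test for a fused DW + PW kernel.
# Tiles: input, DW filters, PW filters, communication buffer, output.

def fused_tile_bytes(tile, c_tile, f_tile, k=3, bytes_per_elem=4):
    """Bytes needed for a square output tile of side `tile`."""
    in_side = tile + k - 1                   # stride-1 halo for the DW filter
    input_tile = in_side * in_side * c_tile
    dw_filters = k * k * c_tile
    pw_filters = c_tile * f_tile
    comm_buffer = tile * tile * c_tile       # DW result handed to the PW stage
    output_tile = tile * tile * f_tile
    total = input_tile + dw_filters + pw_filters + comm_buffer + output_tile
    return total * bytes_per_elem

def largest_feasible_tile(candidates, c_tile, f_tile, budget_bytes):
    """Return the largest candidate tile side that fits, or None."""
    feasible = [
        t for t in candidates
        if fused_tile_bytes(t, c_tile, f_tile) <= budget_bytes
    ]
    return max(feasible) if feasible else None

budget = 48 * 1024  # assumed per-block shared-memory budget
print(fused_tile_bytes(8, c_tile=16, f_tile=32))    # 21312
print(fused_tile_bytes(16, c_tile=16, f_tile=32))   # 72512
print(largest_feasible_tile([4, 8, 16, 32], 16, 32, budget))  # 8
```

When no candidate fits, the function returns `None`, which corresponds to the case where fusion is infeasible and the layers would run as separate kernels.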

4. Fused-Layer Dataflow in Near-Memory Architectures

In memory-centric architectures such as near-bank DRAM-PIM, FCMs refer to the execution of multiple consecutive convolution layers “fused” into a single spatial tile computation—fused-layer dataflow. Under the PIMfused model, compiler/runtime systems partition the network into “fused-kernels”, which are sequences of CONV±BN±ReLU±POOL that execute atomically per tile across bank-local PIMcores.

The decoupling of inter-bank dependencies is captured by a transfer saving ratio that depends on the number of banks: the larger the bank group tile, the greater the suppression of cross-bank traffic. Empirical measurements on ResNet18 show that, at optimized buffer sizes (GBUF=32 KB, LBUF=256 B, 4-bank PIMcores), memory cycles, system energy, and die area are each reduced to a fraction of those of a GDDR6-AiM-like baseline (Yang et al., 11 Nov 2025).

This hardware-software FCM instantiation underscores the strategic advantage of spatially local computation on intermediate features, trading small redundancy against massive off-chip transfer savings.
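The redundancy side of this trade-off follows from standard receptive-field arithmetic: to produce one output tile through several fused layers, each tile must load an input region enlarged by the layers' kernels. The sketch below shows that generic calculation; it is not the PIMfused cost model, and the function names are illustrative assumptions.

```python
# Input extent and halo overhead for a stack of fused convolution layers.

def input_tile_extent(out_extent, layers):
    """layers: list of (kernel, stride), ordered first to last fused layer."""
    extent = out_extent
    for kernel, stride in reversed(layers):
        extent = (extent - 1) * stride + kernel  # walk back through one layer
    return extent

def halo_overhead(out_extent, layers):
    """Loaded input area divided by the non-overlapping share of the input."""
    needed = input_tile_extent(out_extent, layers)
    exclusive = out_extent
    for _, stride in layers:
        exclusive *= stride
    return (needed * needed) / (exclusive * exclusive)

two_conv3x3 = [(3, 1), (3, 1)]
print(input_tile_extent(8, two_conv3x3))   # 12
print(halo_overhead(8, two_conv3x3))       # 2.25
print(halo_overhead(32, two_conv3x3))      # 1.265625
```

Larger tiles amortize the halo, which is consistent with the observation that larger buffers reduce redundant loads at the price of extra on-chip storage.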

5. Comparative Summary and Performance Impact

A comparative view of key FCM realizations is provided below:

FCM Context | Core Principle | Gains/Benefits
Multi-scale CFN (Liu et al., 2016) | Side-branch fusion + adaptive weighting | 1–2% accuracy lift, small parameter increase, strong transfer
Multimodal UNet (Sun et al., 2018) | Shared/private 1×1 conv fusion | At least +2 pp accuracy, sharper boundaries
Attention-augmented (Yang et al., 2021) | Channel/spatial attention in FCMs | Higher mean F1 and overall accuracy, low parameter overhead
GPU DW/PW kernel fusion (Qararyah et al., 2024) | Operator fusion, shared-memory dataflow | Speedup over cuDNN, up to 83% DRAM traffic cut, energy savings
DRAM-PIM fused-kernel (Yang et al., 11 Nov 2025) | Bank-local fused-layer execution | Memory-cycle and energy savings

A general pattern emerges: FCMs, regardless of their domain, provide substantial improvements in statistical accuracy, energy efficiency, or computational latency with modest increases in model or hardware complexity.

6. Design Principles, Practical Recommendations, and Limitations

Across FCM variants, several design patterns recur:

  • Modular low-dimensional convolutional fusion (1×1 kernels) suffices for strong representational gains while incurring low overhead.
  • Placement of FCMs at all but the earliest pooling/down-sampling layers is recommended (CFN).
  • For multi-branch and multi-modal fusion, explicit construction of shared and private feature paths (with batchnorm/ReLU) is critical.
  • When hardware-level fusion is targeted, analytic models—such as those in FusePlanner or buffer sizing in PIMfused—are essential to maximize feasibility and benefit.

Typical hyperparameters, initialization strategies, and regularization schemes directly transfer from baseline CNN pipelines (Liu et al., 2016, Sun et al., 2018). For operator-level fusion, aggressive use of shared on-chip memory, careful bank-aligned layout, and adaptive tiling produce robust cross-architecture benefits.

Limitations include diminishing returns on extremely deep networks (CFN), overhead in additional buffer resources (PIMfused), and hardware-imposed fusion feasibility (GPU FCMs). Adopting FCMs at design points with excessive layerwise redundancy or in small-scale regimes may yield marginal gains.

7. Application Domains and Transferability

While FCMs originated in multi-scale vision recognition and segmentation (CIFAR, ImageNet, ISPRS), their principles have been successfully extended to:

  • Scene recognition, fine-grained categorization, and image retrieval (CFN: consistent 1–6% accuracy/mAP lift when fused feature is used off-the-shelf).
  • Remote sensing, notably where multimodal fusion or attention-based fine discrimination is critical (DeepUNet multitask, AFNet).
  • Compact models (MobileNet, CeiT, CMT) in resource-constrained inference deployments, with direct hardware-level translation to energy and latency reductions.

A plausible implication is that the central paradigm of FCMs—hierarchical, explicit, and hardware-aware fusion—remains viable, adaptable, and beneficial wherever convolutional feature aggregation or layered operator execution creates performance bottlenecks (Liu et al., 2016, Sun et al., 2018, Yang et al., 2021, Qararyah et al., 2024, Yang et al., 11 Nov 2025).
