
DFC Attention: Efficient Global Attention

Updated 25 January 2026
  • DFC Attention is a lightweight global attention mechanism that decouples dense self-attention into two orthogonal, fully-connected operations.
  • It efficiently models global dependencies in CNNs by using horizontal and vertical passes, significantly reducing computational and memory costs.
  • DFC Attention demonstrates superior accuracy and efficiency across tasks like classification, detection, and medical image analysis in resource-constrained environments.

Decoupled Fully-Connected (DFC) Attention is a lightweight global attention mechanism introduced to address the need for efficient, hardware-friendly non-local feature modeling within convolutional neural networks, particularly for resource-constrained scenarios such as mobile vision tasks. DFC attention achieves global receptive field coverage by factorizing dense self-attention into two orthogonal, parameter-efficient transformations—horizontal and vertical fully-connected operations—dramatically reducing computational and memory cost compared to canonical self-attention, while outperforming standard channel or spatial attention modules in practice (Tang et al., 2022, Hu et al., 2023).

1. Motivation and Conceptual Foundation

DFC attention was motivated by the observation that purely convolutional operations, while computationally cheap, are inherently local, with a limited ability to capture long-range dependencies. Standard self-attention used in transformer-style models enables global context modeling but incurs an O((H·W)²·C) computational and memory overhead that is infeasible for embedded and edge devices. Channel-only mechanisms (e.g., Squeeze-and-Excitation, SE), while lightweight, cannot encode spatial dependencies. DFC attention provides a sparse, separable approximation of full self-attention, using two learnable fully-connected mappings—one along the width (horizontal) and one along the height (vertical)—to propagate contextual information globally at a fraction of the computational cost (Tang et al., 2022, Hu et al., 2023).

2. Mathematical Formulation and Algorithm

Given an input tensor $X \in \mathbb{R}^{H \times W \times C}$ representing a 2D feature map with $C$ channels, DFC attention computes an output $Y$ as follows:

  1. Horizontal Fully-Connected Pass:
    • For each row $h$, $B_{h,:,:}$ is computed as $X_{h,:,:}$ multiplied by a learned width-wise matrix $O_w \in \mathbb{R}^{W \times W}$:

    $B_{h,w,c} = \sum_{w'=1}^{W} X_{h,w',c} \cdot [O_w]_{w',w}$

  2. Vertical Fully-Connected Pass:

    • For each column $w$, $M_{:,w,:}$ is computed as $B_{:,w,:}$ multiplied by a learned height-wise matrix $O_h \in \mathbb{R}^{H \times H}$:

    $M_{h,w,c} = \sum_{h'=1}^{H} B_{h',w,c} \cdot [O_h]_{h',h}$

  3. Nonlinearity and Gating:

    • Apply an element-wise sigmoid to produce the attention mask $A = \sigma(M)$.
    • The final output is computed as $Y = X \circ A$, with $\circ$ denoting Hadamard (element-wise) multiplication.

This decomposition enables global interaction among all pixels, yet requires only H² + W² parameters per block (for O_h and O_w), compared to O((H·W)²) for dense self-attention.

Pseudocode excerpt (Hu et al., 2023):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def DFC_Attention(X, O_w, O_h):
    # X: [H, W, C]; O_w: [W, W] and O_h: [H, H] learned weight matrices
    H, W, C = X.shape
    # Horizontal fully-connected pass (mix along the width axis)
    B = np.zeros_like(X)
    for h in range(H):
        for c in range(C):
            B[h, :, c] = np.dot(X[h, :, c], O_w)
    # Vertical fully-connected pass (mix along the height axis)
    M = np.zeros_like(B)
    for w in range(W):
        for c in range(C):
            M[:, w, c] = np.dot(B[:, w, c], O_h)
    # Sigmoid gating of the input features
    A = sigmoid(M)
    Y = X * A
    return Y
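For reference, the per-row and per-column loops can be collapsed into einsum contractions. The sketch below is an illustrative NumPy version, not the authors' implementation; random weights stand in for the learned O_w and O_h.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dfc_attention_vectorized(X, O_w, O_h):
    """Vectorized DFC attention: X [H, W, C], O_w [W, W], O_h [H, H]."""
    # Horizontal pass: contract the width axis of X with O_w.
    B = np.einsum('hwc,wv->hvc', X, O_w)   # B[h, v, c] = sum_w X[h, w, c] * O_w[w, v]
    # Vertical pass: contract the height axis of B with O_h.
    M = np.einsum('hwc,hg->gwc', B, O_h)   # M[g, w, c] = sum_h B[h, w, c] * O_h[h, g]
    return X * sigmoid(M)                  # element-wise gating

# Quick shape check with random stand-ins for the learned matrices.
rng = np.random.default_rng(0)
H, W, C = 4, 5, 3
X = rng.standard_normal((H, W, C))
O_w = rng.standard_normal((W, W))
O_h = rng.standard_normal((H, H))
Y = dfc_attention_vectorized(X, O_w, O_h)
assert Y.shape == (H, W, C)
```

The two contractions touch only one spatial axis each, which is what keeps the cost at C·(H·W² + H²·W) rather than quadratic in H·W.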

3. Integration within Network Architectures

DFC attention is typically embedded within inverted-residual or "Ghost" bottleneck modules, as seen in GhostNet V2 and other lightweight vision backbones:

  • The expansion stage (1×1 convolution plus depth-wise convolutions) produces intermediate features.
  • DFC attention is applied either directly on these features or after spatial downsampling for further efficiency.
  • The channel-wise expansion is modulated by the generated DFC attention mask.
  • A compression (projection) stage reduces the output dimensionality back to match the residual connection (Tang et al., 2022, Hu et al., 2023).

Table: Comparison of Attention Modules in GhostNetV2 (Tang et al., 2022)

  Module              Params (M)   FLOPs (M)   ImageNet Top-1 (%)   Latency (ms)
  GhostNet V1         5.2          141         73.9                 31.1
  MobileViT-SA        5.2          172         74.4                 72.3
  GhostNetV2 (+DFC)   6.1          167         75.3                 37.5

This integration strategy allows DFC attention to aggregate non-local and local information while keeping the bottleneck structure efficient for both training and deployment.
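The data flow above can be sketched end to end. This is a minimal NumPy illustration under simplifying assumptions, not the GhostNetV2 implementation: 1×1 convolutions are modeled as per-pixel channel matmuls, the depth-wise convolution and normalization layers are omitted, and all helper names and shapes (bottleneck_with_dfc, W_expand, W_project) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dfc_mask(X, O_w, O_h):
    # Horizontal then vertical fully-connected passes, then sigmoid gating.
    B = np.einsum('hwc,wv->hvc', X, O_w)
    M = np.einsum('hwc,hg->gwc', B, O_h)
    return sigmoid(M)

def bottleneck_with_dfc(X, W_expand, O_w, O_h, W_project):
    """Sketch of an inverted-residual bottleneck gated by a DFC mask.

    X: [H, W, C_in]; W_expand: [C_in, C_mid]; W_project: [C_mid, C_in].
    1x1 convolutions are expressed as per-pixel channel matmuls; the
    depth-wise convolution of the real block is omitted for brevity.
    """
    expanded = np.maximum(X @ W_expand, 0.0)          # expansion + ReLU
    gated = expanded * dfc_mask(expanded, O_w, O_h)   # modulate by the DFC mask
    projected = gated @ W_project                     # compression back to C_in
    return X + projected                              # residual connection

rng = np.random.default_rng(1)
H, W, C_in, C_mid = 8, 8, 16, 48
X = rng.standard_normal((H, W, C_in))
Y = bottleneck_with_dfc(
    X,
    0.1 * rng.standard_normal((C_in, C_mid)),
    0.1 * rng.standard_normal((W, W)),
    0.1 * rng.standard_normal((H, H)),
    0.1 * rng.standard_normal((C_mid, C_in)),
)
assert Y.shape == X.shape
```

Note that the mask is computed on the expanded features, so the O_w and O_h matrices are shared across all C_mid channels.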

4. Computational Complexity and Hardware Friendliness

DFC attention achieves a significant reduction in computation through decoupling and (optionally) spatial reduction:

  • Parameter Count: Only H² + W² per attention block (when pooling to H × W spatial size).
  • FLOPs: C·(H·W² + H²·W) for C channels, orders of magnitude below global self-attention.
  • Implementation: Both horizontal and vertical operations can be fused into depth-wise convolution primitives (respectively, 1×W and H×1 filter kernels). Channel-wise independence enables parallelization; the standard NHWC/NCHW data layouts can be used efficiently on ARM CPUs and modern GPUs (Tang et al., 2022).

Spatial downsampling before DFC attention (e.g., pooling to H/2 × W/2) can further reduce compute without accuracy degradation.
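These counts can be verified with a short calculation. The helpers below are an illustrative sketch: the dense self-attention figure counts only the N×N token-token interactions of the attention map, ignoring Q/K/V projections.

```python
def dfc_cost(H, W, C):
    """Parameters and multiply-adds for one DFC attention block."""
    params = H * H + W * W                  # the O_h and O_w weight matrices
    flops = C * (H * W * W + H * H * W)     # horizontal + vertical passes
    return params, flops

def dense_attention_cost(H, W, C):
    """Token-token interaction cost of dense self-attention over N = H*W
    tokens (attention map only; Q/K/V projections excluded)."""
    N = H * W
    return 0, N * N * C

# Example: a 14x14 late-stage feature map with 128 channels.
p, f = dfc_cost(14, 14, 128)
_, f_sa = dense_attention_cost(14, 14, 128)
print(p, f)        # 392 702464
print(f_sa / f)    # 7.0
```

Even at this small resolution the decoupled passes save a 7× factor on the attention map alone, and the ratio N/(H+W) grows linearly with the side length of the feature map.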

5. Empirical Performance

Empirical evidence demonstrates that DFC attention provides both accuracy and efficiency improvements across several domains:

  • ImageNet Classification: GhostNet V2 with DFC attention (1.0× width) achieves 75.3% top-1 accuracy at 167M FLOPs, a 1.4% gain over GhostNet V1 and competitive latency (37.5ms). Combining DFC attention on both expansion and output branches further lifts performance to 75.5% (Tang et al., 2022).
  • MobileNetV2 Backbone: Adding DFC surpasses SE, CBAM, and Coordinate-Attention: 75.4% versus 74.5% (Coordinate-Attention), at similar parameter and FLOP budgets (Tang et al., 2022).
  • Skin Lesion Detection: In dermoscopic analysis (HAM10000), DFC-augmented lightweight models achieve 92.4% accuracy and an 85.4% F1, with only a modest increase in FLOPs (from 53.6M to 63.6M), outperforming previous SE/CBAM-based methods, especially in minority class recall (Hu et al., 2023).
  • Downstream Detection/Segmentation: GhostNet V2 (with DFC) outperforms GhostNet V1 on object detection (COCO, AP=22.3 vs 21.8) and ADE20K segmentation (mIoU=35.52% vs 34.17%) at negligible cost increase.

6. Comparison with Conventional Attention Modules

DFC attention fundamentally differs from conventional attention modules:

  • Self-attention (ViT, MSA): O((H·W)²·C) FLOPs, O((H·W)²) memory, high modeling capacity but cost-prohibitive outside high-throughput environments.
  • SE-block, CBAM, Coordinate-Attention: Channel- or axis-wise recalibration with global pooling, inexpensive but lose cross-spatial interaction capability.
  • DFC Attention: Approximates full self-attention’s global effect via axis-wise separable linear layers, achieves global receptive field, and offers a superior accuracy–efficiency tradeoff, especially when hardware and energy budgets are tight (Tang et al., 2022, Hu et al., 2023).

Contrary to potential misconception, DFC attention is not merely a fusion of spatial and channel attention; it represents a mathematically distinct factorization of token–token relations along orthogonal axes.
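A quick numerical check of this distinction: because each output position mixes an entire row and an entire column, perturbing a single input pixel changes the DFC attention mask at every spatial location, unlike SE-style pooling, which collapses space to one scalar per channel. The weights below are random stand-ins for learned matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dfc_mask(X, O_w, O_h):
    # Horizontal then vertical fully-connected passes, then sigmoid.
    B = np.einsum('hwc,wv->hvc', X, O_w)
    M = np.einsum('hwc,hg->gwc', B, O_h)
    return sigmoid(M)

rng = np.random.default_rng(2)
H, W, C = 6, 6, 4
X = rng.standard_normal((H, W, C))
O_w = 0.5 * rng.standard_normal((W, W))
O_h = 0.5 * rng.standard_normal((H, H))

# Perturb one pixel and measure how the mask responds everywhere.
X2 = X.copy()
X2[0, 0, :] += 1.0
delta = np.abs(dfc_mask(X2, O_w, O_h) - dfc_mask(X, O_w, O_h)).sum(axis=-1)

assert (delta > 0).all()   # every spatial position is affected
```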

7. Applications and Limitations

DFC attention excels in vision tasks, especially when global context must be captured at minimal computational expense. Demonstrated applications include:

  • Mobile and embedded vision: real-time classification, detection, and segmentation on low-power devices (Tang et al., 2022, Hu et al., 2023).
  • Medical image analysis: robust detection of subtle/class-imbalanced pathologies in dermoscopy at low energy cost (Hu et al., 2023).

A plausible implication is that as model scaling and deployment constraints intensify, DFC’s approach provides a template for further decompositions of dense attention that are compatible with existing deep learning frameworks and hardware primitives.

Current limitations include the assumption of axis-wise independence (potentially missing certain diagonal or higher-order interactions) and a focus on spatial 2D features; generalization to higher-dimensional or sequence-based contexts requires further development.


Key Sources:

  • "GhostNetV2: Enhance Cheap Operation with Long-Range Attention" (Tang et al., 2022)
  • "Attention-Driven Lightweight Model for Pigmented Skin Lesion Detection" (Hu et al., 2023)