Full Attention (FA) in Deep Learning
- Full Attention (FA) is an architectural mechanism that unifies channel-spatial or bidirectional temporal dependencies into a single similarity map, addressing the 'attention missing' issue.
- It improves performance on tasks such as semantic segmentation and speech enhancement by integrating global context while keeping computation and memory demands modest.
- Its formulation combines pooling, slicing, and windowed attention to simultaneously capture fine structural details and long-range dependencies with measurable empirical gains.
Full Attention (FA) denotes an architectural mechanism designed to simultaneously integrate multi-dimensional dependencies within a single module, eliminating the need to stack or alternate independent attention blocks. FA modules have been introduced to address inherent information loss in conventional self-attention paradigms, in both computer vision (semantic segmentation) and speech enhancement, by capturing channel-spatial or forward-backward temporal features in a unified structure. The approach yields both computational efficiency and empirical performance improvements over standard attention architectures. Key formulations appear in semantic segmentation ("Fully Attentional Network for Semantic Segmentation" (Song et al., 2021)) and bidirectional speech enhancement ("Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement" (Yan et al., 2021)).
1. Motivation: Overcoming Dimensional Collapse and Attention Missing
Conventional non-local/self-attention schemes generally compute similarity maps along a single tensor axis, either spatial (affinity of shape $HW \times HW$) or channel-wise ($C \times C$), by pooling or compressing the remaining dimensions. For instance (a shape-level sketch appears at the end of this section):
- Channel NL: Computes channel-wise affinities by flattening spatial dimensions, enabling information transfer between channel maps but prohibiting inter-spatial dependencies.
- Spatial NL: Computes spatial affinities by collapsing the channel dimension, linking pixelwise representations but treating each channel independently.
This dimension-wise decoupling results in an "attention missing" phenomenon: Channel NL improves consistency for large objects but loses fine structure, while Spatial NL preserves thin/small categories but results in label inconsistency inside large regions. Combined or stacked (e.g., Dual/CS NL), such blocks still only provide partial 3D context because each sub-block observes a single axis at a time (Song et al., 2021). In time-series modeling, unidirectional attention restricts the decoder to only historical context, missing important latent cues in future frames, especially detrimental for speech enhancement tasks (Yan et al., 2021).
Full Attention modules are constructed to overcome these limitations by producing a single similarity map, enabling comprehensive modeling of both channel and spatial—or forward and backward temporal—dependencies in a single computational pass.
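To make the dimensional decoupling concrete, the short sketch below computes both affinity maps for a toy feature tensor; the tensor sizes and the use of raw features as queries/keys (no learned projections) are illustrative simplifications, not details from either paper.

```python
import torch

n, c, h, w = 1, 16, 8, 8                      # illustrative sizes, not from the papers
x = torch.randn(n, c, h, w)

# Channel NL: flatten the spatial axes, compare channel maps -> C x C affinity.
x_flat = x.view(n, c, h * w)                  # (N, C, HW)
channel_affinity = torch.softmax(x_flat @ x_flat.transpose(1, 2), dim=-1)   # (N, C, C)

# Spatial NL: treat each pixel's C-vector as a token -> HW x HW affinity.
x_tok = x_flat.transpose(1, 2)                # (N, HW, C)
spatial_affinity = torch.softmax(x_tok @ x_tok.transpose(1, 2), dim=-1)     # (N, HW, HW)

print(channel_affinity.shape)                 # torch.Size([1, 16, 16])
print(spatial_affinity.shape)                 # torch.Size([1, 64, 64])
```

Neither map can express how a particular channel behaves at a particular spatial location, which is precisely the interaction the FA formulations below recover.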
2. Mathematical Formulation of Full Attention
2.1 Semantic Segmentation: Channel-Spatial Full Attention
Given an input tensor $X \in \mathbb{R}^{C \times H \times W}$, FA constructs global priors and key/value blocks to simultaneously encode channel and spatial relationships:
- Global Priors:
- Row-pooling (averaging over $W$) yields a prior of shape $C \times H \times 1$, tiled along $W$ to $C \times H \times W$.
- Column-pooling (averaging over $H$) yields a prior of shape $C \times 1 \times W$, tiled along $H$ to $C \times H \times W$.
- Both are cut into $H$ and $W$ slices, respectively, and stacked into $H + W$ groups $Q^i \in \mathbb{R}^{C \times S}$ (square maps, $H = W = S$).
- Key/Value Preparation:
- The same slicing/merging, applied to key and value projections of $X$, produces $K^i, V^i \in \mathbb{R}^{C \times S}$ for each group $i$.
- Affinity, Context, and Output:
- For group $i$, channel affinities: $A^i = \operatorname{softmax}\!\big(Q^i (K^i)^{\top}\big) \in \mathbb{R}^{C \times C}$ (row-wise softmax).
- Aggregated context: $Y^i = A^i V^i \in \mathbb{R}^{C \times S}$.
- Slices are reshaped back and summed across the $H$-cut and the $W$-cut to form the context map $Y \in \mathbb{R}^{C \times H \times W}$.
- FA output: $Z = \gamma Y + X$ with learnable scalar $\gamma$ (initialized at $0$); a minimal code sketch follows this list.
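A minimal PyTorch sketch of this channel-spatial FA block is given below. The class name, the $1\times1$ key/value projections, the use of unprojected pooled priors as queries, and the pairing of each prior with the cut along the axis it was pooled over are illustrative assumptions; the slicing is realized implicitly as a batched matmul over the $S$ groups.

```python
import torch
import torch.nn as nn


class FullAttentionBlock(nn.Module):
    """Minimal channel-spatial full attention sketch for square maps (H == W == S)."""

    def __init__(self, channels: int):
        super().__init__()
        self.key_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.value_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable scalar, initialized at 0

    @staticmethod
    def _grouped_channel_attention(q, k, v, cut_dim):
        # cut_dim == 2 slices along H (row cut); cut_dim == 3 slices along W (column cut).
        perm = (0, 2, 1, 3) if cut_dim == 2 else (0, 3, 1, 2)
        q_g, k_g, v_g = (t.permute(*perm) for t in (q, k, v))            # (N, S, C, S)
        affinity = torch.softmax(q_g @ k_g.transpose(-1, -2), dim=-1)    # (N, S, C, C)
        ctx = affinity @ v_g                                             # (N, S, C, S)
        return ctx.permute(0, 2, 1, 3) if cut_dim == 2 else ctx.permute(0, 2, 3, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        assert h == w, "this sketch assumes square feature maps"

        # Global priors: pool over one spatial axis, then tile back along it.
        prior_h = x.mean(dim=3, keepdim=True).expand(n, c, h, w)   # row averages, constant along W
        prior_w = x.mean(dim=2, keepdim=True).expand(n, c, h, w)   # column averages, constant along H

        k = self.key_proj(x)
        v = self.value_proj(x)

        # Pair each prior with the cut along the axis it was pooled over, so the
        # query inside every slice still varies with spatial position.
        y = (self._grouped_channel_attention(prior_w, k, v, cut_dim=2) +
             self._grouped_channel_attention(prior_h, k, v, cut_dim=3))
        return self.gamma * y + x
```

A forward pass returns a tensor of the same shape as its input, so the block can be dropped between any two convolutional stages that operate on square maps.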
2.2 Speech Enhancement: Bidirectional Full Attention
Let the noisy feature sequence be $X = \{x_t\}_{t=1}^{T}$ with $x_t \in \mathbb{R}^{d}$:
- Embedding and Encoding: a dense embedding maps each frame $x_t$ to a hidden representation $h_t$, which feeds the forward and backward LSTM encoders.
- Bidirectional Keys and Queries: Produced via forward/backward LSTM encoders:
- Forward key $k_t^{f}$ and backward key $k_t^{b}$, from the forward and backward LSTM encoders, respectively.
- Forward query $q_t^{f}$ and backward query $q_t^{b}$, produced analogously.
- Windowed Multiplicative Attention:
- Forward-attention: scores $e^{f}_{t,\tau} = (q^{f}_t)^{\top} k^{f}_{\tau}$ over a backward-looking window $\tau \in [t - w_f,\, t]$, normalized by softmax to weights $\alpha^{f}_{t,\tau}$.
- Backward-attention: scores $e^{b}_{t,\tau} = (q^{b}_t)^{\top} k^{b}_{\tau}$ over a forward-looking window $\tau \in [t,\, t + w_b]$, normalized to weights $\alpha^{b}_{t,\tau}$.
- Context Vectors and Decoder:
- Forward context: $c^{f}_t = \sum_{\tau} \alpha^{f}_{t,\tau}\, k^{f}_{\tau}$
- Backward context: $c^{b}_t = \sum_{\tau} \alpha^{b}_{t,\tau}\, k^{b}_{\tau}$
- Full context: $c_t = [\,c^{f}_t;\, c^{b}_t\,]$ (concatenation)
- The decoder combines $c_t$ with the key/query representations to produce a sigmoid gain mask $g_t$ and the enhanced output features $\hat{x}_t = g_t \odot x_t$; a sketch of the windowed attention step follows this list.
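The windowed bidirectional attention step can be sketched in a few lines of NumPy, shown below. The dot-product scoring and the choice of summing keys (the description above mentions no separate value projection) follow the formulation, while the concrete window lengths, array sizes, and variable names are illustrative assumptions.

```python
import numpy as np

def windowed_context(queries, keys, w_back, w_fwd):
    """Windowed multiplicative attention: frame t attends to keys in
    [t - w_back, t + w_fwd] and returns the softmax-weighted sum of those keys."""
    T, d = queries.shape
    contexts = np.zeros((T, d))
    for t in range(T):
        lo, hi = max(0, t - w_back), min(T, t + w_fwd + 1)
        scores = keys[lo:hi] @ queries[t]            # multiplicative (dot-product) scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax over the window
        contexts[t] = weights @ keys[lo:hi]
    return contexts

T, d = 100, 64                                       # illustrative sizes
q_f, k_f = np.random.randn(T, d), np.random.randn(T, d)   # forward-LSTM queries/keys
q_b, k_b = np.random.randn(T, d), np.random.randn(T, d)   # backward-LSTM queries/keys

c_f = windowed_context(q_f, k_f, w_back=5, w_fwd=0)  # forward branch: past frames only
c_b = windowed_context(q_b, k_b, w_back=0, w_fwd=5)  # backward branch: future frames only
c   = np.concatenate([c_f, c_b], axis=1)             # full bidirectional context, (T, 2d)
```

Setting `w_fwd=0` in the forward branch and `w_back=0` in the backward branch keeps the two context streams strictly causal and anti-causal, which is what makes their concatenation a genuinely bidirectional context.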
3. Algorithmic Procedure and Architectural Integration
3.1 Semantic Segmentation (FLANet + FA Block)
- Extract backbone features (e.g., ResNet-101, $1/8$ resolution).
- Two convolution layers reduce the channel dimension of the backbone features.
- Apply the FA block to the reduced feature map as described above.
- Pass the result to a conv + upsampling prediction head for segmentation logits.
- FA replaces non-local blocks, functioning as a plug-and-play module after channel reduction and before classification (Song et al., 2021).
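A schematic of this plug-and-play integration is sketched below; it reuses the `FullAttentionBlock` class from the Section 2.1 sketch, and the channel widths, class count, and crop size in the usage comment are placeholders rather than values confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FLAHead(nn.Module):
    """Plug-and-play head: channel reduction -> full attention -> classifier."""

    def __init__(self, in_channels, mid_channels, num_classes):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        self.fa = FullAttentionBlock(mid_channels)    # sketch class from Section 2.1
        self.classify = nn.Conv2d(mid_channels, num_classes, kernel_size=1)

    def forward(self, feats, out_size):
        x = self.reduce(feats)                        # channel reduction
        x = self.fa(x)                                # unified channel-spatial attention
        logits = self.classify(x)
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

# Usage with 1/8-resolution backbone features (placeholder sizes):
# head = FLAHead(in_channels=2048, mid_channels=512, num_classes=19)
# logits = head(backbone_features, out_size=(769, 769))
```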
3.2 Speech Enhancement (Bidirectional seq-to-seq with FA)
- Dense embedding, parallel forward and backward LSTM encoders for keys/queries.
- FA module aggregates both past and future frames using windowed attention.
- Dense decoder produces enhancement vector and gain mask.
- Enhanced frame output via element-wise application of gain mask (Yan et al., 2021).
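The final masking step is straightforward; the sketch below shows one way the sigmoid gain mask would be applied element-wise to the noisy features, with all array sizes and names chosen for illustration.

```python
import numpy as np

def apply_gain_mask(noisy_feats, decoder_logits):
    """Element-wise enhancement: a sigmoid gain in [0, 1] scales each noisy feature."""
    gain = 1.0 / (1.0 + np.exp(-decoder_logits))     # sigmoid -> per-bin gain mask
    return gain * noisy_feats                        # enhanced features

noisy = np.abs(np.random.randn(100, 257))            # e.g. magnitude spectra, (frames, bins)
logits = np.random.randn(100, 257)                   # placeholder decoder output
enhanced = apply_gain_mask(noisy, logits)
```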
4. Computational Complexity and Efficiency
A critical advantage of FA is the unification of dependencies with controlled computation:
- FA (vision): $\mathcal{O}\big((H+W)\cdot S \cdot C^{2}\big)$ for square maps ($H = W = S$). For the evaluated input, FA adds 19.4 GFLOPs and 436 MB of memory, versus 113.6 GFLOPs and 1.38 GB for Dual NL, a reduction of 83% in FLOPs and 66% in GPU memory (Song et al., 2021).
- Baseline costs: Channel NL: $\mathcal{O}(C^{2}\cdot HW)$; Spatial NL: $\mathcal{O}\big((HW)^{2}\cdot C\big)$; Dual NL: the sum of both. A back-of-envelope comparison follows this list.
- Speech enhancement: the FA window sizes are tunable hyperparameters, and intermediate settings maximize performance; windows that are too wide dilute the attention, while windows that are too narrow lose context (Yan et al., 2021).
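As a rough sanity check on the ratios above, the following back-of-envelope arithmetic counts multiplies for the affinity and aggregation steps under the stated complexities. The concrete sizes (C = 512, H = W = 97) are assumptions, so only the relative ordering, with FA close to twice Channel NL and well below Spatial/Dual NL, should be read from the output.

```python
# Back-of-envelope multiply counts (affinity + aggregation) under the stated complexities.
C, H, W = 512, 97, 97          # assumed post-reduction feature size; illustrative only
S = H                          # square maps

channel_nl = 2 * C * C * H * W            # O(C^2 * HW)
spatial_nl = 2 * (H * W) * (H * W) * C    # O((HW)^2 * C)
dual_nl    = channel_nl + spatial_nl
full_attn  = 2 * (H + W) * C * C * S      # O((H + W) * S * C^2): one C x C affinity per slice

for name, ops in [("Channel NL", channel_nl), ("Spatial NL", spatial_nl),
                  ("Dual NL", dual_nl), ("Full Attention", full_attn)]:
    print(f"{name:>14}: {ops / 1e9:6.1f} G multiplies")
```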
5. Empirical Performance and Ablation Analysis
5.1 Semantic Segmentation
- Cityscapes test set:
- Res101+FLA: 83.0% mIoU (prior best: 82.4%)
- HRNetW48+FLA: 83.6% (OCR: 84.2%, with extra data)
- Cityscapes val-set ablation (Res50):
- Baseline: 74.5% (single-scale), 75.8% (multi-scale+flip)
- +Channel NL: 75.6% / 76.9%
- +Spatial NL: 76.3% / 77.6%
- +Dual/CS NL: ≈77.1–77.4% / 78.0–78.2%
- +FLA: 78.9% / 79.7% (∼2% gain over best stacking)
- ADE20K / PASCAL VOC:
- ADE20K val: FLANet (Res101): 46.68%; HRNetW48+FLA: 46.99%
- VOC test: FLANet (Res150): 88.5%
- Resource profile (Res50):
- Channel NL: 9.7 GFLOPs, 40 MB
- Spatial NL: 103.9 GFLOPs, 1.32 GB
- Dual/CS NL: 113.6 GFLOPs, 1.38 GB
- FLA: 19.4 GFLOPs, 436 MB
5.2 Speech Enhancement
- PESQ Comparison on THCHS-30 (Bi-Att = FA):
| SNR (dB) | LSTM-RNN | LSTM-Att | Bi-Att (FA) |
|---|---|---|---|
| -10 | 1.3532 | 1.3786 | 1.5615 |
| -5 | 1.7820 | 1.8277 | 1.8317 |
| 0 | 2.2624 | 2.4584 | 2.5195 |
| 5 | 2.4723 | 2.6439 | 2.6675 |
| 10 | 2.7027 | 2.8659 | 2.9185 |
Ablations indicate:
- Bidirectional FA outperforms unidirectional attention, especially under low SNR/irregular noise.
- Asymmetric windows (different forward and backward extents) map onto the asymmetric temporal dependencies in speech.
- Optimal context comes from intermediate window sizes; excessively large windows harm performance.
6. Context, Interpretation, and Impact
Full Attention modules offer a principled solution to the attention-missing problem observed in both spatial-channel (vision) and temporal (sequence) modeling. By constructing a similarity map that is global in all key dimensions, FA enables every unit (pixel or time frame) to relate to both its local and nonlocal context across axes previously factored independently. Integration into segmentation backbones is seamless (after channel reduction and before classification) and yields substantial empirical gains with only moderate additional computational cost (Song et al., 2021). In speech enhancement, bidirectional FA context resolves ambiguities that unidirectional architectures cannot, achieving state-of-the-art denoising metrics (Yan et al., 2021).
Future research may explore generalizing FA to more generic tensor structures, automating window selection, or hybridizing FA with transformer-based global attention mechanisms. A plausible implication is that the FA paradigm could extend to tasks requiring global context with tight resource budgets, such as real-time video segmentation or streaming speech enhancement.