Class Attention Block (CAB)
- CAB is a neural network module that combines joint feature aggregation with attention-based reweighting for improved feature extraction.
- It enables efficient multi-stage channel fusion in encoder-decoder models and refines class tokens in transformer-based systems.
- Empirical results show enhanced segmentation accuracy, reduced parameter cost, and effective continual learning through gated mechanisms.
A Class Attention Block (CAB) is a neural network module that operates on either channel or class-token dimensions and is designed to improve joint feature aggregation and attention-based reweighting. CAB modules are leveraged in diverse architectures, including U-shaped models for dense prediction and Vision Transformers for class-token-based representation learning. The two primary paradigms are: Channel Attention Bridge (CAB) for multi-stage channel fusion in encoder-decoder models (Ruan et al., 2022) and Class-Attention Block (CAB) for global class-token refinement in Transformer models (Cotogni et al., 2022). Both variants focus on extracting and combining the most informative features across input channels or tokens and modulating the information flow to subsequent network stages.
1. Channel Attention Bridge (CAB) in Encoder-Decoder Architectures
The Channel Attention Bridge (CAB) is designed for efficient multi-stage feature fusion in U-shaped architectures, such as MALUNet for medical image segmentation. In a six-stage encoder-decoder model, CAB sits in the skip-connection path between encoder and decoder, aggregating encoded features from stages 1 to 5, which have increasing channel sizes (Ruan et al., 2022).
Given per-stage feature maps $x_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, $i = 1, \dots, 5$, CAB executes the following steps:
- Channel Descriptor Extraction: Each $x_i$ is globally average pooled to a channel descriptor $g_i = \mathrm{GAP}(x_i) \in \mathbb{R}^{C_i}$.
- Multi-Stage Channel Fusion: All channel descriptors are concatenated, $g = \mathrm{Concat}(g_1, \dots, g_5)$.
- Local Correlation via 1D Convolution: $g$ is passed through a 1D convolution, yielding a fused descriptor $\tilde{g} = \mathrm{Conv1D}(g)$.
- Stage-Wise Global Gating: For each stage $i$, a stage-specific fully connected layer is applied, yielding an attention vector $a_i = \sigma(\mathrm{FC}_i(\tilde{g})) \in \mathbb{R}^{C_i}$, with $\sigma$ the element-wise sigmoid.
- Channel-Wise Reweighting and Residual Addition: The attention vector is broadcast over the spatial dimensions and used for channel-wise scaling and residual addition: $\hat{x}_i = a_i \odot x_i + x_i$.
This mechanism ensures each decoder block receives an adaptively fused and selectively reweighted collection of encoder features, improving the expressivity-to-parameter ratio in compute-constrained settings. CAB employs a lightweight 1D convolution without dilation and per-stage fully connected layers, with no normalization layers. All parameters are initialized using standard procedures (He initialization for Conv1D, Xavier for FC).
2. Mathematical Formulation of CAB in MALUNet
The CAB module in MALUNet (Ruan et al., 2022) is defined by equations (6)-(10) of the paper, which can be summarized as:

$$
\begin{aligned}
g_i &= \mathrm{GAP}(x_i), \quad i = 1, \dots, N \\
g &= \mathrm{Concat}(g_1, g_2, \dots, g_N) \\
\tilde{g} &= \mathrm{Conv1D}(g) \\
a_i &= \sigma\!\left(\mathrm{FC}_i(\tilde{g})\right) \\
\hat{x}_i &= a_i \odot x_i + x_i
\end{aligned}
$$

where $N = 5$ (the number of encoder stages) and $x_i \in \mathbb{R}^{C_i \times H_i \times W_i}$. $\mathrm{GAP}(\cdot)$ denotes global average pooling; $\mathrm{Conv1D}$ and the $\mathrm{FC}_i$ layers are learnable; $\sigma$ is the sigmoid.
Local context fusion occurs via Conv1D, enabling short-range channel interaction among stages, while the FC layers provide stage-specific global attention. The design avoids normalization and maintains a low parameter cost for both the Conv1D and the per-stage FC layers.
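The following is a minimal PyTorch sketch of this fusion scheme, assuming five encoder stages with illustrative channel sizes; the class name, kernel size, and channel counts are assumptions for illustration rather than the exact MALUNet configuration.

```python
import torch
import torch.nn as nn

class ChannelAttentionBridge(nn.Module):
    """Sketch of a CAB-style multi-stage channel fusion (after Ruan et al., 2022).

    Channel counts, kernel size, and initialization details are illustrative assumptions.
    """
    def __init__(self, stage_channels=(8, 16, 24, 32, 48), kernel_size=3):
        super().__init__()
        total = sum(stage_channels)
        # Local correlation across the concatenated channel descriptors.
        self.conv1d = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        # One FC per stage: fused descriptor -> stage-specific attention vector.
        self.fcs = nn.ModuleList([nn.Linear(total, c) for c in stage_channels])
        nn.init.kaiming_normal_(self.conv1d.weight)       # He init for Conv1D
        for fc in self.fcs:
            nn.init.xavier_uniform_(fc.weight)            # Xavier init for FC

    def forward(self, features):
        # features: list of 5 tensors, each of shape (B, C_i, H_i, W_i)
        descriptors = [f.mean(dim=(2, 3)) for f in features]   # GAP -> (B, C_i)
        g = torch.cat(descriptors, dim=1).unsqueeze(1)          # (B, 1, sum C_i)
        g = self.conv1d(g).squeeze(1)                           # local channel mixing
        outputs = []
        for f, fc in zip(features, self.fcs):
            a = torch.sigmoid(fc(g))                            # stage-specific gate (B, C_i)
            a = a.unsqueeze(-1).unsqueeze(-1)                   # broadcast to (B, C_i, 1, 1)
            outputs.append(f * a + f)                           # reweight + residual
        return outputs

# Usage sketch: five encoder feature maps at decreasing spatial resolution.
feats = [torch.randn(2, c, 64 // 2**i, 64 // 2**i) for i, c in enumerate((8, 16, 24, 32, 48))]
fused = ChannelAttentionBridge()(feats)
```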
3. Standard Class-Attention Block in Vision Transformers
The Class-Attention Block (CAB) in transformer-based architectures (e.g., CaiT) incorporates a learnable class token $x_{\mathrm{cls}}$ alongside patch tokens $x_{\mathrm{patch}}$, forming an augmented sequence $z = [x_{\mathrm{cls}}, x_{\mathrm{patch}}]$ (Cotogni et al., 2022).
The standard attention mechanism proceeds as follows:
- Linear projections for query ($Q = W_q\, x_{\mathrm{cls}}$), key ($K = W_k\, z$), and value ($V = W_v\, z$), with queries formed from the class token only.
- Attention weights: $A = \mathrm{Softmax}\!\left(Q K^\top / \sqrt{d/h}\right)$, where $d$ is the embedding dimension and $h$ is the number of heads.
- Attended class-token update: $x_{\mathrm{cls}}' = x_{\mathrm{cls}} + W_o\, A\, V$.
- MLP refinement and residuals: the class token is updated through an MLP and residual addition: $x_{\mathrm{cls}}'' = x_{\mathrm{cls}}' + \mathrm{MLP}(x_{\mathrm{cls}}')$, where $W_q, W_k, W_v, W_o$ and the MLP are all learnable.
CAB thereby aggregates global image representations via direct attention between class and patch tokens, which are pivotal for classification tasks and task-specific transfer in continual learning.
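A compact PyTorch sketch of such a class-attention block follows, simplified to a single head and omitting normalization; the dimensions, layer names, and the single-head simplification are assumptions for illustration, not the exact CaiT or Cotogni et al. implementation.

```python
import torch
import torch.nn as nn

class ClassAttentionBlock(nn.Module):
    """Sketch of a class-attention block: only the class token forms queries."""
    def __init__(self, dim=192, mlp_ratio=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )
        self.scale = dim ** -0.5

    def forward(self, cls_token, patch_tokens):
        # cls_token: (B, 1, D); patch_tokens: (B, N, D)
        z = torch.cat([cls_token, patch_tokens], dim=1)          # augmented sequence (B, N+1, D)
        q = self.q(cls_token)                                    # queries from class token only
        k, v = self.k(z), self.v(z)
        attn = (q @ k.transpose(-2, -1)) * self.scale            # (B, 1, N+1)
        attn = attn.softmax(dim=-1)
        cls_token = cls_token + self.proj(attn @ v)              # attended update + residual
        cls_token = cls_token + self.mlp(cls_token)              # MLP refinement + residual
        return cls_token

# Usage sketch with a 14x14 patch grid.
cls_out = ClassAttentionBlock()(torch.randn(2, 1, 192), torch.randn(2, 196, 192))
```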
4. Gated Class-Attention Block for Continual Learning
The Gated Class-Attention Block (GCAB) extends the standard transformer CAB to address catastrophic forgetting in exemplar-free continual learning (Cotogni et al., 2022). GCAB introduces task-specific soft masks that gate both forward activations and parameter gradients.
Key mechanisms:
- Soft Mask Parametrization: For each task $t$, the mask $m_t$ is defined by $m_t = \sigma(s \cdot E\, \tau_t)$, where $E$ is a learnable embedding, $s$ a scaling factor, and $\tau_t$ a one-hot task selector.
- Gated Attention Application: All Q/K/V projections and MLP activations are modulated. For example, $Q_t = m_t^{q} \odot Q$, $K_t = m_t^{k} \odot K$, and so on.
- Sparsity Regularization: The cumulative mask $m_{\le t}$ is updated by $m_{\le t} = \max\!\left(m_{\le t-1}, m_t\right)$. A loss term encourages sparsity in new masks except at already-allocated capacity.
- Gradient Masking/Weight Protection: Gradients are masked during backpropagation to preserve weights used by previous tasks: $\nabla W \leftarrow \left(1 - m_{\le t-1}\right) \odot \nabla W$, where $m_{\le t-1}$ is broadcast to the shape of $W$.
GCAB operates at the final transformer block, enabling task-specific activation patterns and selective plasticity. At inference, all stored task-specific masks are sequentially applied, and outputs concatenated, obviating the need for task-ID during test time. This approach distinguishes GCAB from hard parameter-isolation approaches, as it leverages shared weights and soft task gating.
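Below is a simplified PyTorch sketch of the soft-mask gating, sparsity regularization, and gradient protection described above; the per-channel mask granularity, the loss form, and all names and constants are assumptions based on this description rather than the exact GCAB code.

```python
import torch
import torch.nn as nn

class GatedProjection(nn.Module):
    """Sketch of a GCAB-style gated projection with per-task soft masks."""
    def __init__(self, dim=192, num_tasks=10):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # One learnable embedding row per task; indexing plays the role of the one-hot selector.
        self.mask_embedding = nn.Parameter(torch.zeros(num_tasks, dim))

    def mask(self, task_id, scale=100.0):
        # Soft mask m_t = sigmoid(s * e_t); a large scale pushes the mask towards binary.
        return torch.sigmoid(scale * self.mask_embedding[task_id])

    def forward(self, x, task_id, scale=100.0):
        return self.mask(task_id, scale) * self.proj(x)          # gate the activation

def sparsity_loss(mask, cumulative_mask):
    # Encourage new masks to be sparse except where capacity is already allocated.
    free = 1.0 - cumulative_mask
    return (mask * free).sum() / (free.sum() + 1e-8)

def protect_gradients(weight, cumulative_mask):
    # Zero gradients of weight rows claimed by previous tasks' cumulative mask.
    if weight.grad is not None:
        weight.grad.mul_((1.0 - cumulative_mask).unsqueeze(1))

# Usage sketch for one training step on the first task.
layer = GatedProjection()
x = torch.randn(4, 192)
cumulative = torch.zeros(192)            # updated as max(cumulative, mask) after each task
out = layer(x, task_id=0)
loss = out.pow(2).mean() + 0.1 * sparsity_loss(layer.mask(0), cumulative)
loss.backward()
protect_gradients(layer.proj.weight, cumulative)
```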
5. Hyper-parameters and Implementation Details
CAB implementations are distinguished by carefully selected hyper-parameters:
- For MALUNet CAB (Ruan et al., 2022):
- Number of encoder stages fused: 5 (stages 1 to 5 of the six-stage model)
- Conv1D kernel size, padding, and channel dimensions: small fixed values, keeping the bridge lightweight
- No dilation, batch normalization, or layer normalization
- Initialization: He for Conv1D, Xavier for FC
- For GCAB (Cotogni et al., 2022):
- Mask embedding $E$: one learnable embedding per task, selected by a one-hot task vector
- Mask scaling factor $s$: dynamically set during training
- Most soft masks are shared across Q/K/V and MLP layers, minimizing additional parameters
- Regularization weight for the sparsity loss on new task masks
Both designs avoid unnecessary complexity and maintain computational efficiency, while yielding pronounced performance gains in their respective application domains.
6. Applications and Empirical Impact
CAB modules have demonstrated performance increases aligned with their efficient attention-driven fusion or selection:
- MALUNet improves over UNet on skin lesion segmentation in both mIoU and DSC, while dramatically reducing parameter count and computational cost, positioning CAB as an essential primitive in lightweight segmentation models (Ruan et al., 2022).
- GCAB enables exemplar-free class incremental training for Vision Transformers, achieving competitive results on datasets such as CIFAR-100, Tiny-ImageNet, and ImageNet100 without rehearsal. The gating mechanism facilitates plasticity towards new tasks while constraining catastrophic forgetting, with no requirement for test-time task identification and modest inference cost increase (limited to the last transformer block) (Cotogni et al., 2022).
These empirical results validate the architectural advantages of CAB and its variants for both dense prediction and continual learning scenarios.
7. Comparative and Methodological Context
CAB in encoder-decoder architectures and class-attention in transformers address different but related challenges—feature fusion and class-wise representation learning. Unlike generic additive skip connections, the Channel Attention Bridge applies coordinated per-stage attention, guided by both local (Conv1D) and global (FC) context. Transformer-based CAB leverages a dedicated class token and token-class interaction, and in GCAB, further incorporates task-specific gating for continual learning.
GCAB distinguishes itself from classical parameter-isolation techniques (PackNet, Piggyback, HAT) by employing shared weights and runtime-applied soft masks, which are learned and regularized to enforce sparsity and weight protection, respectively (Cotogni et al., 2022). This design enables flexible, scalable continual learning without explicit hard routing or known task identifiers.
By integrating these modules, networks achieve targeted feature selection, inter-stage feature fusion, and, when appropriately extended, robust task transfer without sacrificing architectural efficiency.