Class Attention Block (CAB)

Updated 2 April 2026
  • CAB is a neural network module that combines feature aggregation with attention-based reweighting for improved feature extraction.
  • It enables efficient multi-stage channel fusion in encoder-decoder models and refines class tokens in transformer-based systems.
  • Empirical results show enhanced segmentation accuracy, reduced parameter cost, and effective continual learning through gated mechanisms.

A Class Attention Block (CAB) is a neural network module that operates on either channel or class-token dimensions and is designed to improve joint feature aggregation and attention-based reweighting. CAB modules are leveraged in diverse architectures, including U-shaped models for dense prediction and Vision Transformers for class-token-based representation learning. The two primary paradigms are: Channel Attention Bridge (CAB) for multi-stage channel fusion in encoder-decoder models (Ruan et al., 2022) and Class-Attention Block (CAB) for global class-token refinement in Transformer models (Cotogni et al., 2022). Both variants focus on extracting and combining the most informative features across input channels or tokens and modulating the information flow to subsequent network stages.

1. Channel Attention Bridge (CAB) in Encoder-Decoder Architectures

The Channel Attention Bridge (CAB) is designed for efficient multi-stage feature fusion in U-shaped architectures, such as MALUNet for medical image segmentation. In a six-stage encoder-decoder model, CAB sits in the skip-connection path between encoder and decoder, aggregating encoded features from stages 1 to 5, whose channel widths increase as $\{8, 16, 24, 32, 48\}$ (Ruan et al., 2022).

Given per-stage feature maps $t_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, CAB executes the following steps:

  1. Channel Descriptor Extraction: Each $t_i$ is globally average pooled to $t_i' \in \mathbb{R}^{C_i}$.
  2. Multi-Stage Channel Fusion: All channel descriptors are concatenated: $T = \mathrm{Concat}(t_1', \dots, t_5') \in \mathbb{R}^{128}$.
  3. Local Correlation via 1D Convolution: $T$ is passed through a 1D convolution, yielding $T' = \mathrm{Conv1D}(T)$.
  4. Stage-Wise Global Gating: For each stage $i$, a stage-specific fully connected layer is applied, yielding the attention vector $Att_i = \sigma(\mathrm{FC}_i(T')) \in (0,1)^{C_i}$, with $\sigma$ the element-wise sigmoid.
  5. Channel-Wise Reweighting and Residual Addition: The attention vector is broadcast over the spatial dimensions and used for channel-wise scaling with a residual connection: $\hat{t}_i = Att_i \otimes t_i + t_i$.
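The five steps above can be sketched as a small PyTorch module. This is a minimal illustration under the stage widths given earlier; the class and variable names are ours, not from the MALUNet code, and the kernel size is an assumption:

```python
import torch
import torch.nn as nn

class ChannelAttentionBridge(nn.Module):
    """Sketch of a multi-stage channel attention bridge (CAB-style)."""
    def __init__(self, channels=(8, 16, 24, 32, 48), kernel_size=3):
        super().__init__()
        total = sum(channels)  # 128 for the MALUNet stage widths
        # 1D conv over the concatenated channel descriptors (local fusion)
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2)
        # One FC head per stage: fused descriptor -> per-stage attention logits
        self.fcs = nn.ModuleList([nn.Linear(total, c) for c in channels])

    def forward(self, feats):
        # feats: list of tensors t_i with shape (B, C_i, H_i, W_i)
        descs = [f.mean(dim=(2, 3)) for f in feats]    # GAP -> (B, C_i)
        fused = torch.cat(descs, dim=1).unsqueeze(1)   # (B, 1, sum C_i)
        fused = self.conv(fused).squeeze(1)            # (B, sum C_i)
        out = []
        for f, fc in zip(feats, self.fcs):
            att = torch.sigmoid(fc(fused))             # (B, C_i), in (0, 1)
            att = att[:, :, None, None]                # broadcast over H, W
            out.append(f * att + f)                    # reweight + residual
        return out
```

Each output tensor keeps the shape of its input feature map, so the module drops directly into the skip-connection path.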

This mechanism ensures each decoder block receives an adaptively fused and selectively reweighted collection of encoder features, improving the expressivity-to-parameter ratio in compute-constrained settings. CAB employs a lightweight single-channel 1D convolution with a small kernel and no dilation, together with per-stage fully connected layers that map the fused descriptor back to each stage's channel count ($\mathrm{FC}_i: \mathbb{R}^{128} \to \mathbb{R}^{C_i}$), without normalization layers. All parameters are initialized using standard procedures (He initialization for Conv1D, Xavier for FC).

2. Mathematical Formulation of CAB in MALUNet

The CAB module in MALUNet (Ruan et al., 2022) is defined mathematically by equations (6)-(10):

$t_i' = \mathrm{GAP}(t_i)$ (6)
$T = \mathrm{Concat}(t_1', \dots, t_5')$ (7)
$T' = \mathrm{Conv1D}(T)$ (8)
$Att_i = \sigma(\mathrm{FC}_i(T'))$ (9)
$\hat{t}_i = Att_i \otimes t_i + t_i$ (10)

where $i = 1, \dots, 5$ indexes the encoder stages and $\sum_i C_i = 128$. $\mathrm{GAP}$ is global average pooling; $\mathrm{Conv1D}$ and $\mathrm{FC}_i$ are learnable; $\sigma$ is the element-wise sigmoid.

Local context fusion occurs via Conv1D, enabling short-range channel interaction among stages, while the FC layers provide stage-specific global attention. The design avoids normalization and maintains low parameter cost: a handful of weights for the Conv1D and on the order of $10^4$ for the stage-specific FC layers.
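As a quick sanity check on this parameter budget, using the stage widths from above (the kernel size here is an illustrative assumption, not a value from the paper):

```python
channels = (8, 16, 24, 32, 48)
total = sum(channels)             # 128-element fused descriptor
k = 3                             # illustrative Conv1D kernel size
conv1d_params = k + 1             # k weights + 1 bias for a 1-in/1-out conv
fc_params = sum(total * c + c for c in channels)  # weights + biases per stage
print(conv1d_params, fc_params)   # 4 16512
```

The FC layers dominate, but even they total only about 16K parameters, negligible next to the convolutional backbone.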

3. Standard Class-Attention Block in Vision Transformers

The Class-Attention Block (CAB) in transformer-based architectures (e.g., CaiT) incorporates a learnable class token $x_{\mathrm{cls}} \in \mathbb{R}^{d}$ alongside patch tokens $X \in \mathbb{R}^{N \times d}$, forming an augmented sequence $Z = [x_{\mathrm{cls}}; X]$ (Cotogni et al., 2022).

The standard attention mechanism proceeds as follows:

  • Linear projections for query ($Q = W_q\, x_{\mathrm{cls}}$), key ($K = W_k Z$), and value ($V = W_v Z$); only the class token issues a query.
  • Attention weights: $A = \mathrm{Softmax}\!\left(QK^{\top} / \sqrt{d/h}\right)$, where $d$ is the embedding dimension and $h$ is the number of heads.
  • Attended class-token update: $x_{\mathrm{cls}} \leftarrow x_{\mathrm{cls}} + W_o (A V)$.
  • MLP refinement and residuals: the class token is further updated through an MLP and residual addition: $x_{\mathrm{cls}} \leftarrow x_{\mathrm{cls}} + \mathrm{MLP}(x_{\mathrm{cls}})$; all projections and MLPs are learnable.
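A minimal PyTorch sketch of this class-attention step follows. Dimensions and layer names are illustrative defaults, not values from CaiT; the key point is that only the class token forms a query, so attention cost is linear in the number of patch tokens:

```python
import torch
import torch.nn as nn

class ClassAttentionBlock(nn.Module):
    """Sketch of CaiT-style class attention: only the class token attends."""
    def __init__(self, dim=192, heads=4, mlp_ratio=4):
        super().__init__()
        self.heads = heads
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, cls_tok, patches):
        # cls_tok: (B, 1, D); patches: (B, N, D)
        z = torch.cat([cls_tok, patches], dim=1)       # augmented sequence Z
        B, L, D = z.shape
        h, d = self.heads, D // self.heads
        zn = self.norm1(z)
        q = self.q(zn[:, :1]).view(B, 1, h, d).transpose(1, 2)  # class query
        k = self.k(zn).view(B, L, h, d).transpose(1, 2)
        v = self.v(zn).view(B, L, h, d).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * d ** -0.5    # (B, h, 1, L)
        att = att.softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, 1, D)
        cls_tok = cls_tok + self.proj(out)                  # attention residual
        cls_tok = cls_tok + self.mlp(self.norm2(cls_tok))   # MLP residual
        return cls_tok
```

The block returns an updated class token of the same shape, ready for a classification head.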

CAB thereby aggregates global image representations via direct attention between class and patch tokens, which are pivotal for classification tasks and task-specific transfer in continual learning.

4. Gated Class-Attention Block for Continual Learning

The Gated Class-Attention Block (GCAB) extends the standard transformer CAB to address catastrophic forgetting in exemplar-free continual learning (Cotogni et al., 2022). GCAB introduces task-specific soft masks that gate both forward activations and parameter gradients.

Key mechanisms:

  1. Soft Mask Parametrization: For each task $t$, the mask $m_t \in (0,1)^{d}$ is defined by $m_t = \sigma(s \cdot e_t)$, where $e_t$ is a learnable embedding, $s$ a scaling factor, and a one-hot task selector picks the embedding for the current task.
  2. Gated Attention Application: All Q/K/V projections and MLP activations are modulated, for example $Q_t = Q \odot m_t$, $K_t = K \odot m_t$, and so on.
  3. Sparsity Regularization: The cumulative mask $m_{\le t}$ is updated by $m_{\le t} = \max(m_{\le t-1}, m_t)$. A loss term encourages sparsity in new masks except at already-allocated capacity.
  4. Gradient Masking/Weight Protection: Gradients are masked during backpropagation to preserve weights used by previous tasks: $g' = g \odot (1 - m_{\le t-1})$, where $g$ is the gradient of a gated weight.

GCAB operates at the final transformer block, enabling task-specific activation patterns and selective plasticity. At inference, all stored task-specific masks are sequentially applied, and outputs concatenated, obviating the need for task-ID during test time. This approach distinguishes GCAB from hard parameter-isolation approaches, as it leverages shared weights and soft task gating.

5. Hyper-parameters and Implementation Details

CAB implementations are distinguished by carefully selected hyper-parameters:

  • For MALUNet CAB (Ruan et al., 2022):
    • Number of encoder stages fused: 5
    • Conv1D kernel size: a small odd kernel, with padding chosen to preserve the descriptor length
    • Conv1D input/output length: 128 (the concatenated channel descriptor)
    • No dilation, batch normalization, or layer normalization
    • Initialization: He for Conv1D, Xavier for FC
  • For GCAB (Cotogni et al., 2022):
    • One learnable mask embedding per task for each gated layer
    • Mask scaling factor $s$ dynamically set during training
    • Most soft masks are shared across Q/K/V and MLP layers, minimizing additional parameters
    • A regularization coefficient controls the sparsity penalty

Both designs avoid unnecessary complexity and maintain computational efficiency, while yielding pronounced performance gains in their respective application domains.

6. Applications and Empirical Impact

CAB modules have demonstrated performance increases aligned with their efficient attention-driven fusion or selection:

  • MALUNet improves over UNet on skin lesion segmentation, increasing both mIoU and DSC while dramatically reducing parameter count and computational cost, positioning CAB as an essential primitive in lightweight segmentation models (Ruan et al., 2022).
  • GCAB enables exemplar-free class incremental training for Vision Transformers, achieving competitive results on datasets such as CIFAR-100, Tiny-ImageNet, and ImageNet100 without rehearsal. The gating mechanism facilitates plasticity towards new tasks while constraining catastrophic forgetting, with no requirement for test-time task identification and modest inference cost increase (limited to the last transformer block) (Cotogni et al., 2022).

These empirical results validate the architectural advantages of CAB and its variants for both dense prediction and continual learning scenarios.

7. Comparative and Methodological Context

CAB in encoder-decoder architectures and class-attention in transformers address different but related challenges—feature fusion and class-wise representation learning. Unlike generic additive skip connections, the Channel Attention Bridge applies coordinated per-stage attention, guided by both local (Conv1D) and global (FC) context. Transformer-based CAB leverages a dedicated class token and token-class interaction, and in GCAB, further incorporates task-specific gating for continual learning.

GCAB distinguishes itself from classical parameter-isolation techniques (PackNet, Piggyback, HAT) by employing shared weights and runtime-applied soft masks, which are learned and regularized to enforce sparsity and weight protection, respectively (Cotogni et al., 2022). This design enables flexible, scalable continual learning without explicit hard routing or known task identifiers.

By integrating these modules, networks achieve targeted feature selection, inter-stage information fusion, and, when appropriately extended, robust task transfer without sacrificing architectural efficiency.
