Through-Plane Attention Blocks (TAB)

Updated 3 July 2026

Through-Plane Attention Blocks (TAB) are specialized neural architecture components that use 1D self-attention to capture inter-slice dependencies in volumetric medical imaging.
TAB transforms 4D tensor features via global average pooling and multi-head linear projections to efficiently compute depth-wise self-attention without the full 3D computational burden.
Ablation results reported for LIT-Former show that adding TAB to a (2+1)DUnet baseline raises PSNR by 1.40 dB (41.49 to 42.89) on a clinical dataset, with sharper details and better restoration quality in low-dose settings.

Through-Plane Attention Blocks (TAB) constitute the through-plane branch (eMSM-T) of the efficient multi-head self-attention module (eMSM) in LIT-Former, a transformer-convolutional hybrid architecture for 3D low-dose CT image denoising and deblurring. TAB is designed to model inter-slice (through-plane) dependencies along the depth dimension of volumetric medical data, so that depth-wise representations can be learned without the computational burden of full 3D self-attention. This allows LIT-Former to address both in-plane denoising and through-plane deblurring for high-quality CT reconstruction in low-dose or rapid-acquisition regimes (Chen et al., 2023).

1. Core Architecture and Computation

TAB processes a 4D tensor feature map $F_{l-1} \in \mathbb{R}^{C \times D \times H \times W}$ from the previous block. The mechanism operates as follows:

Global Average Pooling (GAP) over the in-plane dimensions $(H, W)$ collapses local spatial features into a depth-wise representation:

$X_{\mathrm{th}} = \operatorname{GAP}_{\mathrm{th}}(F_{l-1}) \in \mathbb{R}^{C \times D}$

Multi-head Structure: $X_{\mathrm{th}}$ is split along the channel axis into $h$ heads $X_{\mathrm{th}}^i \in \mathbb{R}^{d_k \times D}$, each of size $d_k = C/h$.
Linear Projection: For each head $i$,

$Q_{\mathrm{th}}^i = f_{\mathrm{th}}^Q(X_{\mathrm{th}}^i), \quad K_{\mathrm{th}}^i = f_{\mathrm{th}}^K(X_{\mathrm{th}}^i), \quad V_{\mathrm{th}}^i = f_{\mathrm{th}}^V(X_{\mathrm{th}}^i)$

where $f_{\mathrm{th}}^{(\cdot)}$ are learned linear transformations.

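Self-Attention Along Depth: The attention matrix over the $D$ slices is computed as scaled dot-product attention, with the softmax taken over each row:

$A_{\mathrm{th}}^i = \operatorname{softmax}\left(\frac{(Q_{\mathrm{th}}^i)^{\top} K_{\mathrm{th}}^i}{\sqrt{d_k}}\right) \in \mathbb{R}^{D \times D}$

Aggregation: Each head output is calculated by applying the attention to the value projection:

$O_{\mathrm{th}}^i = V_{\mathrm{th}}^i \, (A_{\mathrm{th}}^i)^{\top} \in \mathbb{R}^{d_k \times D}$

Concatenation and Output Projection: Heads are concatenated, projected by $f_{\mathrm{th}}^O$, and the output tensor $Y_{\mathrm{th}} \in \mathbb{R}^{C \times D}$ is reshaped back to $\mathbb{R}^{C \times D \times 1 \times 1}$.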
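The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the LIT-Former implementation: the function name `tab_forward`, the bias-free per-head weight matrices `Wq`, `Wk`, `Wv`, the output weight `Wo`, and the use of standard scaled dot-product attention are all choices made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tab_forward(F, Wq, Wk, Wv, Wo, num_heads):
    """Through-plane attention sketch (hypothetical helper).

    F:          feature map of shape (C, D, H, W)
    Wq, Wk, Wv: assumed per-head weights of shape (h, d_k, d_k), no bias
    Wo:         assumed output projection of shape (C, C)
    """
    C, D, _, _ = F.shape
    d_k = C // num_heads

    # Global average pooling over the in-plane axes (H, W) -> (C, D).
    X = F.mean(axis=(2, 3))

    # Split channels into heads -> (h, d_k, D).
    Xh = X.reshape(num_heads, d_k, D)

    # Per-head linear projections, each of shape (h, d_k, D).
    Q, K, V = Wq @ Xh, Wk @ Xh, Wv @ Xh

    # Slice-to-slice scores (h, D, D); each row is normalised over keys.
    A = softmax(Q.transpose(0, 2, 1) @ K / np.sqrt(d_k), axis=-1)

    # Weighted sum of value columns for every query slice -> (h, d_k, D).
    O = V @ A.transpose(0, 2, 1)

    # Concatenate heads, project, and reshape to a broadcastable 4D tensor.
    Y = Wo @ O.reshape(C, D)
    return Y.reshape(C, D, 1, 1), A

rng = np.random.default_rng(0)
C, D, H, W, h = 8, 5, 4, 4, 2
d_k = C // h
F = rng.standard_normal((C, D, H, W))
Wq, Wk, Wv = (rng.standard_normal((h, d_k, d_k)) for _ in range(3))
Wo = rng.standard_normal((C, C))

Y, A = tab_forward(F, Wq, Wk, Wv, Wo, h)
print(Y.shape, A.shape)  # (8, 5, 1, 1) (2, 5, 5)
```

Note that the attention matrix has only $D \times D$ entries per head, regardless of $H$ and $W$, because the in-plane extent is pooled away before any tokens are formed.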

The overall eMSM combines TAB (through-plane branch), in-plane attention (eMSM-I), and a residual path via element-wise summation:

$\hat{F}_{l} = \text{eMSM-T}(F_{l-1}) + \text{eMSM-I}(F_{l-1}) + F_{l-1}$
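The element-wise summation can be illustrated with NumPy broadcasting. The sketch below assumes the through-plane branch returns a tensor of shape $(C, D, 1, 1)$ and the in-plane branch returns one of shape $(C, 1, H, W)$; the in-plane shape and the helper name `fuse_emsm` are assumptions made for this example, and random arrays stand in for the real branch outputs.

```python
import numpy as np

def fuse_emsm(F_prev, through_out, in_out):
    # Residual + through-plane + in-plane; NumPy broadcasts the
    # singleton axes so every voxel receives both attention terms.
    return F_prev + through_out + in_out

C, D, H, W = 4, 3, 5, 5
rng = np.random.default_rng(0)
F_prev = rng.standard_normal((C, D, H, W))       # residual input
through_out = rng.standard_normal((C, D, 1, 1))  # stand-in for eMSM-T output
in_out = rng.standard_normal((C, 1, H, W))       # stand-in for eMSM-I output (assumed shape)

F_hat = fuse_emsm(F_prev, through_out, in_out)
print(F_hat.shape)  # (4, 3, 5, 5)
```

Under these assumptions, the value at voxel $(c, d, y, x)$ is the residual plus one depth-dependent term and one position-dependent term, which is how two low-dimensional branches can modulate the full 4D feature map.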

2. Mathematical Formulation and Processing Pipeline

The following table summarizes the major tensorial operations and flow in TAB:

Stage	Operation	Output Shape
GAP over $(H, W)$	$X_{\mathrm{th}} = \operatorname{GAP}_{\mathrm{th}}(F_{l-1})$	$C \times D$
Linear projections	$Q_{\mathrm{th}}^i, K_{\mathrm{th}}^i, V_{\mathrm{th}}^i = f_{\mathrm{th}}^{Q}(X_{\mathrm{th}}^i), f_{\mathrm{th}}^{K}(X_{\mathrm{th}}^i), f_{\mathrm{th}}^{V}(X_{\mathrm{th}}^i)$	$d_k \times D$ (per head)
Self-attention	$A_{\mathrm{th}}^i = \operatorname{softmax}\left((Q_{\mathrm{th}}^i)^{\top} K_{\mathrm{th}}^i / \sqrt{d_k}\right)$	$D \times D$
Attention aggregation	$O_{\mathrm{th}}^i = V_{\mathrm{th}}^i (A_{\mathrm{th}}^i)^{\top}$	$d_k \times D$ (per head)
Concatenation + output	$Y_{\mathrm{th}} = f_{\mathrm{th}}^{O}(\operatorname{Concat}(O_{\mathrm{th}}^1, \dots, O_{\mathrm{th}}^h))$	$C \times D$

TAB converts features with full spatial extent into a sequence of $D$ depth tokens, so the transformer core models long-range dependencies along the slice axis alone. This is what allows inter-slice context to be modeled without the token count growing as $D \cdot H \cdot W$.

3. Computational Efficiency and Design Rationale

Applying full 3D self-attention to a tensor of shape $C \times D \times H \times W$ entails $\mathcal{O}((DHW)^2)$ complexity due to global attention over all $DHW$ tokens. Decomposing the mechanism into decoupled in-plane (2D) and through-plane (1D) attention, with TAB responsible for the latter, reduces complexity to $\mathcal{O}((HW)^2 + D^2)$. This is because TAB computes only a $D \times D$ attention matrix, instead of the substantially larger $DHW \times DHW$ matrix required for full 3D attention (Chen et al., 2023).
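A quick count makes the gap concrete. The sketch below uses the number of pairwise attention scores as a proxy for cost, assuming full 3D attention scores all $DHW$ tokens against each other while the decoupled design scores $HW$ in-plane tokens and $D$ through-plane tokens separately. The volume size and the function name `attention_entries` are illustrative choices, not values from the paper.

```python
def attention_entries(D, H, W):
    """Number of pairwise attention scores under each scheme (per head)."""
    full_3d = (D * H * W) ** 2      # every voxel attends to every voxel
    in_plane = (H * W) ** 2         # 2D attention over in-plane positions
    through_plane = D ** 2          # 1D attention over slices (TAB)
    return full_3d, in_plane + through_plane, through_plane

full, decoupled, tab_only = attention_entries(D=32, H=64, W=64)
print(f"full 3D:   {full:,}")
print(f"decoupled: {decoupled:,}")
print(f"TAB only:  {tab_only:,}")
print(f"reduction: {full / decoupled:.1f}x")
```

For this hypothetical $32 \times 64 \times 64$ volume, the through-plane branch contributes only $32^2 = 1024$ scores, so nearly all of the remaining cost sits in the in-plane branch.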

A comparison with 3D convolutional layers further illustrates efficiency: for kernel size $k$ and $C$ input and output channels, factorized (2+1)D convolution reduces FLOPs from $k^3 C^2 DHW$ for standard 3D convolution to $(k^2 + k) C^2 DHW$, and similarly cuts parameter count, paralleling the computational savings achieved by TAB for self-attention operations.
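The convolutional comparison can be checked with a short multiply-accumulate count. This sketch assumes stride 1, "same" padding, no bias, and an intermediate channel width equal to the output width for the factorized case; the helper names and the example sizes are hypothetical.

```python
def conv3d_macs(c_in, c_out, k, D, H, W):
    # One k x k x k kernel per (input, output) channel pair at every voxel.
    return k ** 3 * c_in * c_out * D * H * W

def conv2plus1d_macs(c_in, c_out, k, D, H, W, c_mid=None):
    # A 1 x k x k in-plane conv followed by a k x 1 x 1 through-plane conv.
    c_mid = c_out if c_mid is None else c_mid  # assumed intermediate width
    voxels = D * H * W
    in_plane = k * k * c_in * c_mid * voxels
    through_plane = k * c_mid * c_out * voxels
    return in_plane + through_plane

full = conv3d_macs(64, 64, 3, 32, 64, 64)
factorized = conv2plus1d_macs(64, 64, 3, 32, 64, 64)
print(full, factorized, full / factorized)
```

With $k = 3$ and equal channel widths the ratio is $27 / 12 = 2.25$, independent of the volume size.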

4. Parallel Fusion, Interaction with eCFN, and Network Flow

TAB (eMSM-T) operates in parallel with eMSM-I (in-plane attention), and their outputs are summed together with the residual connection. This fusion strategy contrasts with serial or cascaded arrangements and is empirically found to yield better restoration metrics. The fused output $\hat{F}_{l}$ is then passed to an efficient convolutional feed-forward network (eCFN), which includes a $1 \times 3 \times 3$ in-plane convolution, a $3 \times 1 \times 1$ through-plane convolution, and identity mapping (or $1 \times 1 \times 1$ conv if needed).

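The processing sequence is thus: $F_{l-1} \rightarrow \{\text{eMSM-I} \parallel \text{eMSM-T}\} + \text{residual} \rightarrow \hat{F}_{l} \rightarrow \text{eCFN} \rightarrow F_{l}$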

5. Empirical Contributions and Ablation Findings

Ablation studies demonstrate that eMSM-T, i.e., TAB, confers greater quantitative gains than in-plane attention alone for 3D CT reconstruction tasks. On a clinical dataset:

(2+1)DUnet baseline: PSNR 41.49, RMSE 0.80, SSIM 97.49 / 97.06 (two SSIM variants are reported per model)
(2+1)DUnet + eMSM-I: PSNR 42.48, RMSE 0.70, SSIM 97.63 / 97.19
(2+1)DUnet + eMSM-T: PSNR 42.89, RMSE 0.67, SSIM 97.72 / 97.28
LIT-Former: PSNR 43.10, RMSE 0.65, SSIM 97.74 / 97.31

The isolated through-plane branch (eMSM-T) alone yields a +1.40 dB improvement in PSNR over the (2+1)DUnet baseline, compared to +0.99 dB for eMSM-I. Parallel fusion outperforms cascaded fusion by 0.12 dB in PSNR, supporting the choice of parallel summation.
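The quoted gains follow directly from the ablation figures listed above. A small check, using only the PSNR and RMSE values from that list (the dictionary layout and variable names are illustrative):

```python
# PSNR (dB) and RMSE values copied from the ablation list.
results = {
    "(2+1)DUnet":          {"psnr": 41.49, "rmse": 0.80},
    "(2+1)DUnet + eMSM-I": {"psnr": 42.48, "rmse": 0.70},
    "(2+1)DUnet + eMSM-T": {"psnr": 42.89, "rmse": 0.67},
    "LIT-Former":          {"psnr": 43.10, "rmse": 0.65},
}

baseline = results["(2+1)DUnet"]

# PSNR gain and relative RMSE reduction of each variant over the baseline.
gains = {
    name: {
        "psnr_gain_db": round(r["psnr"] - baseline["psnr"], 2),
        "rmse_drop_pct": round(100 * (baseline["rmse"] - r["rmse"]) / baseline["rmse"], 2),
    }
    for name, r in results.items()
    if name != "(2+1)DUnet"
}

for name, g in gains.items():
    print(f"{name}: +{g['psnr_gain_db']:.2f} dB PSNR, -{g['rmse_drop_pct']:.2f}% RMSE")
```

The same arithmetic gives a 1.61 dB PSNR gain for the full LIT-Former over the baseline.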

Qualitative analysis links TAB to clearer details in coronal/sagittal slices, sharper edges, and improved intensity consistency across slices, especially critical for through-plane deblurring.

6. Context and Impact in 3D Medical Imaging

Prior approaches typically employ either 2D convolution/attention (neglecting inter-slice context) or full 3D architectures (with prohibitive computational and data requirements). TAB, as realized in LIT-Former, attains an effective tradeoff by modeling depth dependencies with 1D self-attention, while retaining computational tractability. This design is crucial for CT applications demanding high-quality volumetric reconstruction from low-dose or reduced-projection data. The architectural pattern demonstrated by TAB is likely extensible to other volumetric or sequential data domains where cross-slice or cross-frame dependencies are central and global 3D context is essential but full token-wise attention is infeasible (Chen et al., 2023).

References

Chen et al. (2023). LIT-Former: Linking In-plane and Through-plane Transformers for Simultaneous CT Image Denoising and Deblurring.