Through-Plane Attention Blocks (TAB)
- Through-Plane Attention Blocks (TAB) are specialized neural architecture components that use 1D self-attention to capture inter-slice dependencies in volumetric medical imaging.
- TAB transforms 4D tensor features via global average pooling and multi-head linear projections to efficiently compute depth-wise self-attention without the full 3D computational burden.
- Ablation results on a clinical dataset show that adding TAB to a (2+1)DUnet baseline raises PSNR from 41.49 dB to 42.89 dB, with sharper details and better restoration quality in low-dose CT denoising and deblurring.
Through-Plane Attention Blocks (TAB) constitute the through-plane branch (eMSM-T) of the efficient multi-head self-attention module (eMSM) in LIT-Former, a hybrid transformer-convolutional architecture for 3D low-dose CT image denoising and deblurring. TAB models inter-slice (through-plane) dependencies along the depth dimension of volumetric medical data, which gives effective representation learning without the computational burden of full 3D self-attention. This allows LIT-Former to address both in-plane denoising and through-plane deblurring for high-quality CT reconstruction in low-dose or rapid-acquisition regimes (Chen et al., 2023).
1. Core Architecture and Computation
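TAB processes a 4D feature map $X \in \mathbb{R}^{D \times H \times W \times C}$ from the previous block, where $D$ is the number of slices (depth), $H \times W$ is the in-plane size, and $C$ is the number of channels. The mechanism operates as follows:
- Global Average Pooling (GAP) over the in-plane dimensions collapses local spatial features into a depth-wise representation:
$$X_t = \mathrm{GAP}(X) \in \mathbb{R}^{D \times C}$$
- Multi-head Structure: $X_t$ is split along the channel axis into $h$ heads $X_t^{(1)}, \dots, X_t^{(h)}$, each of size $d = C/h$.
- Linear Projection: For each head $i$,
$$Q_i = X_t^{(i)} W_i^Q, \quad K_i = X_t^{(i)} W_i^K, \quad V_i = X_t^{(i)} W_i^V,$$
where $W_i^Q, W_i^K, W_i^V$ are learned linear transformations.
- Self-Attention Along Depth: The attention matrix over the $D$ slices is computed as
$$A_i = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\tau}\right) \in \mathbb{R}^{D \times D},$$
where $\tau$ is a scaling factor ($\sqrt{d}$ in standard scaled dot-product attention).
- Aggregation: Each head output is calculated by applying the attention to the value projection:
$$O_i = A_i V_i$$
- Concatenation and Output Projection: Heads are concatenated, projected by $W^O$, and the output tensor $\mathrm{Concat}(O_1, \dots, O_h)\, W^O \in \mathbb{R}^{D \times C}$ is reshaped back to $D \times 1 \times 1 \times C$.
The overall eMSM combines TAB (through-plane branch), in-plane attention (eMSM-I), and a residual path via element-wise summation:
$$\mathrm{eMSM}(X) = \text{eMSM-I}(X) + \text{eMSM-T}(X) + X$$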
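The steps above can be sketched in plain Python on nested lists. This is a minimal illustration, not the LIT-Former implementation: the function names (`gap`, `matmul`, `softmax_rows`, `tab_attention`), the depth-height-width-channel layout, the per-head weight matrices, and the $\sqrt{d}$ scaling are all assumptions made for the example.

```python
import math

def gap(x):
    """Average a [D][H][W][C] nested list over H and W, giving [D][C]."""
    pooled = []
    for slice_ in x:
        pixels = [px for row in slice_ for px in row]   # H*W channel vectors
        pooled.append([sum(ch) / len(pixels) for ch in zip(*pixels)])
    return pooled

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def tab_attention(x, wq, wk, wv, wo, heads):
    """Through-plane attention sketch.

    wq, wk, wv: one [d][d] matrix per head (assumed shapes); wo: [C][C].
    Returns the [D][C] output and the per-head [D][D] attention maps.
    """
    z = gap(x)                                   # [D][C] depth sequence
    d = len(z[0]) // heads                       # channels per head
    head_outputs, attention_maps = [], []
    for i in range(heads):
        zi = [row[i * d:(i + 1) * d] for row in z]            # [D][d]
        q, k, v = matmul(zi, wq[i]), matmul(zi, wk[i]), matmul(zi, wv[i])
        # Scaled dot products between every pair of slices (assumed sqrt(d) scaling).
        scores = [[sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d) for kr in k]
                  for qr in q]
        attn = softmax_rows(scores)                           # [D][D]
        attention_maps.append(attn)
        head_outputs.append(matmul(attn, v))                  # [D][d]
    # Concatenate heads channel-wise, then apply the output projection.
    concat = [[val for head in head_outputs for val in head[j]]
              for j in range(len(z))]
    return matmul(concat, wo), attention_maps
```

The returned `[D][C]` matrix corresponds to the TAB output before reshaping. In a full block it would presumably be broadcast over the in-plane dimensions when summed with the in-plane branch and the residual, which is an inference from the summation rather than something the text states.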
2. Mathematical Formulation and Processing Pipeline
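The following table summarizes the major tensor operations in TAB for an input $X \in \mathbb{R}^{D \times H \times W \times C}$ ($D$ slices, in-plane size $H \times W$, $C$ channels), with $h$ heads of width $d = C/h$ and attention scaling factor $\tau$:
| Stage | Operation | Output Shape |
|---|---|---|
| GAP over $H, W$ | $X_t = \mathrm{GAP}(X)$ | $D \times C$ |
| Linear projections | $Q_i = X_t^{(i)} W_i^Q,\ K_i = X_t^{(i)} W_i^K,\ V_i = X_t^{(i)} W_i^V$ | $D \times d$ (per head) |
| Self-attention | $A_i = \mathrm{Softmax}(Q_i K_i^{\top} / \tau)$ | $D \times D$ |
| Attention aggregation | $O_i = A_i V_i$ | $D \times d$ (per head) |
| Concatenation + output | $\mathrm{Concat}(O_1, \dots, O_h)\, W^O$ | $D \times C$ |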
TAB converts spatial-extent features into a depth sequence, so the attention core models long-range dependencies along the slice axis alone. This structure is pivotal for inter-slice context modeling without cubic scaling of token count.
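As a quick check on the stages listed in the table, the shapes can be traced with a small helper. The function name `tab_shapes`, the stage labels, and the depth-height-width-channel layout are assumptions for illustration only.

```python
def tab_shapes(depth, height, width, channels, heads):
    """Return the tensor shape after each TAB stage (assumed D x H x W x C layout)."""
    if channels % heads:
        raise ValueError("channels must be divisible by the number of heads")
    d = channels // heads  # channels per head
    return {
        "input": (depth, height, width, channels),
        "gap": (depth, channels),
        "q/k/v per head": (depth, d),
        "attention per head": (depth, depth),
        "aggregated per head": (depth, d),
        "concat + output projection": (depth, channels),
    }

# Example: 32 slices of 64 x 64 pixels, 48 channels, 8 heads.
for stage, shape in tab_shapes(32, 64, 64, 48, 8).items():
    print(f"{stage:28s} {shape}")
```

Note that only the input shape depends on the in-plane size; every later stage depends on depth and channels alone.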
3. Computational Efficiency and Design Rationale
Applying full 3D self-attention to a tensor of shape $D \times H \times W \times C$ (depth $D$, in-plane size $H \times W$, $C$ channels) entails $\mathcal{O}((DHW)^2)$ complexity due to global attention over all $DHW$ tokens. Decomposing the mechanism into decoupled in-plane (2D) and through-plane (1D) attention, with TAB responsible for the latter, reduces complexity to $\mathcal{O}((HW)^2 + D^2)$. This is because TAB computes only a $D \times D$ attention matrix, instead of the substantially larger $DHW \times DHW$ matrix required for full 3D attention (Chen et al., 2023).
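To make the gap concrete, the sketch below counts attention-matrix entries for each scheme. It assumes the in-plane branch forms a single $(HW) \times (HW)$ matrix and the through-plane branch a $D \times D$ matrix, and it ignores channel and head factors; the function name `attention_entries` and the example sizes are hypothetical.

```python
def attention_entries(depth, height, width):
    """Count attention-matrix entries (tokens squared) for each scheme."""
    tokens_3d = depth * height * width
    full_3d = tokens_3d ** 2               # (DHW) x (DHW) matrix
    in_plane = (height * width) ** 2       # assumed single (HW) x (HW) matrix
    through_plane = depth ** 2             # D x D matrix computed by TAB
    return {
        "full_3d": full_3d,
        "in_plane": in_plane,
        "through_plane": through_plane,
        "decoupled": in_plane + through_plane,
    }

counts = attention_entries(depth=32, height=64, width=64)
print(counts)
print("reduction factor:", counts["full_3d"] / counts["decoupled"])
```

For 32 slices of 64 × 64 pixels, the full 3D matrix has 17,179,869,184 entries while the decoupled scheme needs 16,778,240, of which TAB contributes only 1,024. That is roughly a thousandfold reduction under these assumptions.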
A comparison with 3D convolutional layers further illustrates efficiency: for kernel size $k$, factorized (2+1)D convolution reduces FLOPs from a cost proportional to $k^3$ per output element for standard 3D convolution to one proportional to $k^2 + k$, and similarly cuts parameter count, paralleling the computational savings achieved by TAB for self-attention operations.
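The convolution comparison can be worked through with a small cost model. This is a sketch under stated assumptions: stride 1 with same-size output, multiply-accumulate counts without bias terms, and a hypothetical intermediate channel count `c_mid` between the in-plane and through-plane convolutions. The function names are illustrative.

```python
def conv3d_cost(k, c_in, c_out, depth, height, width):
    """Parameters and multiply-accumulates of a k x k x k 3D convolution."""
    params = k ** 3 * c_in * c_out
    return params, params * depth * height * width

def conv2plus1d_cost(k, c_in, c_mid, c_out, depth, height, width):
    """Cost of a k x k in-plane convolution followed by a k-tap through-plane one."""
    params = k * k * c_in * c_mid + k * c_mid * c_out
    return params, params * depth * height * width

# Example with k = 3 and equal channel counts throughout.
p3, m3 = conv3d_cost(3, 16, 16, 8, 32, 32)
p21, m21 = conv2plus1d_cost(3, 16, 16, 16, 8, 32, 32)
print("parameter ratio:", p3 / p21)
print("MAC ratio:", m3 / m21)
```

With equal channel counts the ratio is $k^3 / (k^2 + k)$, which is 27/12 = 2.25 for $k = 3$. A different choice of `c_mid` changes the ratio.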
4. Parallel Fusion, Interaction with eCFN, and Network Flow
TAB (eMSM-T) operates in parallel with eMSM-I (in-plane attention), and their outputs are summed together with the residual connection. This fusion strategy contrasts with serial or cascaded arrangements and is empirically found to yield better restoration metrics. The fused output $Y$ is then passed to an efficient convolutional feed-forward network (eCFN), which includes a $k \times k$ in-plane convolution, a $k$-tap through-plane convolution (for kernel size $k$), and an identity mapping (or a pointwise $1 \times 1 \times 1$ convolution if needed).
The processing sequence is thus: $X \rightarrow Y = \text{eMSM-I}(X) + \text{eMSM-T}(X) + X \rightarrow \mathrm{eCFN}(Y)$
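The wiring can be sketched with the sub-modules treated as opaque callables on flat feature lists. The function names and the exact form of the cascaded variant are assumptions for illustration; the source only states that parallel summation was compared against a cascaded arrangement.

```python
def parallel_block(x, emsm_i, emsm_t, ecfn):
    """Parallel fusion: sum in-plane branch, through-plane branch, and residual."""
    fused = [a + b + r for a, b, r in zip(emsm_i(x), emsm_t(x), x)]
    return ecfn(fused)

def cascaded_block(x, emsm_i, emsm_t, ecfn):
    """Hypothetical cascaded variant: through-plane attention applied after in-plane."""
    inner = emsm_i(x)
    fused = [t + r for t, r in zip(emsm_t(inner), x)]
    return ecfn(fused)

# Toy stand-ins for the real sub-modules, chosen only to make the wiring visible.
double = lambda feats: [2 * v for v in feats]
shift = lambda feats: [v + 1 for v in feats]
identity = lambda feats: list(feats)

print(parallel_block([1.0, 2.0], double, shift, identity))
print(cascaded_block([1.0, 2.0], double, shift, identity))
```

In the parallel form both branches see the same input $X$, so neither branch depends on the other's output.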
5. Empirical Contributions and Ablation Findings
Ablation studies demonstrate that eMSM-T, i.e., TAB, confers greater quantitative gains than in-plane attention alone for 3D CT reconstruction tasks. On a clinical dataset (PSNR in dB; two SSIM variants are reported per configuration):
- (2+1)DUnet baseline: PSNR 41.49, RMSE 0.80, SSIM 97.49 / 97.06
- (2+1)DUnet + eMSM-I: PSNR 42.48, RMSE 0.70, SSIM 97.63 / 97.19
- (2+1)DUnet + eMSM-T: PSNR 42.89, RMSE 0.67, SSIM 97.72 / 97.28
- LIT-Former: PSNR 43.10, RMSE 0.65, SSIM 97.74 / 97.31
The isolated through-plane branch (eMSM-T) alone yields a 1.40 dB improvement in PSNR over the baseline, compared to 0.99 dB for eMSM-I. Parallel fusion outperforms cascaded fusion by 0.12 dB in PSNR, supporting the choice of the summing strategy.
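The PSNR gains over the baseline can be recomputed directly from the values listed above. The dictionary keys and the helper name `psnr_gains` are illustrative.

```python
# PSNR values (dB) from the clinical-dataset ablation listed above.
psnr = {
    "(2+1)DUnet baseline": 41.49,
    "(2+1)DUnet + eMSM-I": 42.48,
    "(2+1)DUnet + eMSM-T": 42.89,
    "LIT-Former": 43.10,
}

def psnr_gains(results, baseline_key):
    """Gain of each configuration over the baseline, rounded to 0.01 dB."""
    base = results[baseline_key]
    return {name: round(value - base, 2)
            for name, value in results.items() if name != baseline_key}

for name, gain in psnr_gains(psnr, "(2+1)DUnet baseline").items():
    print(f"{name}: +{gain:.2f} dB")
```

The through-plane branch alone recovers most of the full model's gain over the baseline (1.40 dB of 1.61 dB).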
Qualitative analysis links TAB to clearer details in coronal/sagittal slices, sharper edges, and improved intensity consistency across slices, especially critical for through-plane deblurring.
6. Context and Impact in 3D Medical Imaging
Prior approaches typically employ either 2D convolution/attention (neglecting inter-slice context) or full 3D architectures (with prohibitive computational and data requirements). TAB, as realized in LIT-Former, attains an effective tradeoff by modeling depth dependencies with 1D self-attention, while retaining computational tractability. This design is crucial for CT applications demanding high-quality volumetric reconstruction from low-dose or reduced-projection data. The architectural pattern demonstrated by TAB is likely extensible to other volumetric or sequential data domains where cross-slice or cross-frame dependencies are central and global 3D context is essential but full token-wise attention is infeasible (Chen et al., 2023).