MaxViT: Hierarchical Multi-Axis Vision Backbone

Updated 14 November 2025
  • MaxViT is a hierarchical vision backbone that fuses local window and global grid sparse self-attention with convolutional modules for scalable, efficient feature extraction.
  • It integrates MobileNetV2-style MBConv and Squeeze-and-Excitation modules to achieve state-of-the-art performance on tasks like ImageNet classification, COCO detection, and clinical stroke CT analysis.
  • Its design achieves linear computational complexity with respect to image size while fusing local and global features effectively, and it is compatible with explainable AI methods such as Grad-CAM++.

MaxViT is a hierarchical vision backbone that integrates multi-axis (local and global) sparse self-attention with convolutional operators, designed to combine scalable receptive fields, local–global spatial feature fusion, and efficient computation for a range of computer vision tasks, including image classification, detection, segmentation, and image-based medical diagnostics. The central architectural novelty of MaxViT is its sequential composition of windowed local self-attention and grid-based global self-attention within each stage, yielding linear time and memory complexity with respect to input resolution, while maintaining global context propagation from early layers. MaxViT’s blocks are further augmented with MobileNetV2-style MBConv and Squeeze-and-Excitation (SE) modules, producing a repeatable, modular unit for hierarchical multi-stage feature extraction. Empirical evaluations demonstrate MaxViT achieves state-of-the-art or competitive results on large-scale benchmarks such as ImageNet-1K, COCO detection, AVA image aesthetics, and even unconditional GAN generation (Tu et al., 2022). Its clinical relevance has been validated in medical imaging, outperforming prior approaches in multiclass stroke classification on CT scans, especially when coupled with targeted data augmentation and explainable AI (Qari et al., 13 Jul 2025).

1. Multi-Axis Attention: Local and Global Sparse Self-Attention

MaxViT’s fundamental mechanism, multi-axis attention, consists of two sequential sparse self-attention operations per block:

  1. Blocked Local (Window) Attention: The feature map X ∈ ℝ^{H×W×C} is partitioned into non-overlapping P×P windows. Within each window, standard multi-head self-attention is computed:

A_w = \mathrm{Softmax}\left(\frac{Q_w K_w^\top}{\sqrt{d}} + B_{\mathrm{loc}}\right) V_w

where Q_w, K_w, V_w are linear projections of the reshaped window tokens, and B_loc is a learnable relative positional bias.

  2. Dilated Global (Grid) Attention: The same input X is overlaid with a G×G grid, slicing the feature map into coarse grid tokens. Attention is computed globally over these grid tokens:

A_g = \mathrm{Softmax}\left(\frac{Q_g K_g^\top}{\sqrt{d}} + B_{\mathrm{glob}}\right) V_g

This operation communicates information across distant spatial locations efficiently.

The sequential application of window and grid attention within each residual block ensures both fine-grained locality and unrestricted long-range context at every network depth, including early high-resolution stages.
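
The two partitioning schemes can be written compactly with einops-style reshapes. The sketch below is illustrative only; the tensor layout and helper names are assumptions of this sketch, not code from the paper:

from einops import rearrange
import torch

def block_partition(x, P=7):
    # Each non-overlapping P x P window becomes one attention sequence (local mixing).
    return rearrange(x, 'b (hb p1) (wb p2) c -> (b hb wb) (p1 p2) c', p1=P, p2=P)

def grid_partition(x, G=7):
    # A fixed G x G grid: tokens in one sequence are spread uniformly across the
    # whole feature map, giving sparse global (dilated) mixing.
    return rearrange(x, 'b (g1 hg) (g2 wg) c -> (b hg wg) (g1 g2) c', g1=G, g2=G)

x = torch.randn(2, 56, 56, 64)      # (batch, H, W, C)
print(block_partition(x).shape)     # (128, 49, 64): 8x8 windows of 49 tokens each
print(grid_partition(x).shape)      # (128, 49, 64): 49-token grids over 8x8 positions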

2. Architectural Composition and Building Blocks

Each MaxViT block comprises the following ordered submodules:

  1. Mobile ConvNet (MBConv) with Squeeze-and-Excitation: Expands channels, applies depthwise convolution, SE attention, and projects back.
  2. Block Attention + MLP: Local window attention with LayerNorm followed by a two-layer MLP and residual connection.
  3. Grid Attention + MLP: Global grid attention with LayerNorm, MLP, and residual skip.

Formally, the block transformation is:

\begin{aligned}
x &\leftarrow \mathrm{MBConv}(x) \\
x &\leftarrow x + \mathrm{Unblock}(f_{\mathrm{blk}}(\mathrm{Block}(\mathrm{LN}(x)))) \\
x &\leftarrow x + f_{\mathrm{mlp}}(\mathrm{LN}(x)) \\
x &\leftarrow x + \mathrm{Ungrid}(f_{\mathrm{blk}}(\mathrm{Grid}(\mathrm{LN}(x)))) \\
x &\leftarrow x + f_{\mathrm{mlp}}(\mathrm{LN}(x))
\end{aligned}

where f_blk(·) is the core relative attention kernel, and f_mlp(·) denotes the two-layer feed-forward module with GeLU activation.
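
The MBConv-with-SE submodule in step 1 can be sketched as a small PyTorch module. This is a minimal illustration of the expand / depthwise-convolve / SE-gate / project pattern described above; the exact normalization placement and whether the SE bottleneck is derived from the expanded or input width are assumptions of this sketch, not the official implementation.

import torch
from torch import nn

class MBConvSE(nn.Module):
    # Sketch only: expansion factor 4 and SE ratio 0.25 follow the values reported
    # for MaxViT; the layer ordering here is this sketch's own choice.
    def __init__(self, channels, expansion=4, se_ratio=0.25):
        super().__init__()
        hidden = channels * expansion
        squeezed = max(1, int(hidden * se_ratio))
        self.expand = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, hidden, 1), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.BatchNorm2d(hidden), nn.GELU())
        self.se = nn.Sequential(                        # squeeze-and-excitation gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, squeezed, 1), nn.GELU(),
            nn.Conv2d(squeezed, hidden, 1), nn.Sigmoid())
        self.project = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):                               # x: (B, C, H, W)
        h = self.dwconv(self.expand(x))
        h = h * self.se(h)                              # channel-wise reweighting
        return x + self.project(h)                      # residual connection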

3. Hierarchical Backbone and Scaling

MaxViT adopts a standard 4-stage hierarchical structure, with each stage halving the spatial resolution and doubling channel width:

  • Stem: Two 3×3 convolutions.
  • Stages S1–S4: Each defined by output stride, number of blocks, and output channels. Example for MaxViT-B: B₁=2, C₁=96; B₂=6, C₂=192; B₃=14, C₃=384; B₄=2, C₄=768 (see the configuration sketch after this list).
  • Downsampling: Performed by 3×3 convolutions with stride 2, followed by LayerNorm.
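
A minimal configuration sketch for MaxViT-B, using only the block counts and channel widths listed above; the dictionary layout and key names are this sketch's own convention, not an official API:

maxvit_b_config = {
    "stages": [
        {"name": "S1", "num_blocks": 2,  "channels": 96},
        {"name": "S2", "num_blocks": 6,  "channels": 192},
        {"name": "S3", "num_blocks": 14, "channels": 384},
        {"name": "S4", "num_blocks": 2,  "channels": 768},
    ],
    "window_size": 7,                      # P, block (window) attention
    "grid_size": 7,                        # G, grid attention
    "downsample": "3x3 conv, stride 2",    # applied at each stage boundary
}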

Global context propagation is established in the earliest layers via grid attention, which is empirically critical for accuracy: removing early grid attention produces a ~0.6% drop in ImageNet top-1 accuracy (Tu et al., 2022).

Complexity and Memory

If N = H×W, with block size P and grid size G:

  • Block attention: O(N·P²)
  • Grid attention: O(N·G²)
  • Total: O(N·(P² + G²)), linear in image size. Memory usage scales similarly, enabling application to high-resolution visual tasks.
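
A quick back-of-the-envelope comparison makes the scaling concrete; the resolution below is chosen for illustration, with P = G = 7 as in the default configuration:

# Attention cost at an illustrative 224x224 feature map (counts of pairwise scores).
H = W = 224
P = G = 7
N = H * W                         # 50,176 tokens

multi_axis = N * (P**2 + G**2)    # block + grid attention: ~4.9 million
full_attention = N * N            # dense global attention: ~2.5 billion

print(f"multi-axis: {multi_axis:,}  vs  full: {full_attention:,}")
# Multi-axis cost grows linearly with N; full self-attention grows quadratically.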

4. Performance Metrics and Applications

Empirical Evaluation

Across standard computer vision tasks, MaxViT achieves:

  • ImageNet-1K Top-1 (no extra data):
    • MaxViT-T (31M params): 83.6%
    • MaxViT-B (120M params): 84.95%
    • Fine-tuning at higher resolution yields up to 86.70% (MaxViT-L, 512x512).
  • Object detection (COCO, Cascade Mask R-CNN, 896×896):
    • MaxViT-B: AP_box = 53.4, AP_mask = 45.7
  • Image aesthetics (AVA):
    • MaxViT-T @ 512: PLCC 0.745, SRCC 0.708
  • Unconditional GAN (ImageNet-1K @128):
    • MaxViT-GAN (18.6M): FID 30.77, IS 22.58

Clinical Deployment: Stroke CT Classification

A domain-specific deployment on 2D brain CT for multiclass stroke detection achieved the following (Qari et al., 13 Jul 2025):

  • Dataset: 6,650 PNG slices; 3 classes (normal, ischemic, hemorrhagic).
  • Augmentation: cGAN-generated synthetic samples + classical transforms.
  • Pipeline: 224×224×3 input, convolutional stem, four-stage MaxViT, global pooling, 3-way classification.
  • Optimization: Adam, initial LR 3×10⁻⁴, batch size 32, dropout 0.05, weighted cross-entropy.
  • Test metrics (MaxViT + cGAN): Accuracy 98.00%, F1 98.00%.
  • Confusion matrix (N≈1,330): Strong classwise separation; e.g., 866/885 true normals correctly classified.

This performance eclipsed baselines, including a ViT baseline (F1 90%) and a prior EfficientNet-B0 + SVM pipeline (F1 94.94%).
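
A minimal PyTorch sketch of the reported optimization setup (Adam at LR 3×10⁻⁴, batch size 32, weighted cross-entropy); the model constructor, dataset, and class-weight values here are placeholders for illustration, not taken from the study:

import torch
from torch import nn
from torch.utils.data import DataLoader

model = build_maxvit_classifier(num_classes=3, dropout=0.05)       # hypothetical helper
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # train_set assumed

# Weighted cross-entropy to counter class imbalance; the weights are illustrative.
class_weights = torch.tensor([1.0, 1.2, 1.5])
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

model.train()
for images, labels in train_loader:            # images: (32, 3, 224, 224)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()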

5. Explainability and Trust: Integration of Grad-CAM++

Explainable AI was implemented via Grad-CAM++ on MaxViT’s convolutional and attention blocks. For predicted class c and activation map A^k, pixelwise weights α_ij^k are calculated as in Grad-CAM++:

\alpha_{ij}^k = \frac{\dfrac{\partial^2 y^c}{(\partial A_{ij}^k)^2}}{2\dfrac{\partial^2 y^c}{(\partial A_{ij}^k)^2} + \sum_{a,b} A_{ab}^k\,\dfrac{\partial^3 y^c}{(\partial A_{ab}^k)^3}}

The final saliency map is:

L^{c}_{\mathrm{Grad\text{-}CAM++}} = \mathrm{ReLU}\left(\sum_k w_k^c\, A^k\right), \qquad w_k^c = \sum_{i,j} \alpha^k_{ij}\,\mathrm{ReLU}\left(\frac{\partial y^c}{\partial A_{ij}^k}\right)

Empirically, deeper layers of MaxViT produce sharply localized attention “hotspots” over pathologic tissue, supporting clinical interpretability and trust in AI decisions.
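
The two formulas translate directly into a short routine. The sketch below uses the common Grad-CAM++ simplification in which y^c is taken as the exponential of the class score, so the higher-order derivatives reduce to element-wise powers of the first-order gradient; obtaining the activations and gradients (e.g., via forward/backward hooks on the chosen MaxViT layer) is assumed:

import torch
import torch.nn.functional as F

def grad_cam_pp(activations, gradients):
    # activations, gradients: (K, H, W) feature maps of the target layer and the
    # gradients of the class score with respect to them.
    grad2 = gradients ** 2
    grad3 = gradients ** 3
    # Denominator of alpha_ij^k; the sum over spatial positions is taken per channel k.
    denom = 2.0 * grad2 + (activations * grad3).sum(dim=(1, 2), keepdim=True)
    alpha = grad2 / torch.where(denom != 0, denom, torch.ones_like(denom))
    weights = (alpha * F.relu(gradients)).sum(dim=(1, 2))              # w_k^c
    cam = F.relu((weights[:, None, None] * activations).sum(dim=0))    # L^c
    return cam / (cam.max() + 1e-8)                                    # normalized saliency map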

6. Advantages, Limitations, and Prospective Directions

Advantages

  • Global context from early layers: Ensures maximum spatial coverage throughout the network, without quadratic scaling.
  • Strong local-global feature fusion: Essential for medical imaging, where both fine texture and global structure are diagnostic.
  • Robust performance under imbalance: Data augmentation with cGAN and weighted loss improves sensitivity for rare classes (e.g., hemorrhagic strokes).
  • Integrated explainability: Native compatibility with post hoc methods produces credible clinical evidence overlays.

Limitations

  • Computational burden: Multi-axis attention blocks are resource-intensive, especially for large models or high-resolution images.
  • 2D dependency in domain adaptations: Loss of 3D contextual information in stack-wise image pipelines, limiting volumetric pathology assessment.
  • Synthetic data domain gap: cGAN-generated images may omit subtle clinical features, possibly affecting rare pathology performance.

Future Directions

  • Extension to 3D MaxViT for volumetric imaging modalities.
  • Development of computationally efficient variants (e.g., patch-pruned attention).
  • Integration of non-image patient data.
  • Enhanced XAI through alignment with expert annotations.

7. Pseudocode and Implementation Overview

A representative MaxViT block’s pseudo-implementation illustrates the sequential wiring of local and grid attention:

from einops import rearrange

def MaxViT_Block(x, P=7, G=7):
    # x: (batch, H, W, C); RelAttention, MLP, and LayerNorm are assumed callables
    # (relative self-attention, two-layer GeLU MLP, layer normalization).
    _, H, W, C = x.shape
    # Block (window) attention over non-overlapping P x P windows
    b1 = rearrange(LayerNorm(x), 'b (hb p1) (wb p2) c -> b (hb wb) (p1 p2) c', p1=P, p2=P)
    attn1 = RelAttention(b1)
    x = x + rearrange(attn1, 'b (hb wb) (p1 p2) c -> b (hb p1) (wb p2) c',
                      hb=H // P, wb=W // P, p1=P, p2=P)
    x = x + MLP(LayerNorm(x))
    # Grid attention over a fixed G x G grid spanning the whole feature map
    g1 = rearrange(LayerNorm(x), 'b (g1 hg) (g2 wg) c -> b (hg wg) (g1 g2) c', g1=G, g2=G)
    attn2 = RelAttention(g1)
    x = x + rearrange(attn2, 'b (hg wg) (g1 g2) c -> b (g1 hg) (g2 wg) c',
                      hg=H // G, wg=W // G, g1=G, g2=G)
    x = x + MLP(LayerNorm(x))
    return x
Key implementation parameters include: block and grid sizes P = G = 7, hidden dimension per head 32, MBConv expansion factor 4, SE reduction 0.25, stochastic depth to regularize deep variants, LayerNorm pre-attention, and BatchNorm within MBConv (Tu et al., 2022).

MaxViT’s design philosophy—interleaving efficient local and global attention at all levels, augmented by convolutional priors—enables its consistent strong performance in both generic computer vision and specialized scientific domains.
