ACC-ViT: Atrous Multi-Branch Vision Transformer
- The paper’s main contribution is the novel fusion of regional and sparse attention via an atrous multi-branch architecture.
- It employs parallel dilated attention and convolution branches with adaptive gating, resulting in enhanced accuracy and parameter efficiency.
- Empirical results on ImageNet and medical benchmarks demonstrate that ACC-ViT outperforms traditional CNNs and transformer-only models.
The Atrous Multi-Branch Formulation (ACC-ViT) is a contemporary architectural paradigm in vision transformer design in which multi-scale local and global context is encapsulated using a spatial pyramid of dilated (“atrous”) attention windows and convolutions. This approach directly fuses regional (windowed) and sparse (grid) attention schemes by deploying learnable, parallelized attention branches at multiple dilation levels and adaptively fusing their outputs via a gating mechanism. The design yields a hybrid architecture that consistently surpasses both pure CNN and transformer-only models across a range of vision benchmarks, including classification, detection, segmentation, and zero-shot tasks (Ibtehaz et al., 2024).
1. Principle of Atrous Multi-Branch Formulation
At the core of ACC-ViT is the intuition that vision models must simultaneously capture fine-grained details and long-range dependencies. Standard regional (windowed) attention restricts interactions to local non-overlapping regions of the image, effectively preserving hierarchical spatial structure but at the cost of losing global contextual information. In contrast, sparse (grid-based) attention enables broader context aggregation by interacting across distant image regions, but can dilute essential locality and introduce irrelevant dependencies.
ACC-ViT resolves this duality by constructing parallel sets of feature windows on the input tensor $X$: an undilated partition (dilation rate $r = 1$) and dilated (stride-$r$) partitions ($r > 1$, with more dilation levels in the early stages). Each partition undergoes standard windowed multi-head self-attention (W-MHSA), and the per-branch outputs are adaptively gated and fused before further processing by a shared MLP (Ibtehaz et al., 2024).
2. Formal Definition and Mathematical Structure
Regional (Windowed) Attention
Divide the feature map $X \in \mathbb{R}^{H \times W \times C}$ into non-overlapping windows of size $w \times w$. For each window $X_i$, self-attention is computed as
$$\mathrm{Attn}(X_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$
where $Q_i, K_i, V_i$ are linear projections of $X_i$.
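As a concrete illustration, here is a minimal NumPy sketch of regional attention: a single head with no learned projections, using $Q = K = V$ equal to the window itself for brevity. The function names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping (w*w, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, w):
    """Single-head self-attention inside each w*w window (Q = K = V = window)."""
    wins = window_partition(x, w)            # (num_windows, w*w, C)
    scores = wins @ wins.transpose(0, 2, 1)  # (num_windows, w*w, w*w)
    attn = softmax(scores / np.sqrt(wins.shape[-1]))
    return attn @ wins                       # same shape as wins
```

Each window attends only within itself, which is what confines interactions to local regions.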
Sparse (Grid) Attention
Sample $X$ on a coarse grid of stride $s$: tokens sharing the same offset modulo $s$ are grouped into a window, so each of the $s^2$ windows spans the entire spatial extent and attention reaches across distant regions.
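A minimal sketch of the stride-$s$ grid sampling, assuming the standard grouping of tokens by their spatial offset modulo the stride (the function name is illustrative):

```python
import numpy as np

def grid_partition(x, s):
    """Group tokens of an (H, W, C) map by their offset modulo stride s.
    Each of the s*s groups contains tokens spaced s apart, so every
    group spans the whole image, giving attention global reach."""
    H, W, C = x.shape
    x = x.reshape(H // s, s, W // s, s, C)  # (H/s, s, W/s, s, C)
    x = x.transpose(1, 3, 0, 2, 4)          # (s, s, H/s, W/s, C)
    return x.reshape(s * s, (H // s) * (W // s), C)
```

For a 4×4 map with stride 2, group 0 collects the tokens at positions (0,0), (0,2), (2,0), (2,2): distant tokens end up in the same attention window.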
Atrous Attention (Fusion)
For each dilation rate $r$, the atrous windowing operator $\mathcal{W}_r$ extracts stride-$r$ sampled windows, $\mathcal{W}_r(X)_{i,j} = X_{r \cdot i,\, r \cdot j}$ (applied to each of the $r^2$ shifted subgrids). Windowed attention is computed on every branch in parallel: $A_r = \text{W-MHSA}(\mathcal{W}_r(X))$.
The branch outputs are then fused using an adaptive gating network: input-conditioned gate weights $g_r$, normalized across branches, weight each branch output, $Y = \sum_r g_r \odot A_r$.
The fused output is propagated through a shared two-layer MLP, yielding the final block output (Ibtehaz et al., 2024).
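The atrous sampling and adaptive fusion can be sketched as follows. This is a simplified NumPy illustration rather than the paper's implementation: `W_gate` stands in for the learned gating head, and the branch attention outputs are taken as already scattered back to full resolution.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def atrous_sample(x, r):
    """Collect the r*r stride-r shifted subgrids of an (H, W, C) map;
    each subgrid provides the token layout for one dilated window set."""
    return [x[a::r, b::r] for a in range(r) for b in range(r)]

def gated_fusion(branch_outputs, x, W_gate):
    """Adaptively fuse per-dilation branch outputs (each (H, W, C)) with
    gates computed from the globally pooled input. W_gate is a stand-in
    for the learned gating network."""
    pooled = x.mean(axis=(0, 1))   # (C,) global average pool
    g = softmax(pooled @ W_gate)   # one gate per branch, summing to 1
    return sum(gk * yk for gk, yk in zip(g, branch_outputs))
```

Because the gates depend on the pooled input, different images can emphasize different dilation levels.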
3. Architectural and Implementation Choices
ACC-ViT stacks blocks alternating between Atrous Multi-Branch Attention and Atrous Multi-Branch Inverted Residual Convolution (MBConv) layers. In the MBConv branch, the 3×3 depthwise convolution is replaced by parallel atrous convolutions at multiple dilation rates, each gated and then fused before channel reduction and residual addition. Both attention and convolution branches use adaptive gating, and all block outputs are fused via a single shared MLP.
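A hedged NumPy sketch of the parallel atrous depthwise convolutions with gated fusion. For brevity the gates are passed in as fixed weights and the pointwise channel mixing is omitted; in ACC-ViT the gates are input-adaptive.

```python
import numpy as np

def dilated_dwconv3x3(x, k, r):
    """Depthwise 3x3 convolution with dilation r on an (H, W, C) map,
    zero-padded so the output shape matches the input.
    k: (3, 3, C) per-channel kernel."""
    H, W, C = x.shape
    p = r  # padding needed for a dilated 3x3 kernel
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i * r:i * r + H, j * r:j * r + W]
    return out

def atrous_mbconv_branch(x, kernels, rates, gates):
    """Parallel dilated depthwise convs, fused with gate weights, then a
    residual add (channel reduction omitted for brevity)."""
    y = sum(g * dilated_dwconv3x3(x, k, r)
            for g, k, r in zip(gates, kernels, rates))
    return x + y
```

With an identity kernel (center tap 1, all else 0) each dilated branch returns its input unchanged, which makes the fusion easy to verify.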
Key architectural parameters include:
- Embedding dimension per stage, which scales with the model variant.
- Head count: smallest in the Tiny variant, increasing up to the Base variant.
- Window size: fixed across all stages.
- Dilation sets: largest at the highest spatial resolution, reduced incrementally in subsequent stages (Ibtehaz et al., 2024).
4. Training Regimes and Optimization
ACC-ViT models are trained on ImageNet-1K for 400 epochs using AdamW with a variant-dependent base learning rate (reduced for larger models), weight decay of $0.05$, and a cosine annealing schedule with a 32-epoch linear warmup. Augmentation utilizes RandAugment, Mixup, CutMix, and label smoothing ($0.1$). Stochastic depth rates extend up to $0.3$ for larger variants, and an EMA of model weights is updated every 32 steps. All projections employ Xavier initialization, except transformer projections, which use TorchVision defaults (Ibtehaz et al., 2024).
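The schedule can be sketched as a standard cosine decay with linear warmup; `min_lr` and the exact shape of the warmup ramp are assumptions, since the paper's summary above only fixes the epoch counts.

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=400, warmup_epochs=32, min_lr=0.0):
    """Cosine-annealed learning rate with linear warmup, matching the
    400-epoch / 32-epoch-warmup recipe described in the text."""
    if epoch < warmup_epochs:
        # Linear ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

At the midpoint of the decay phase the rate is halfway between `base_lr` and `min_lr`, as expected for a cosine schedule.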
5. Benchmarking, Ablation, and Empirical Insights
Performance Comparison
On ImageNet-1K (224×224), ACC-ViT-Tiny ($28.4$M params, $5.7$G FLOPs) reaches higher top-1 accuracy than MaxViT-Tiny while using fewer parameters. ACC-ViT-Small ($62.9$M params, $11.6$G FLOPs) likewise surpasses MaxViT-Small in top-1 accuracy with fewer parameters. All ACC-ViT variants dominate the accuracy–compute Pareto front among ViTs and multi-axis transformer designs (Ibtehaz et al., 2024).
Ablation Studies
Analysis (nano variant) reveals the impact of the individual components:
| Backbone Modification | Acc (%) |
|---|---|
| Single-branch MBConv + windowed attention | 79.50 |
| + Atrous convolutions + multi-branch attention | 81.52 |
| + Adaptive gating (conv and attn branches) | 82.41 |
| Replace shared MLP by ConvNeXt block | 81.85 |
Adaptive gating of the branches improves top-1 accuracy by $0.89$ points ($81.52\% \to 82.41\%$), and the shared MLP between branches is superior to a ConvNeXt-style alternative ($82.41\%$ vs. $81.85\%$). Adding parallel atrous convolutions and multi-branch attention jointly yields a $2.02$-point top-1 gain over the single-branch baseline ($79.50\% \to 81.52\%$). The costs are limited: a modest increase in parameters and FLOPs delivers these accuracy gains (Ibtehaz et al., 2024).
6. Broader Applicability and Limitations
ACC-ViT demonstrates substantial advantages in downstream tasks beyond classification. On medical image analysis benchmarks (HAM10000, EyePACS, BUSI), ACC-ViT-T consistently outperforms Swin-T, ConvNeXt-T, and MaxViT-T. In frozen-backbone detection/segmentation (using Mask R-CNN + FPN on COCO), ACC-ViT-T beats MaxViT-T in both box AP and mask AP at multiple resolutions. In zero-shot vision-language retrieval (ELEVATER benchmark), ACC-ViT-T surpasses MaxViT-T and ConvNeXt-T on 13/20 tasks (Ibtehaz et al., 2024).
Limitations include reliance on fixed dilation rates, lack of scaling experiments to very large backbones or extremely high input resolutions, and the added inference overhead of the gating network. Further research into learnable or conditional dilation rates and additional efficiency optimizations are proposed as future directions (Ibtehaz et al., 2024).
7. Theoretical and Empirical Significance
ACC-ViT’s Atrous Multi-Branch formulation fundamentally bridges the gap between local-hierarchical and global-context modeling in vision transformers. By constructing aligned sets of dilated attention windows and fusing them via input-dependent gating, the model adaptively emphasizes informative context at every spatial position. This allows ACC-ViT not only to outperform pure windowed or pure grid/sparse models but also to surpass hybrid baselines such as MaxViT and ConvNeXt in parameter efficiency and accuracy. The design paradigm is extensible to other architectures and modalities wherever multiscale feature integration and adaptive spatial context are necessary (Ibtehaz et al., 2024).