ACC-ViT: Atrous Multi-Branch Vision Transformer
- The paper’s main contribution is the novel fusion of regional and sparse attention via an atrous multi-branch architecture.
- It employs parallel dilated attention and convolution branches with adaptive gating, resulting in enhanced accuracy and parameter efficiency.
- Empirical results on ImageNet and medical benchmarks demonstrate that ACC-ViT outperforms traditional CNNs and transformer-only models.
The Atrous Multi-Branch Formulation (ACC-ViT) is a contemporary architectural paradigm in vision transformer design in which multi-scale local and global context is encapsulated using a spatial pyramid of dilated (“atrous”) attention windows and convolutions. This approach directly fuses regional (windowed) and sparse (grid) attention schemes by deploying learnable, parallelized attention branches at multiple dilation levels and adaptively fusing their outputs via a gating mechanism. The design yields a hybrid architecture that consistently surpasses both pure CNN and transformer-only models across a range of vision benchmarks, including classification, detection, segmentation, and zero-shot tasks (Ibtehaz et al., 2024).
1. Principle of Atrous Multi-Branch Formulation
At the core of ACC-ViT is the intuition that vision models must simultaneously capture fine-grained details and long-range dependencies. Standard regional (windowed) attention restricts interactions to local non-overlapping regions of the image, effectively preserving hierarchical spatial structure but at the cost of losing global contextual information. In contrast, sparse (grid-based) attention enables broader context aggregation by interacting across distant image regions, but can dilute essential locality and introduce irrelevant dependencies.
ACC-ViT resolves this duality by constructing parallel sets of feature windows on the input tensor $X$: an undilated partition (dilation rate $r = 1$) and dilated (stride-$r$) partitions ($r > 1$, with more dilation levels in the early stages). Each partition undergoes standard windowed multi-head self-attention (W-MHSA), and the per-branch outputs are adaptively gated and fused before further processing by a shared MLP (Ibtehaz et al., 2024).
2. Formal Definition and Mathematical Structure
Regional (Windowed) Attention
Divide the feature map $X \in \mathbb{R}^{H \times W \times C}$ into non-overlapping windows of size $w \times w$. For each window $X_i$, self-attention is computed as
$$\mathrm{Attn}(X_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$
where $Q_i, K_i, V_i$ are linear projections of $X_i$.
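As a concrete illustration, here is a minimal NumPy sketch of regional attention: a single head with no learned projections, using $Q = K = V$ equal to the window itself for brevity. The function names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping (w*w, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, w):
    """Single-head self-attention inside each w*w window (Q = K = V = window)."""
    wins = window_partition(x, w)            # (num_windows, w*w, C)
    scores = wins @ wins.transpose(0, 2, 1)  # (num_windows, w*w, w*w)
    attn = softmax(scores / np.sqrt(wins.shape[-1]))
    return attn @ wins                       # same shape as wins
```

Each window attends only within itself, which is what confines interactions to local regions.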
Sparse (Grid) Attention
Sample $X$ on a coarse grid of stride $s$: tokens sharing the same offset modulo $s$ are grouped into a window, so each of the $s^2$ windows spans the entire spatial extent and attention reaches across distant regions.
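A minimal sketch of the stride-$s$ grid sampling, assuming the standard grouping of tokens by their spatial offset modulo the stride (the function name is illustrative):

```python
import numpy as np

def grid_partition(x, s):
    """Group tokens of an (H, W, C) map by their offset modulo stride s.
    Each of the s*s groups contains tokens spaced s apart, so every
    group spans the whole image, giving attention global reach."""
    H, W, C = x.shape
    x = x.reshape(H // s, s, W // s, s, C)  # (H/s, s, W/s, s, C)
    x = x.transpose(1, 3, 0, 2, 4)          # (s, s, H/s, W/s, C)
    return x.reshape(s * s, (H // s) * (W // s), C)
```

For a 4×4 map with stride 2, group 0 collects the tokens at positions (0,0), (0,2), (2,0), (2,2): distant tokens end up in the same attention window.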
Atrous Attention (Fusion)
For each dilation rate $r$, the atrous windowing operator $\mathcal{W}_r$ extracts stride-$r$ sampled windows, $\mathcal{W}_r(X)_{i,j} = X_{r \cdot i,\, r \cdot j}$ (applied to each of the $r^2$ shifted subgrids). Windowed attention is computed on every branch in parallel: $A_r = \text{W-MHSA}(\mathcal{W}_r(X))$.
The branch outputs are then fused using an adaptive gating network: input-conditioned gate weights $g_r$, normalized across branches, weight each branch output, $Y = \sum_r g_r \odot A_r$.
The fused output is propagated through a shared two-layer MLP, yielding the final block output (Ibtehaz et al., 2024).
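The atrous sampling and adaptive fusion can be sketched as follows. This is a simplified NumPy illustration rather than the paper's implementation: `W_gate` stands in for the learned gating head, and the branch attention outputs are taken as already scattered back to full resolution.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def atrous_sample(x, r):
    """Collect the r*r stride-r shifted subgrids of an (H, W, C) map;
    each subgrid provides the token layout for one dilated window set."""
    return [x[a::r, b::r] for a in range(r) for b in range(r)]

def gated_fusion(branch_outputs, x, W_gate):
    """Adaptively fuse per-dilation branch outputs (each (H, W, C)) with
    gates computed from the globally pooled input. W_gate is a stand-in
    for the learned gating network."""
    pooled = x.mean(axis=(0, 1))   # (C,) global average pool
    g = softmax(pooled @ W_gate)   # one gate per branch, summing to 1
    return sum(gk * yk for gk, yk in zip(g, branch_outputs))
```

Because the gates depend on the pooled input, different images can emphasize different dilation levels.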
3. Architectural and Implementation Choices
ACC-ViT stacks blocks alternating between Atrous Multi-Branch Attention and Atrous Multi-Branch Inverted Residual Convolution (MBConv) layers. In the MBConv branch, the 3×3 depthwise convolution is replaced by parallel atrous convolutions at multiple dilation rates, each gated and then fused before channel reduction and residual addition. Both attention and convolution branches use adaptive gating, and all block outputs are fused via a single shared MLP.
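A hedged NumPy sketch of the parallel atrous depthwise convolutions with gated fusion. For brevity the gates are passed in as fixed weights and the pointwise channel mixing is omitted; in ACC-ViT the gates are input-adaptive.

```python
import numpy as np

def dilated_dwconv3x3(x, k, r):
    """Depthwise 3x3 convolution with dilation r on an (H, W, C) map,
    zero-padded so the output shape matches the input.
    k: (3, 3, C) per-channel kernel."""
    H, W, C = x.shape
    p = r  # padding needed for a dilated 3x3 kernel
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i * r:i * r + H, j * r:j * r + W]
    return out

def atrous_mbconv_branch(x, kernels, rates, gates):
    """Parallel dilated depthwise convs, fused with gate weights, then a
    residual add (channel reduction omitted for brevity)."""
    y = sum(g * dilated_dwconv3x3(x, k, r)
            for g, k, r in zip(gates, kernels, rates))
    return x + y
```

With an identity kernel (center tap 1, all else 0) each dilated branch returns its input unchanged, which makes the fusion easy to verify.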
Key architectural parameters include:
- Embedding dimension per stage, which scales with the model variant.
- Head count: smallest in the Tiny variant, increasing up to the Base variant.
- Window size: fixed across all stages.
- Dilation sets: largest at the highest spatial resolution, reduced incrementally in subsequent stages (Ibtehaz et al., 2024).
4. Training Regimes and Optimization
ACC-ViT models are trained on ImageNet-1K for 400 epochs using AdamW with a variant-dependent base learning rate (reduced for larger models), weight decay of $0.05$, and a cosine annealing schedule with a 32-epoch linear warmup. Augmentation utilizes RandAugment, Mixup, CutMix, and label smoothing ($0.1$). Stochastic depth rates extend up to $0.3$ for larger variants, and an EMA of model weights is updated every 32 steps. All projections employ Xavier initialization, except transformer projections, which use TorchVision defaults (Ibtehaz et al., 2024).
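The schedule can be sketched as a standard cosine decay with linear warmup; `min_lr` and the exact shape of the warmup ramp are assumptions, since the paper's summary above only fixes the epoch counts.

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=400, warmup_epochs=32, min_lr=0.0):
    """Cosine-annealed learning rate with linear warmup, matching the
    400-epoch / 32-epoch-warmup recipe described in the text."""
    if epoch < warmup_epochs:
        # Linear ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

At the midpoint of the decay phase the rate is halfway between `base_lr` and `min_lr`, as expected for a cosine schedule.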
5. Benchmarking, Ablation, and Empirical Insights
Performance Comparison
On ImageNet-1K (224×224), ACC-ViT-Tiny ($28.4$M params, $5.7$G FLOPs) reaches higher top-1 accuracy than MaxViT-Tiny while using fewer parameters. ACC-ViT-Small ($62.9$M params, $11.6$G FLOPs) likewise surpasses MaxViT-Small in top-1 accuracy with fewer parameters. All ACC-ViT variants dominate the accuracy–compute Pareto front among ViTs and multi-axis transformer designs (Ibtehaz et al., 2024).
Ablation Studies
Analysis (nano variant) reveals the impact of the individual components:
| Backbone Modification | Acc (%) |
|---|---|
| Single-branch MBConv + windowed attention | 79.50 |
| + Atrous convolutions + multi-branch attention | 81.52 |
| + Adaptive gating (conv and attn branches) | 82.41 |
| Replace shared MLP by ConvNeXt block | 81.85 |
Adaptive gating of the branches improves top-1 accuracy by $0.89$ points ($81.52\% \to 82.41\%$), and the shared MLP between branches is superior to a ConvNeXt-style alternative ($82.41\%$ vs. $81.85\%$). Adding parallel atrous convolutions and multi-branch attention jointly yields a $2.02$-point top-1 gain over the single-branch baseline ($79.50\% \to 81.52\%$). The costs are limited: a modest increase in parameters and FLOPs delivers these accuracy gains (Ibtehaz et al., 2024).
6. Broader Applicability and Limitations
ACC-ViT demonstrates substantial advantages in downstream tasks beyond classification. On medical image analysis benchmarks (HAM10000, EyePACS, BUSI), ACC-ViT-T consistently outperforms Swin-T, ConvNeXt-T, and MaxViT-T. In frozen-backbone detection/segmentation (using Mask R-CNN + FPN on COCO), ACC-ViT-T beats MaxViT-T in both box AP and mask AP at multiple resolutions. In zero-shot vision-language retrieval (ELEVATER benchmark), ACC-ViT-T surpasses MaxViT-T and ConvNeXt-T on 13/20 tasks (Ibtehaz et al., 2024).
Limitations include reliance on fixed dilation rates, lack of scaling experiments to very large backbones or extremely high input resolutions, and the added inference overhead of the gating network. Further research into learnable or conditional dilation rates and additional efficiency optimizations are proposed as future directions (Ibtehaz et al., 2024).
7. Theoretical and Empirical Significance
ACC-ViT’s Atrous Multi-Branch formulation fundamentally bridges the gap between local-hierarchical and global-context modeling in vision transformers. By constructing aligned sets of dilated attention windows and fusing them via input-dependent gating, the model adaptively emphasizes informative context at every spatial position. This allows ACC-ViT not only to outperform pure windowed or pure grid/sparse models but also to surpass hybrid baselines such as MaxViT and ConvNeXt in parameter efficiency and accuracy. The design paradigm is extensible to other architectures and modalities wherever multiscale feature integration and adaptive spatial context are necessary (Ibtehaz et al., 2024).