
Attention-Free Shift Transformers (VAST)

Updated 27 March 2026
  • The paper introduces affine‐shift blocks that replace softmax attention with shift, scale, and bias operations, reducing quadratic complexity while maintaining accuracy.
  • VAST extends these blocks to the video domain by employing spatiotemporal channel shifts, achieving 79.0–80.0% top-1 accuracy on benchmarks like Kinetics-400 with lower FLOPs.
  • Empirical results show that dynamic affine transformations integrated into shift operations boost performance by around 1.5% top-1 on ImageNet, balancing efficiency and effectiveness.

Attention-Free Shift Transformers comprise a class of neural architectures that achieve competitive performance with transformers while eliminating the quadratic-complexity softmax attention mechanism. These models, based on spatial or temporal “shift” operations, employ either parameter-free or affine-modulated channel shifting to facilitate information mixing among tokens. The resulting modules, such as the Affine-Shift block, enable efficient, attention-free transformers for both vision and video understanding, culminating in the Video Affine-Shift Transformer (VAST). VAST demonstrates that careful design of shift-based token mixers with affine operations can yield models that are both computationally lightweight and highly accurate on major benchmarks.

1. The Affine-Shift Block: Architecture and Rationale

The Affine-Shift block serves as the core unit of the Attention-Free Shift Transformer family. Its primary role is to replace Multi-Head Self-Attention (MHSA) with an attention-free, shift-based spatial and temporal mixing operation.

Given input $X^{l-1} \in \mathbb{R}^{S \times d}$ at layer $l$, the standard transformer update consists of:

  • $Y^l = \mathrm{MHSA}(\mathrm{LN}(X^{l-1})) + X^{l-1}$
  • $X^l = \mathrm{MLP}(\mathrm{LN}(Y^l)) + Y^l$

The Affine-Shift block modifies this structure:

  1. Applies a channel-mixing linear projection $W_v$ to the LayerNorm'ed input $U = \mathrm{LN}(X^{l-1})$.
  2. Splits the channels into groups; each group is shifted spatially (and, for video, temporally) by one position along its assigned axis, resulting in $Z^l = \mathrm{Shift}(V^l, p, [\mathbf{b}])$, where $V^l = U W_v$ and $p$ indicates the shifted channel count per axis.
  3. Computes a dynamic, input-dependent scale $s^l$ via global average pooling, a 2-layer MLP, and a sigmoid activation, and a dynamic bias $b^l$ via a depthwise $3 \times 3$ convolution over $Z^l$.
  4. Forms the final mixed output via a channel-wise affine transformation, $\hat{Z}^l = Z^l \odot s^l + b^l$, followed by an output projection and residual connection: $Y^l = \hat{Z}^l W_h + X^{l-1}$.

Key ablations reveal that both scale and bias are essential, providing approximately 1.5% top-1 accuracy improvement each on ImageNet (Bulat et al., 2022).
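The four steps above can be sketched in NumPy for a single image. This is a minimal illustration, not the paper's exact configuration: the channel layout of the shift, the scale-MLP bottleneck width ($d/4$), and all shapes are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel axis (last dim).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def spatial_shift(v, p):
    # Shift the first p channels by one position: p//4 channels per direction
    # (+h, -h, +w, -w); vacated border slots are zero-filled, and the
    # remaining d - p channels stay in place.
    z, g = v.copy(), p // 4
    z[1:, :, 0:g] = v[:-1, :, 0:g];          z[0, :, 0:g] = 0       # +h
    z[:-1, :, g:2*g] = v[1:, :, g:2*g];      z[-1, :, g:2*g] = 0    # -h
    z[:, 1:, 2*g:3*g] = v[:, :-1, 2*g:3*g];  z[:, 0, 2*g:3*g] = 0   # +w
    z[:, :-1, 3*g:4*g] = v[:, 1:, 3*g:4*g];  z[:, -1, 3*g:4*g] = 0  # -w
    return z

def depthwise_conv3x3(z, k):
    # Per-channel 3x3 convolution with zero padding; k has shape (3, 3, d).
    H, W, _ = z.shape
    zp = np.pad(z, ((1, 1), (1, 1), (0, 0)))
    return sum(zp[i:i+H, j:j+W] * k[i, j] for i in range(3) for j in range(3))

def affine_shift_block(x, Wv, Wh, W1, W2, k_dw, p):
    # x: (H, W, d) grid of tokens for one image.
    v = layer_norm(x) @ Wv                        # 1. channel-mixing projection
    z = spatial_shift(v, p)                       # 2. parameter-free shift
    pooled = z.mean(axis=(0, 1))                  # 3a. scale: GAP -> 2-layer MLP -> sigmoid
    s = 1.0 / (1.0 + np.exp(-(np.maximum(pooled @ W1, 0.0) @ W2)))
    b = depthwise_conv3x3(z, k_dw)                # 3b. bias: depthwise 3x3 conv
    return (z * s + b) @ Wh + x                   # 4. affine, projection, residual
```

A convenient sanity check: with all weights at zero, the block reduces to the identity via the residual path.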

2. VAST: Video Affine-Shift Transformer Architecture

VAST generalizes the Affine-Shift paradigm to the video domain by enabling efficient token mixing across temporal as well as spatial axes. The architecture is a 4-stage hierarchical pyramid, analogous to Swin and MViT, but with all token mixing performed by Affine-Shift blocks.

Each VAST block shifts $p = \frac{1}{2}d$ channels, with one-sixth allocated to each of the six shift directions $(\pm t, \pm h, \pm w)$ for spatiotemporal coverage. Stages reduce spatial resolution and double channel depth after patch merging, mirroring standard hierarchical vision transformers.
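A minimal NumPy sketch of this six-direction spatiotemporal shift, assuming zero-filling at clip and frame boundaries; the channel-to-direction ordering is an illustrative choice:

```python
import numpy as np

def spatiotemporal_shift(v, p):
    # v: (T, H, W, d) video token grid. The first p channels are split into
    # six equal groups, one per direction (+t, -t, +h, -h, +w, -w); each
    # group moves one position along its axis and the vacated border slot is
    # zero-filled. The remaining d - p channels are left untouched.
    z, g, c = v.copy(), p // 6, 0
    for axis in range(3):                 # 0: t, 1: h, 2: w
        for step in (+1, -1):
            z[..., c:c+g] = np.roll(v[..., c:c+g], step, axis=axis)
            border = [slice(None)] * 3 + [slice(c, c + g)]
            border[axis] = 0 if step == 1 else -1   # wrap-around slot -> zero
            z[tuple(border)] = 0
            c += g
    return z
```

With $p = \frac{1}{2}d$ this matches the allocation above: half the channels shift, $d/12$ per direction.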

Pseudocode for the main data path:

X = PatchEmbedding2D(V)      # X ∈ ℝ^{T×(H*W/16)×C1}
for stage in range(4):
    for block in range(L_stage[stage]):
        X = AffineShiftBlock(X)
    if stage < 3:
        X = PatchMerge(X)
f = GlobalPool(X)
logits = Classifier(f)

Inside the AffineShiftBlock, the shift, scale (SE-MLP), and bias (3×3 DWConv) are applied as described above (Bulat et al., 2022). Drop-path is disabled so that no mixing step is skipped.

3. Empirical Performance and Efficiency

VAST and its affine-shift predecessors have demonstrated strong empirical results across image and video domains without attention:

  • ImageNet-1k Classification:
    • AST-Ti: 81.8% top-1, 19M params, 3.9G FLOPs.
    • AST-S: 82.8% top-1, 38M params, 6.8G FLOPs.
    • The affine-shift design approaches or exceeds the accuracy-efficiency trade-off of attention-based models in similar parameter and compute regimes (Bulat et al., 2022).
  • Kinetics-400 Video Recognition:
    • VAST-Ti-16: 79.0% top-1 @ 196G FLOPs.
    • VAST-S-16: 80.0% top-1 @ 338G FLOPs.
    • VAST outperforms competing transformer models (MViT-B: 64.7% top-1 @ 211G, XViT-B: 66.2% @ 850G, Swin-T: 69.6% @ 963G) at lower computational cost.
  • Something-Something-v2 (SSv2) Action Recognition:
    • VAST-Ti-32: 69.3% top-1, 196G FLOPs.
    • VAST-S-32: 70.9% top-1, 338G FLOPs.
    • VAST achieves within 1–2% of the top accuracy of much heavier models (e.g. XViT-B, ViViT-L), but requires 2–4× fewer FLOPs (Bulat et al., 2022).

These results indicate that affine-shift blocks provide a mechanistically distinct and computationally efficient route to deep contextual modeling in both images and videos.

4. Analysis: Shift Operation Mechanics and Channel Allocation

The shift operation itself is parameter-free and computationally lightweight. For each block:

  • A specified fraction of channels is shifted by exactly one position along an assigned axis, with remaining channels unshifted.
  • In VAST, the shift covers both spatial and temporal axes, splitting $p = \frac{1}{2}d$ channels into six equal groups for $(+t, -t, +h, -h, +w, -w)$.
  • Empirical analysis shows that shifting between 25% and 50% of channels is optimal, with negligible sensitivity within that range (ImageNet: 81.5–81.8% top-1) (Bulat et al., 2022).

Unlike attention, which computes pairwise interactions, the shift enforces strictly local (nearest-neighbor or adjacent frame) communication. Stacking many such blocks ensures a full receptive field.
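This locality-plus-depth argument can be checked numerically with a toy 1-D analogue (the per-token channel mean below is a stand-in for the block's learned projections, an assumption for illustration): a unit impulse propagated through stacked shift-and-mix layers widens its support by one token per side per layer.

```python
import numpy as np

def mix_and_shift(x):
    # Toy 1-D analogue of a shift block: mix channels within each token
    # (a simple mean, standing in for learned projections), then shift one
    # channel group right and one left by a single position (zero-filled).
    n, d = x.shape
    mixed = np.repeat(x.mean(axis=1, keepdims=True), d, axis=1)
    z, g = mixed.copy(), d // 4
    z[1:, :g] = mixed[:-1, :g];       z[0, :g] = 0       # shift by +1
    z[:-1, g:2*g] = mixed[1:, g:2*g]; z[-1, g:2*g] = 0   # shift by -1
    return z

x = np.zeros((21, 8))
x[10] = 1.0                          # unit impulse at the center token
for layer in (1, 2, 3):
    x = mix_and_shift(x)
    support = np.flatnonzero(np.abs(x).sum(axis=1))
    print(layer, support.min(), support.max())   # support width is 2*layer + 1
```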

5. Training, Implementation, and Ablation

VAST adopts standard transformer training recipes but with several notable implementation details:

  • AdamW optimizer, weight decay 0.05, cosine learning rate decay, and standard augmentations (random crops, MixUp, CutMix, etc.).
  • Drop-path regularization is not applied to Affine-Shift, because skipping the shift operation would remove all token mixing in that block.
  • Models are typically pretrained on ImageNet-1k, with weights transferred to the video recognition task (Kinetics-400/600, SSv2, EpicKitchens-100).
  • Patch embedding is 2D for Kinetics/EpicKitchens, 3D for SSv2.

Ablation studies demonstrate the criticality of the affine components (scale, bias) within the shift block. Removing either reduces performance by ≈1.5% top-1, indicating that the affine components are necessary to approximate the content-dependent reweighting performed by softmax attention (Bulat et al., 2022).

6. Comparison with Other Attention-Free Mixers

Attention-Free Shift Transformers, and VAST in particular, stand in contrast with other families of efficient token mixers:

| Model | Token Mixing Kernel | Parameters | FLOPs | Coverage |
| --- | --- | --- | --- | --- |
| ShiftViT | 1-pixel shift | 0 (shift) | 0 | Local |
| Affine-Shift | Shift + scale/bias | 2 MLPs, DWConv | Low | Local |
| HSM | Hierarchical shifts | Varies | Linear | Global (log layers) |
| Softmax Attention | Dot-product + softmax | High | Quadratic | Global |

VAST and its predecessors are unique in applying a scale–bias composite operation strictly to local neighborhoods while achieving a receptive field comparable to that of global attention via deep stacking. Unlike HSM or kernelized attention, VAST does not rely on densely computed, content-adaptive pairwise weights.
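The asymptotic contrast in the table can be made concrete with schematic per-layer FLOP counts. The formulas below are back-of-the-envelope assumptions (projection-dominated costs, illustrative constants), not figures from the paper:

```python
def attention_mixer_flops(S, d):
    # Schematic cost of one softmax-attention layer on S tokens of width d:
    # Q/K/V/output projections (4*S*d^2) plus the S x S attention matrix and
    # the weighted sum over values (2*S^2*d). The second term is quadratic in S.
    return 4 * S * d * d + 2 * S * S * d

def affine_shift_mixer_flops(S, d):
    # Two dense projections (Wv, Wh) dominate at 2*S*d^2; the shift itself is
    # free, and the 3x3 depthwise conv adds a lower-order 9*S*d term. The
    # tiny scale MLP (independent of S) is omitted. Linear in S throughout.
    return 2 * S * d * d + 9 * S * d

for S in (196, 784, 3136):   # token counts at successively finer resolutions
    r = attention_mixer_flops(S, 384) / affine_shift_mixer_flops(S, 384)
    print(f"S={S}: attention/shift cost ratio ~ {r:.1f}")
```

The ratio grows with the token count, which is why the gap widens for video, where $S$ scales with $T \times H \times W$.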

7. Significance, Limitations, and Outlook

The attention-free paradigm, as embodied by VAST, demonstrates that transformer-style deep context modeling does not inherently require either dense or sparse attention kernels. Rather, lightweight shift-based modules, augmented with dynamic scale and bias, suffice for high-accuracy image and video recognition at a fraction of the computational burden (Bulat et al., 2022).

However, while affine-shift blocks approximate the channel- and sample-adaptivity of attention, a measurable gap remains relative to the very best dense-attention models in absolute accuracy. A plausible implication is that globally content-normalized mixing may benefit certain tasks or ultra-deep models.

Extensions to the affine-shift architecture include deeper hierarchies, alternate dynamic reweighting (e.g., gating), or interleaving with selective attention layers for hybrid architectures. The architectural design of VAST is readily compatible with mainstream vision transformer pipelines and can serve as a drop-in alternative for scenarios in which efficiency is paramount.

References (1)
