Sliced Recursive Transformers (SReT)

Updated 9 December 2025
  • The SReT framework is a recursive vision transformer that reuses shared MHSA and FFN weights to significantly increase virtual depth with minimal additional parameters.
  • It interleaves non-shared non-linear projection layers (NLLs) to prevent degenerate identity mappings, ensuring effective training even in deep configurations.
  • Group-wise self-attention slicing in SReT reduces FLOPs by up to 30% while maintaining accuracy, as demonstrated on benchmarks like ImageNet-1K.

The Sliced Recursive Transformer (SReT) is a parameter-efficient architectural framework for vision transformers (ViT) that introduces recursive operations and a grouped self-attention approximation to achieve greater representational depth and efficiency without a significant increase in parameters or computational overhead. The key innovations are a recursive weight-sharing mechanism across transformer blocks and a group-wise self-attention approximation, which together permit the construction of deep, highly parameter-efficient transformers suitable for large-scale image recognition tasks and beyond (Shen et al., 2021).

1. Recursive Operation and Block Design

The SReT architecture is based on recursively applying a single transformer block’s weights multiple times within each depth slot. A standard transformer block is defined as

$$\mathbf{z}_{\ell}' = \mathrm{MHSA}\big(\mathrm{LN}(\mathbf{z}_{\ell-1})\big) + \mathbf{z}_{\ell-1}, \qquad \mathbf{z}_\ell = \mathrm{FFN}\big(\mathrm{LN}(\mathbf{z}_{\ell}')\big) + \mathbf{z}_{\ell}'$$

where MHSA denotes multi-head self-attention and FFN denotes a feed-forward subnetwork. In SReT, at the $\ell$-th depth slot, the same composite block $\mathcal{F}$ is applied iteratively, with independent, non-shared non-linear projection layers (NLLs) interleaved to prevent degenerate identity transformations. The iteration proceeds as

$$\mathbf{h}^{(0)} = \mathbf{z}_{\ell-1}, \qquad \mathbf{h}^{(i)} = \begin{cases} \mathcal{F}\big(\mathbf{h}^{(i-1)}\big), & \text{if } i \text{ odd}, \\ \mathrm{NLL}\big(\mathbf{h}^{(i-1)}\big), & \text{if } i \text{ even}, \end{cases} \qquad i = 1, \dots, 2B-1,$$

so that $\mathcal{F}$ is applied $B$ times with $B-1$ intervening NLLs, and the final output is $\mathbf{z}_\ell = \mathbf{h}^{(2B-1)}$. Across the $B$ loops, the MHSA and FFN weights are tied, while each intervening NLL uses its own parameters.

The depth of the network is thus effectively increased (virtual depth $L \cdot B$ for $L$ base blocks and $B$ recursions), but the number of unique parameter sets grows only as $L \cdot (\mathrm{MHSA} + \mathrm{FFN}) + L \cdot (B-1) \cdot \mathrm{NLL}$, where an NLL is typically much smaller than the MHSA or FFN. In practice, $B = 2$ is sufficient for a $2\times$ deeper network at only minimal additional parameter cost.
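
This recursion is compact to express in code. Below is a minimal PyTorch sketch, assuming a standard pre-norm block and a LayerNorm–Linear–GELU NLL with $B = 2$ by default; the class names and the exact NLL shape are illustrative choices, not the reference implementation:

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Standard pre-norm block: z' = MHSA(LN(z)) + z, z_out = FFN(LN(z')) + z'."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        z = self.attn(h, h, h, need_weights=False)[0] + z
        return self.ffn(self.norm2(z)) + z


class RecursiveBlock(nn.Module):
    """One depth slot: the shared block F is applied B times, with a
    non-shared NLL projection inserted between consecutive loops."""

    def __init__(self, dim: int, num_heads: int, recursions: int = 2):
        super().__init__()
        self.shared = TransformerBlock(dim, num_heads)  # tied MHSA/FFN weights
        self.nlls = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(recursions - 1)              # one unique NLL per gap
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.shared(x)          # first application of F
        for nll in self.nlls:
            x = nll(x)              # non-shared projection breaks pure weight tying
            x = self.shared(x)      # same F reused: virtual depth grows, params do not
        return x
```

Because `self.shared` is a single module invoked on every loop, its MHSA and FFN weights are tied across all $B$ applications, while each entry of `self.nlls` carries its own parameters.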

2. Weight Sharing and Parameter Efficiency

The recursive weight-sharing mechanism in SReT ensures that the majority of network parameters (i.e., the MHSA and FFN) are reused across multiple virtual layers. Specifically, each recursive block at position $\ell$ maintains one set of MHSA weights $\{W^Q, W^K, W^V, W^O\}$ and FFN weights $\{W_1, b_1, W_2, b_2\}$, both shared across the $B$ recursions. Only the interleaved NLLs are endowed with unique weights per recursion.

Parameter efficiency is thus greatly increased: relative to a standard $L$-layer ViT with $P$ parameters, SReT with $B$ recursions has roughly $P + L (B-1) P_{\mathrm{NLL}}$ parameters, which is typically under $1.1P$. This allows highly compact models (e.g., 13–15M parameters for over 100 or even 1000 shared layers) and mitigates the optimization difficulties typically encountered at extreme depths (Shen et al., 2021).
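
A back-of-the-envelope count makes the $1.1P$ bound concrete. The sketch below uses illustrative values (width 384, MLP ratio 4, one $D \times D$ linear layer per NLL) and ignores biases, LayerNorms, and the patch/classifier parameters:

```python
# Rough per-block parameter counts for a ViT-style block of width `dim`,
# counting only the dominant weight matrices.
def vit_params(L, dim, mlp_ratio=4):
    mhsa = 4 * dim * dim                    # W^Q, W^K, W^V, W^O
    ffn = 2 * mlp_ratio * dim * dim         # W_1 and W_2
    return L * (mhsa + ffn)

def sret_params(L, dim, B=2, mlp_ratio=4):
    shared = vit_params(L, dim, mlp_ratio)  # MHSA + FFN reused across all B loops
    nll = dim * dim                         # assume one Linear(dim, dim) per NLL
    return shared + L * (B - 1) * nll       # P + L*(B-1)*P_NLL

L, dim = 12, 384
print(sret_params(L, dim, B=2) / vit_params(L, dim))   # ~1.08, i.e. under 1.1P
# Virtual depth is L*B = 24 blocks for roughly the parameter budget of 12.
```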

3. Sliced Group Self-Attention Approximation

Naive recursion increases both the representational depth and the computational FLOPs. To decouple these, SReT introduces a group-wise self-attention approximation, partitioning tokens (or channels) into groups and running smaller multi-head self-attentions in each recursion loop. For a single block at depth $\ell$, with sequence length $L_\ell$ and hidden dimension $D_\ell$, the global MHSA cost is

$$C_{\mathrm{V\text{-}SA}} = O\big(L_\ell^2 D_\ell\big)$$

When the attention is split into $G_\ell$ groups per recursion, the cost becomes

$$C_{\mathrm{G\text{-}SA}} = N_\ell \, G_\ell \cdot O\Big(\big(L_\ell / G_\ell\big)^2 D_\ell\Big) = O\Big(\frac{N_\ell}{G_\ell}\, L_\ell^2 D_\ell\Big)$$

where $N_\ell$ is the number of recursion loops (typically $B$). With $N_\ell = G_\ell$, the cost matches that of global attention; if $G_\ell > N_\ell$, the cost drops below the global baseline, at a possible cost in representational capacity if the groups become too small.

A critical implementation detail is the permutation of token order before grouping, which ensures global information flow via alternating, permuted group attentions. Empirically, configurations such as $[8, 2]$ groups for two recursions achieve approximately 10–30% FLOPs reduction at equivalent accuracy (Shen et al., 2021).
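
The grouped attention with a token permutation can be sketched as follows. The snippet assumes the sequence is sliced into `groups` equal chunks and that a single random permutation is applied before grouping; it illustrates the approximation rather than reproducing the paper's exact permutation scheme:

```python
import torch
import torch.nn as nn


class GroupSelfAttention(nn.Module):
    """Approximate global MHSA by attending within G token groups.
    Permuting token order before grouping lets successive (differently
    permuted) group attentions exchange information across groups."""

    def __init__(self, dim: int, num_heads: int, groups: int):
        super().__init__()
        self.groups = groups
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
        bsz, n, d = x.shape
        g = self.groups
        x = x[:, perm]                                # permute token order
        x = x.reshape(bsz * g, n // g, d)             # slice sequence into g groups
        x = self.attn(x, x, x, need_weights=False)[0]
        x = x.reshape(bsz, n, d)
        return x[:, torch.argsort(perm)]              # restore original token order


# Per call the attention cost is ~(n/g)^2 * d per group, i.e. n^2 * d / g in
# total, versus n^2 * d for one global attention over all n tokens.
x = torch.randn(2, 196, 192)                          # 14x14 patches, width 192
gsa = GroupSelfAttention(dim=192, num_heads=3, groups=4)
out = gsa(x, perm=torch.randperm(196))                # shape (2, 196, 192)
```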

4. Computational Efficiency and Scalability

Computational overhead in SReT is a function of recursion depth and group size. For $F_{\mathrm{base}}$ FLOPs in a standard ViT block and group fraction $r = 1/G$, recursion with group-wise MHSA incurs approximately $B \cdot r \cdot F_{\mathrm{base}}$, with negligible additional permutation overhead. When $r < 1$ and $B > 1$, this yields up to 30% computational savings.

Scaling properties are enhanced by internal recursion (repeating each block's own weights) rather than external loops (cycling all layers through a global recursion). SReT is empirically stable and trainable up to more than 100, or even more than 1000, virtual layers, enabled by learnable residual scalars (LRC) applied to both the MHSA and FFN branches:

$$\mathbf{z}' = \alpha\,\mathrm{MHSA}(\mathrm{LN}(\mathbf{z})) + \beta\,\mathbf{z}, \qquad \mathbf{z}^{+} = \gamma\,\mathrm{FFN}(\mathrm{LN}(\mathbf{z}')) + \delta\,\mathbf{z}'$$

with $\alpha, \beta, \gamma, \delta$ learned parameters initialized to 1.
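
A block with these learnable residual scalars can be sketched as below (a minimal illustration assuming one scalar per branch; the class name and layer shapes are not taken from the reference code):

```python
import torch
import torch.nn as nn


class LRCBlock(nn.Module):
    """Pre-norm block with learnable residual coefficients:
    z' = alpha*MHSA(LN(z)) + beta*z,  z+ = gamma*FFN(LN(z')) + delta*z'."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        # alpha, beta, gamma, delta: learnable scalars initialized to 1 so that
        # training starts from the standard residual block.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(1))
        self.delta = nn.Parameter(torch.ones(1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        z = self.alpha * self.attn(h, h, h, need_weights=False)[0] + self.beta * z
        return self.gamma * self.ffn(self.norm2(z)) + self.delta * z
```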

Additionally, mixed-depth configurations with shallow “non-shared” and deep “shared” branches improve optimization landscapes and provide direct deep supervision.

5. Empirical Performance and Ablation Results

SReT establishes strong empirical results on ImageNet-1K with models that match or exceed contemporary ViT architectures but with reduced parameter and compute budgets. Selected configurations:

| Model | Params (M) | GFLOPs | Top-1 (%) |
|---|---|---|---|
| DeiT-Tiny | 5.7 | 1.3 | 72.2 |
| PiT-Tiny | 4.9 | 0.7 | 73.0 |
| SReT-ExT | 4.0 | 0.7 | 74.0 |
| SReT-Tiny | 4.8 | 1.1 | 76.0 |
| Swin-Tiny | 29.0 | 4.5 | 81.3 |
| SReT-Small | 20.9 | 4.2 | 81.9 |

Ablation studies on DeiT-Tiny show that inclusion of the NLL increases top-1 accuracy by +2.5 pp, and LRC yields a further +0.6 pp improvement. With group self-attention and recursion, SReT can reduce parameter count by 18% (e.g., SReT-ExT) while gaining about 1 pp in accuracy at constant FLOPs.

The recursive design is also portable: applying it to MLP-Mixer (all-MLP architectures) produces similar gains, and on WMT14 En–De machine translation, SReT improves over standard transformers by +0.4 BLEU with more stable training, particularly with LRC (Shen et al., 2021).

6. Complexity Analysis

Let $L$ denote the number of blocks, $B$ the number of recursion loops, $D$ the hidden dimension, $S$ the sequence length, and $G$ the group count.

  • Standard ViT: complexity $O(L S^2 D)$
  • Vanilla recursive SReT (no slicing): complexity $O(L B S^2 D)$
  • Sliced group SReT: complexity $O(L B S^2 D / G)$

Parameter complexity for SReT is $O\big(L D^2 + L (B-1) D_{\mathrm{NLL}}^2\big) \approx O(L D^2)$, close to that of a standard ViT.

Thus, for $B = G$, computational complexity is comparable to a standard ViT, while $G > B$ affords additional savings. This enables extremely deep yet efficient transformer architectures, highlighting the advantage of SReT's design for large-scale and resource-constrained applications.
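
These regimes can be checked with a short numeric sketch; the values of $L$, $B$, $S$, and $D$ are illustrative, and only the leading attention term is counted:

```python
# Leading attention-term complexity for each regime (constants dropped):
#   standard ViT           ~ L * S^2 * D
#   vanilla recursive SReT ~ L * B * S^2 * D
#   sliced group SReT      ~ L * B * S^2 * D / G
L, B, S, D = 12, 2, 196, 384          # illustrative depth, recursions, tokens, width

standard = L * S**2 * D
vanilla_recursive = L * B * S**2 * D
sliced = {G: L * B * S**2 * D / G for G in (1, 2, 4, 8)}

print(vanilla_recursive / standard)   # 2.0 -> recursion alone doubles attention cost
print(sliced[2] / standard)           # 1.0 -> G = B recovers the standard-ViT cost
print(sliced[4] / standard)           # 0.5 -> G > B drops below the baseline
```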

7. Significance and Prospects

SReT demonstrates the practicality of recursive weight sharing and group-wise self-attention in vision transformers, achieving state-of-the-art performance with substantial reductions in FLOPs (10–30%) and parameters (15–30%). The architecture supports construction and effective training of models with virtual depths exceeding 1000 layers, a critical milestone for scalable and efficient transformer-based vision models. The methods introduced in SReT are broadly compatible with existing efficient ViT variants, suggesting immediate applicability and extensibility in architectures targeting diverse domains that require parameter efficiency and depth. The public codebase is available at https://github.com/szq0214/SReT (Shen et al., 2021).

References

  • Shen et al. (2021). Sliced Recursive Transformer.
