Sliceformer: Efficient Transformer Models

Updated 7 October 2025
  • Sliceformer is a family of transformer architectures employing slicing, recursive refinement, and sorting to enhance computational efficiency and adaptability.
  • It incorporates innovations such as sliced group self-attention and slimmable channel slicing, reducing FLOPs by up to 30% while maintaining high performance in vision and language tasks.
  • Empirical results on benchmarks like ImageNet-1K and LRA validate its potential for scalable deployment across devices and diverse applications.

Sliceformer refers to a family of neural network architectures and methodologies that achieve efficient, flexible, and scalable inference within transformer-based models by employing slicing, sorting, and recursive operations. Various studies have explored different approaches under the Sliceformer name, focusing on improving computational efficiency, parameter utilization, and adaptability for vision and discriminative tasks. Key developments include the Sliced Recursive Transformer (SReT) (Shen et al., 2021), the slicing-sorting attention paradigm (Yuan et al., 2023), and the slimmable slicing design for flexible inference (Zhang et al., 6 Dec 2024).

1. Sliced Recursive Structures in Vision Transformers

The Sliced Recursive Transformer (SReT) introduces recursive operations within transformer blocks, where weights are shared across multiple recursions instead of each layer being uniquely parameterized. If $\mathcal{F}_{l-1}$ is a transformer block and $z_{l-1}$ its input, a naive recursion is

$$z_l = \mathcal{F}_{l-1}(\mathcal{F}_{l-1}(z_{l-1}))$$

To avoid degenerate identity mappings, SReT inserts a Non-Linear Projection Layer (NLL) between recursive steps:

$$\mathrm{NLL}(z_{l-1}') = \mathrm{MLP}(\mathrm{LN}(z_{l-1}')) + z_{l-1}'$$

where $z_{l-1}'$ denotes the intermediate output of the shared block within the recursion.

A block with two recursions is:

$$z_l = \mathrm{NLL}_2(\mathcal{F}_{l-1}(\mathrm{NLL}_1(\mathcal{F}_{l-1}(z_{l-1}))))$$

Because the block weights are shared across recursions, this recursive refinement improves feature extraction without adding parameters, enabling effective depths of 100–1000 layers at compact model sizes (13–15M parameters).
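
A minimal PyTorch-style sketch of the weight-shared recursion with NLLs, assuming a generic transformer block module; the class names, widths, and MLP expansion factor are illustrative rather than the reference SReT implementation:

```python
import torch.nn as nn

class NLL(nn.Module):
    """Non-Linear Projection Layer: MLP(LN(z')) + z', breaking the identity mapping."""
    def __init__(self, dim, hidden_mult=1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )

    def forward(self, z):
        return self.mlp(self.norm(z)) + z


class SlicedRecursiveBlock(nn.Module):
    """Applies one shared-weight transformer block F_{l-1} recursively,
    inserting a distinct NLL after every recursion."""
    def __init__(self, block, dim, num_recursions=2):
        super().__init__()
        self.block = block  # shared across all recursions: no extra block parameters
        self.nlls = nn.ModuleList([NLL(dim) for _ in range(num_recursions)])

    def forward(self, z):
        # With num_recursions=2: z_l = NLL_2(F(NLL_1(F(z_{l-1}))))
        for nll in self.nlls:
            z = nll(self.block(z))
        return z
```

With `num_recursions=2` this computes the two-recursion composition above while storing only one set of block weights plus the lightweight NLLs.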

2. Sliced Group Self-Attention and Computational Efficiency

Recursive loops increase computational cost because self-attention is evaluated multiple times per block. SReT mitigates this with "sliced" group self-attention: the token sequence is partitioned into $G_l$ non-overlapping groups and attention is computed within each group. Let $C_{V\text{-}SA}$ denote the vanilla self-attention cost ($O(L^2 \cdot D)$ for sequence length $L$) and $N_l$ the number of recursive loops; the total grouped cost is

$$C_{G\text{-}SA} = \frac{N_l}{G_l} \cdot C_{V\text{-}SA}$$

When $N_l = G_l$, the FLOPs match vanilla self-attention; when $N_l < G_l$, the cost drops further, yielding a 10–30% FLOPs reduction with minimal performance loss. Cross-group information mixing is achieved by inserting permutation and inverse permutation operations around the group computations.
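
A sketch of the grouped pass, assuming standard scaled dot-product attention inside each group and an optional token permutation for cross-group mixing; the helper below is illustrative, not SReT's code:

```python
import torch
import torch.nn.functional as F

def group_self_attention(q, k, v, num_groups, perm=None):
    """Self-attention restricted to non-overlapping token groups.
    q, k, v: (B, L, D) with L divisible by num_groups."""
    B, L, D = q.shape
    g = L // num_groups
    # Fold groups into the batch dimension so attention runs per group:
    # per-group cost ~ O(g^2 D), total ~ O(L^2 D / G_l) instead of O(L^2 D).
    qg, kg, vg = (t.reshape(B * num_groups, g, D) for t in (q, k, v))
    out = F.scaled_dot_product_attention(qg, kg, vg).reshape(B, L, D)
    if perm is not None:
        # Permute tokens here and apply the inverse permutation on a later
        # pass so information mixes across groups.
        out = out[:, perm, :]
    return out
```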

3. Slicing-Sorting Attention Mechanism

Sliceformer (Yuan et al., 2023) replaces traditional multi-head attention (MHA), which has $O(DN^2)$ complexity and softmax normalization, with a slicing-sorting operation that eliminates the "query-key-value" structure. The procedure is:

  1. Linearly project the input $X \in \mathbb{R}^{N \times D}$ to $V = X W_V \in \mathbb{R}^{N \times MD}$ ($M$ heads of width $D$).
  2. Treat each column $v_i$ of $V$ as a "slice."
  3. Sort each slice independently, which implicitly defines permutation matrices $P_i$ such that

$$\mathrm{SliceSort}(X) := \mathrm{Sort}_{col}(V) = \mathrm{Concat}_{col}\big(\{P_i v_i\}_{i=1}^{MD}\big)$$

Permutation matrices are sparse, full rank, and doubly stochastic, so implicit attention maps are well-structured and numerically stable.
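
A minimal sketch of the slicing-sorting operation, assuming a single value projection of width $M \cdot D$ and an ascending per-column sort; module and argument names are illustrative:

```python
import torch
import torch.nn as nn

class SliceSort(nn.Module):
    """Replaces query-key-value attention with a projection plus per-slice sorting."""
    def __init__(self, d_in, num_heads, head_dim):
        super().__init__()
        self.proj = nn.Linear(d_in, num_heads * head_dim, bias=False)  # W_V

    def forward(self, x):
        # x: (B, N, d_in) -> v: (B, N, M*D)
        v = self.proj(x)
        # Sorting each column (slice) along the token axis applies an implicit
        # permutation matrix P_i to that slice; total cost is O(M*D * N log N).
        v_sorted, _ = torch.sort(v, dim=1)
        return v_sorted
```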

Variants include:

  • Max-Exchange: swaps the maximal value to the front of each slice, costing $O(N)$ per slice.
  • Order-Interleave: uses layer- and channel-dependent ascending/descending sorting via

$$\psi_n(i) = \sin\!\left(2^{L-n}\, \pi\, \frac{i}{MD}\right)$$

and applies the resulting sorting direction across slices for attention diversity (a sketch follows below).
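
One way to realize Order-Interleave is to precompute $\psi_n(i)$ per slice and flip the sort direction where it is negative; the sign convention used here is an assumption for illustration:

```python
import math
import torch

def order_interleave_sort(v, n, L_layers):
    """Sort each slice ascending or descending according to psi_n(i).
    v: (B, N, MD); n: current layer index; L_layers: total number of layers."""
    MD = v.shape[-1]
    i = torch.arange(MD, dtype=v.dtype, device=v.device)
    psi = torch.sin((2 ** (L_layers - n)) * math.pi * i / MD)
    descending = psi < 0  # assumed: negative psi -> descending order for that slice
    asc, _ = torch.sort(v, dim=1)
    desc = torch.flip(asc, dims=[1])  # descending sort = reversed ascending sort
    return torch.where(descending.view(1, 1, MD), desc, asc)
```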

4. Flexible Slicing for Inference Adaptation

The Scala framework (Zhang et al., 6 Dec 2024) addresses inference flexibility by enabling a single Vision Transformer (ViT) to represent multiple sub-networks with different channel widths. Smaller ViTs are realized as sub-networks of the full ViT via channel-wise slicing, parameterized by a width ratio $r \in [s, l]$, where $s$ is the smallest and $l$ the largest ratio; with $r = 0.5$, for instance, a sub-network uses half of the channels in every sliced layer.

Key Scala mechanisms:

  • Isolated Activation: to prevent over-activation of the smallest sub-network during sandwich-rule training, the smallest variant ($r = s$) is sliced from the opposite end of the weight tensor. For $r \neq s$,

$$\theta^{(r)} = \theta[:(r C_o),\; :(r C_i)]$$

while

$$\theta^{(s)} = \theta[-(s C_o):,\; -(s C_i):]$$

which isolates $F^{(s)}$ from the other sub-networks.

  • Scale Coordination: combines Progressive Knowledge Transfer (PKT), stable sampling, and noise calibration. At each step multiple sub-networks (the smallest, the largest, and intermediates) are sampled, and PKT uses intermediate-width teacher assistants to distill representations stagewise with KL-divergence losses (a combined sketch of both mechanisms follows this list):

$$\mathcal{L}_{KL}^{(r)} = -\sum_k p_{dis}^{(r')} \log\left(\frac{p_{dis}^{(r)}}{p_{dis}^{(r')}}\right)$$

where $p_{dis}^{(r')}$ is the teacher assistant's predicted distribution and $p_{dis}^{(r)}$ that of the current sub-network.
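
A schematic of the two mechanisms, assuming weights stored as (C_out, C_in) matrices and class logits from a larger teacher-assistant sub-network; the slicing indices and KL form mirror the equations above, while the function names and the role of $r'$ as teacher are illustrative assumptions:

```python
import torch.nn.functional as F

def slice_weight(theta, r, s):
    """Slice a full weight matrix theta (C_out, C_in) for width ratio r;
    s is the smallest ratio."""
    c_out, c_in = theta.shape
    ko, ki = int(r * c_out), int(r * c_in)
    if r == s:
        # Isolated Activation: the smallest sub-network takes the *last* channels,
        # so it does not share the leading channels used by all larger slices.
        return theta[-ko:, -ki:]
    return theta[:ko, :ki]  # theta^{(r)} = theta[:(r C_o), :(r C_i)]

def pkt_kl_loss(student_logits, teacher_logits):
    """KL(p^(r') || p^(r)): distill the teacher assistant's distribution
    into the current sub-network."""
    p_t = F.softmax(teacher_logits.detach(), dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach(), dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return (p_t * (log_p_t - log_p_s)).sum(dim=-1).mean()
```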

5. Empirical Results and Performance Benchmarks

SReT (Shen et al., 2021):
  • On ImageNet-1K, naive recursion alone improves DeiT-Tiny (5.7M parameters) from 72.2% to roughly 74.0% Top-1 accuracy.
  • With the NLL and a learnable residual connection (LRC), SReT reaches 76–77% Top-1.
  • SReT-S and SReT-TL outperform Swin Transformer, T2T-ViT, and comparable models at lower parameter counts and FLOPs.

Slicing-sorting Sliceformer (Yuan et al., 2023):
  • Achieves comparable or superior classification accuracy on the Long Range Arena (LRA) benchmark, including ListOps and text/image classification, and performs well on molecular property prediction.
  • Reduces attention complexity to $O(D N \log N)$ and lowers peak memory cost.
  • With a ViT backbone, the vision Sliceformer outperforms the classic Transformer on discriminative tasks.

Scala (Zhang et al., 6 Dec 2024):
  • Delivers a 1.6% average ImageNet-1K improvement over prior scalable/US-Net-style approaches.
  • Matches Separate Training with a single training run and fewer parameters.
  • Supports slimmable deployment on resource-constrained devices.

Model Variant          | Accuracy (%)    | Params (M) | FLOPs Reduction (%)
DeiT-Tiny baseline     | 72.2            | 5.7        | 0
SReT (naive recursion) | ~74.0           | 5.7        | –
SReT-S (NLL + LRC)     | 76–77           | 13–15      | 10–30
Scala (sliced ViT)     | +1.6 vs. US-Net | various    | reduced

6. Applications

Sliceformer approaches are widely applicable across discriminative tasks:

  • Image Classification: Efficient parameter sharing and slicing-sorting yield high throughput, reduced memory, and strong accuracy (ImageNet-1K, CIFAR variants).
  • Text Classification: Superior to classic Transformer baselines in accuracy and efficiency.
  • Molecular Property Prediction: Reduces parameters by ~30% in Graphormer with competitive performance.
  • Machine Translation: Recursive, group-attention SReT variants prove effective on WMT14 En-De and IWSLT’14 De-En.
  • Dynamic Deployment: Scala’s slimmable slicing allows fast adaptation to computational constraints on edge/mobile devices, with robust accuracy at low widths.

These methods are general enough to integrate orthogonally with numerous transformer-based backbones, hybrid models (CNN-ViT), token pruning, and dense tasks (semantic segmentation, video recognition), fostering adaptable “foundation models” for computer vision.

7. Limitations and Future Directions

Current Sliceformer designs face limitations in representational power:

  • Slicing-sorting’s use of permutation matrices restricts expressive capacity compared to softmax attention, which may limit generative modeling and very large-scale applications.
  • Mode collapse risk is empirically lower, but mechanisms to render slicing-sorting differentiable and fully learnable are not yet established.
  • Interpolation ability at unseen width/scale ratios in slimmable models remains challenging.

Future research aims include:

  • Developing differentiable, learnable slicing mechanisms for generative modeling power.
  • Extending to structured data domains (e.g., point clouds, 3D meshes).
  • Further theoretical analysis of permutation-based attention’s representational bounds.
  • Combining adaptive slicing approaches with token pruning and dynamic resolution for enhanced flexibility and efficiency.

A plausible implication is that Sliceformer architectures could catalyze broader adoption of dynamic, efficient transformer models for vision, language, and multimodal tasks in heterogeneous deployment environments.
