Sliceformer: Efficient Transformer Models
- Sliceformer is a family of transformer architectures employing slicing, recursive refinement, and sorting to enhance computational efficiency and adaptability.
- It incorporates innovations such as sliced group self-attention and slimmable channel slicing, reducing FLOPs by up to 30% while maintaining high performance in vision and language tasks.
- Empirical results on benchmarks like ImageNet-1K and LRA validate its potential for scalable deployment across devices and diverse applications.
Sliceformer refers to a family of neural network architectures and methodologies that achieve efficient, flexible, and scalable inference within transformer-based models by employing slicing, sorting, and recursive operations. Various studies have explored different approaches under the Sliceformer name, focusing on improving computational efficiency, parameter utilization, and adaptability for vision and discriminative tasks. Key developments include the Sliced Recursive Transformer (SReT) (Shen et al., 2021), the slicing-sorting attention paradigm (Yuan et al., 2023), and the slimmable slicing design for flexible inference (Zhang et al., 6 Dec 2024).
1. Sliced Recursive Structures in Vision Transformers
The Sliced Recursive Transformer (SReT) introduces recursive operations within transformer blocks, where weights are shared across multiple recursions instead of having uniquely parameterized layers. If $F$ denotes a transformer block and $\mathbf{z}$ its input, then a naive recursion is
$$\mathbf{z}' = F(F(\mathbf{z})).$$
To avoid degenerate identity mappings, SReT inserts a Non-Linear Projection Layer (NLL) between recursive steps:
$$\mathbf{z}' = \mathrm{NLL}(F(\mathbf{z})).$$
A block with two recursions is:
$$\mathbf{z}' = \mathrm{NLL}_2\big(F(\mathrm{NLL}_1(F(\mathbf{z})))\big).$$
This recursive refinement with shared weights yields improved feature extraction without adding parameters, enabling very deep networks (100–1000 layers) at compact model sizes (13–15M parameters).
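A minimal PyTorch sketch of this recursion with weight sharing is given below; the module name `RecursiveBlock`, the use of `nn.TransformerEncoderLayer` as the shared block $F$, and the small MLP standing in for the NLL are illustrative assumptions, not the SReT reference implementation (which additionally uses learnable residual coefficients).

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Applies one shared transformer block several times, with a small
    non-linear projection layer (NLL) between recursions to avoid the
    degenerate identity mapping of naive recursion."""
    def __init__(self, dim: int, num_heads: int = 4, num_recursions: int = 2):
        super().__init__()
        # a single block whose weights are reused at every recursion
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # one lightweight NLL per recursion step
        self.nlls = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(num_recursions))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # for two recursions: z' = NLL_2(F(NLL_1(F(z))))
        for nll in self.nlls:
            z = nll(self.block(z))
        return z

x = torch.randn(8, 196, 192)            # (batch, tokens, dim)
y = RecursiveBlock(dim=192)(x)
print(y.shape)                          # torch.Size([8, 196, 192])
```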
2. Sliced Group Self-Attention and Computational Efficiency
Recursive loops increase computational cost due to the multiple self-attention passes. SReT mitigates this by employing "sliced" group self-attention: the token sequence is partitioned into non-overlapping groups and attention is computed within each group. Let $\Omega = \mathcal{O}(n^2 d)$ denote the vanilla self-attention cost for sequence length $n$ and embedding dimension $d$; with $G$ groups of length $n/G$, the group attention cost is
$$G \cdot \mathcal{O}\!\left(\left(\tfrac{n}{G}\right)^{2} d\right) = \mathcal{O}\!\left(\tfrac{n^{2} d}{G}\right) = \tfrac{\Omega}{G}.$$
When the number of groups equals the number of recursions, the total FLOPs stay on par with a single pass of vanilla self-attention; with more groups than recursions, the cost drops further. Overall this yields 10–30% fewer FLOPs with minimal performance loss. Cross-group information mixing is achieved by introducing a permutation and its inverse permutation after the group computations.
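A minimal sketch of the grouping idea, assuming a standard PyTorch multi-head attention module; the function name `sliced_group_attention` and the use of a random permutation for cross-group mixing are illustrative choices, not SReT's exact implementation.

```python
import torch
import torch.nn as nn

def sliced_group_attention(x: torch.Tensor, attn: nn.MultiheadAttention,
                           num_groups: int) -> torch.Tensor:
    """Sliced group self-attention sketch: attention is computed only within
    each of `num_groups` non-overlapping token groups, cutting the quadratic
    term of attention by roughly a factor of num_groups."""
    B, N, D = x.shape
    assert N % num_groups == 0, "sequence length must divide evenly into groups"
    g = N // num_groups
    # fold the groups into the batch dimension: (B * num_groups, g, D)
    xg = x.reshape(B * num_groups, g, D)
    out, _ = attn(xg, xg, xg, need_weights=False)
    return out.reshape(B, N, D)

attn = nn.MultiheadAttention(embed_dim=192, num_heads=4, batch_first=True)
x = torch.randn(8, 196, 192)
y = sliced_group_attention(x, attn, num_groups=4)   # 196 tokens -> 4 groups of 49

# cross-group mixing: permute tokens before the next group-attention pass,
# then restore the original order with the inverse permutation
perm = torch.randperm(196)
y_mixed = y[:, perm, :]
inv_perm = torch.argsort(perm)
y_restored = y_mixed[:, inv_perm, :]
```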
3. Slicing-Sorting Attention Mechanism
Sliceformer (Yuan et al., 2023) replaces traditional multi-head attention (MHA), whose cost scales as $\mathcal{O}(N^2 D)$ and which relies on softmax normalization, with a slicing-sorting operation, eliminating the "query-key-value" structure. The process is:
- Linearly project the input $X \in \mathbb{R}^{N \times D}$ to $V = X W_V \in \mathbb{R}^{N \times D}$.
- Each column of $V$ is treated as a "slice."
- Sort each slice along the token dimension, generating permutation matrices $P_d$ so that
  $$\mathrm{SliceSort}(V) = \mathrm{Concat}_{d=1}^{D}\!\big(P_d\, v_{:,d}\big),$$
  where $v_{:,d}$ denotes the $d$-th column (slice) of $V$.
Permutation matrices are sparse, full rank, and doubly stochastic, so implicit attention maps are well-structured and numerically stable.
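A minimal sketch of the slicing-sorting operation under these definitions; the class name `SliceSort` and the single projection $W_V$ follow the description above, while normalization and residual wiring are omitted for brevity.

```python
import torch
import torch.nn as nn

class SliceSort(nn.Module):
    """Slicing-sorting sketch: project the input to V = X @ W_V and sort each
    channel (slice) of V independently along the token axis. The sort applies
    an implicit permutation matrix P_d per column, replacing the softmax
    attention map."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.w_v(x)                        # (B, N, D)
        v_sorted, _ = torch.sort(v, dim=1)     # ascending sort over tokens, per channel
        return v_sorted

x = torch.randn(8, 196, 192)
out = SliceSort(dim=192)(x)                    # same shape as the input
```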
Variants include:
- Max-Exchange: swaps the maximal value to the front of each slice.
- Order-Interleave: uses layer- and channel-dependent ascending/descending sorting, alternating the sort direction across slices and layers for attention diversity (sketched below).
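One plausible reading of Order-Interleave is sketched below: the sort direction alternates with the parity of the layer and channel indices. The parity rule is an illustrative assumption, not necessarily the exact schedule used in the paper.

```python
import torch

def order_interleave_sort(v: torch.Tensor, layer_idx: int) -> torch.Tensor:
    """Illustrative Order-Interleave: sort each slice (channel) ascending or
    descending depending on the parity of (layer index + channel index)."""
    B, N, D = v.shape
    out = torch.empty_like(v)
    for d in range(D):
        descending = (layer_idx + d) % 2 == 1
        out[:, :, d], _ = torch.sort(v[:, :, d], dim=1, descending=descending)
    return out

v = torch.randn(4, 196, 8)
mixed = order_interleave_sort(v, layer_idx=3)
```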
4. Flexible Slicing for Inference Adaptation
The Scala framework (Zhang et al., 6 Dec 2024) addresses inference flexibility by enabling a single Vision Transformer (ViT) to represent multiple sub-networks with different channel widths. Smaller ViTs are realized as sub-networks of a larger ViT through channel-wise slicing. A width ratio $r \in [s, 1.0]$ parameterizes the slicing, with $r = s$ the minimal width and $r = 1.0$ the maximal (full) width.
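Channel-wise slicing can be sketched as a forward pass over a slice of a layer's weight matrix, so one set of parameters serves every width. The helper name `sliced_forward` is illustrative, and the reverse slice shown for the smallest width anticipates the Isolated Activation mechanism described below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sliced_forward(x: torch.Tensor, layer: nn.Linear, r: float,
                   reverse: bool = False) -> torch.Tensor:
    """Forward pass through a channel-wise slice of a linear layer.
    r is the width ratio; reverse=True takes the slice from the opposite end
    of the output channels (used for the smallest sub-network). The sliced
    weights remain views of the full layer, so parameters stay shared."""
    c_out = int(layer.out_features * r)
    sl = slice(-c_out, None) if reverse else slice(0, c_out)
    w = layer.weight[sl]                               # (c_out, in_features)
    b = layer.bias[sl] if layer.bias is not None else None
    return F.linear(x, w, b)

layer = nn.Linear(384, 384)
x = torch.randn(8, 196, 384)
y_small = sliced_forward(x, layer, r=0.25, reverse=True)   # smallest width
y_half  = sliced_forward(x, layer, r=0.5)                  # intermediate width
```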
Key Scala mechanisms:
- Isolated Activation: to prevent over-activation of the smallest sub-network during sandwich-rule training, the smallest variant ($r = s$) uses reverse slicing, taking its channels from the opposite end of the channel dimension, while every other width $r > s$ takes the leading channels; this isolates the smallest sub-network's slice from the others.
- Scale Coordination: combines Progressive Knowledge Transfer (PKT), stable sampling, and noise calibration. Multiple sub-networks (smallest, largest, intermediates) are sampled each iteration, and PKT leverages the wider sub-networks as teacher assistants to distill representations stage-wise via KL-divergence losses between the narrower and wider sub-networks' soft predictions (see the sketch below).
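A hedged sketch of the stage-wise KL distillation in PKT, assuming each sampled sub-network returns class logits; the temperature, the sampled width list, and the loss summation are illustrative choices, not Scala's exact training recipe.

```python
import torch
import torch.nn.functional as F

def pkt_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the soft predictions of a narrower (student)
    sub-network and the next wider (teacher-assistant) sub-network."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Progressive Knowledge Transfer: each sampled width distills from the next
# wider one, so knowledge flows from the full network down to the smallest.
widths = [1.0, 0.75, 0.5, 0.25]                      # sampled widths, widest first
logits = {r: torch.randn(8, 1000) for r in widths}   # stand-in sub-network outputs
loss = sum(pkt_kl_loss(logits[s], logits[t].detach())
           for t, s in zip(widths[:-1], widths[1:]))
```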
5. Empirical Results and Performance Benchmarks
SReT (Shen et al., 2021):
- On ImageNet-1K, naïve recursion in DeiT-Tiny (5.7M) improves Top-1 accuracy from 72.2% to ~74.0%.
- With NLL and learnable residual, SReT reaches 76–77%.
- SReT-S and SReT-TL outperform Swin Transformer, T2T-ViT, and others at lower parameter/FLOPs.
Sliceformer (Yuan et al., 2023):
- Achieves comparable/superior classification performance on Long Range Arena (LRA) benchmarks; performs well on ListOps, text/image classification, and molecular property prediction.
- Reduces computational complexity (sorting costs $\mathcal{O}(N \log N)$ per slice, versus the quadratic token interactions of softmax attention) and peak memory cost.
- In Vision Sliceformer with ViT backbone, outperforms classic Transformer in discriminative tasks.
Scala (Zhang et al., 6 Dec 2024):
- Delivers 1.6% average improvement on ImageNet-1K over prior scalable/US-Net approaches.
- Matches Separate Training with only one training run and fewer parameters.
- Effective for slimmable deployment on resource-constrained devices.
| Model Variant | Accuracy (%) | Params (M) | FLOPs Reduction (%) |
|---|---|---|---|
| DeiT-Tiny baseline | 72.2 | 5.7 | 0 |
| SReT-Recursion | ~74.0 | 5.7 | - |
| SReT-S/NLL+LRC | 76–77 | 13–15 | 10–30 |
| Scala (sliced ViT) | +1.6 (vs US-Net) | Various | Reduced |
6. Applications and Applicability
Sliceformer approaches are widely applicable across discriminative tasks:
- Image Classification: Efficient parameter sharing and slicing-sorting yield high throughput, reduced memory, and strong accuracy (ImageNet-1K, CIFAR variants).
- Text Classification: Superior to classic Transformer baselines in accuracy and efficiency.
- Molecular Property Prediction: Reduces parameters by ~30% in Graphormer with competitive performance.
- Machine Translation: Recursive, group-attention SReT variants prove effective on WMT14 En-De and IWSLT’14 De-En.
- Dynamic Deployment: Scala’s slimmable slicing allows fast adaptation to computational constraints on edge/mobile devices, with robust accuracy at low widths.
These methods are general enough to integrate orthogonally with numerous transformer-based backbones, hybrid models (CNN-ViT), token pruning, and dense tasks (semantic segmentation, video recognition), fostering adaptable “foundation models” for computer vision.
7. Limitations and Future Directions
Current Sliceformer designs impose limitations on representation power:
- Slicing-sorting’s use of permutation matrices restricts expressive capacity compared to softmax attention, which may limit generative modeling and very large-scale applications.
- Mode collapse risk is empirically lower, but mechanisms to render slicing-sorting differentiable and fully learnable are not yet established.
- Interpolation ability at unseen width/scale ratios in slimmable models remains challenging.
Future research aims include:
- Developing differentiable, learnable slicing mechanisms for generative modeling power.
- Extending to structured data domains (e.g., point clouds, 3D meshes).
- Further theoretical analysis of permutation-based attention’s representational bounds.
- Combining adaptive slicing approaches with token pruning and dynamic resolution for enhanced flexibility and efficiency.
A plausible implication is that Sliceformer architectures could catalyze broader adoption of dynamic, efficient transformer models for vision, language, and multimodal tasks in heterogeneous deployment environments.