Sliceformer: Efficient Transformer Models
- Sliceformer is a family of transformer architectures employing slicing, recursive refinement, and sorting to enhance computational efficiency and adaptability.
- It incorporates innovations such as sliced group self-attention and slimmable channel slicing, reducing FLOPs by up to 30% while maintaining high performance in vision and language tasks.
- Empirical results on benchmarks like ImageNet-1K and LRA validate its potential for scalable deployment across devices and diverse applications.
Sliceformer refers to a family of neural network architectures and methodologies that achieve efficient, flexible, and scalable inference within transformer-based models by employing slicing, sorting, and recursive operations. Various studies have explored different approaches under the Sliceformer name, focusing on improving computational efficiency, parameter utilization, and adaptability for vision and discriminative tasks. Key developments include the Sliced Recursive Transformer (SReT) (Shen et al., 2021), the slicing-sorting attention paradigm (Yuan et al., 2023), and the slimmable slicing design for flexible inference (Zhang et al., 6 Dec 2024).
1. Sliced Recursive Structures in Vision Transformers
The Sliced Recursive Transformer (SReT) introduces recursive operations within transformer blocks, where weights are shared across multiple recursions instead of having uniquely parameterized layers. If $F$ denotes a transformer block and $\mathbf{z}$ its input, then a naive recursion is
$$\mathbf{z}' = F(F(\mathbf{z})).$$
To avoid degenerate identity mappings, SReT inserts a Non-Linear Projection Layer (NLL) between recursive steps:
$$\mathbf{z}' = \mathrm{NLL}(F(\mathbf{z})).$$
A block with two recursions is:
$$\mathbf{z}' = \mathrm{NLL}_2\big(F(\mathrm{NLL}_1(F(\mathbf{z})))\big).$$
This recursive refinement with shared weights yields improved feature extraction without adding parameters, enabling very deep networks (100–1000 layers) at compact model sizes (13–15M parameters).
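A minimal PyTorch sketch of this recursion with weight sharing is given below; the module name `RecursiveBlock`, the use of `nn.TransformerEncoderLayer` as the shared block $F$, and the small MLP standing in for the NLL are illustrative assumptions, not the SReT reference implementation (which additionally uses learnable residual coefficients).

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Applies one shared transformer block several times, with a small
    non-linear projection layer (NLL) between recursions to avoid the
    degenerate identity mapping of naive recursion."""
    def __init__(self, dim: int, num_heads: int = 4, num_recursions: int = 2):
        super().__init__()
        # a single block whose weights are reused at every recursion
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # one lightweight NLL per recursion step
        self.nlls = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(num_recursions))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # for two recursions: z' = NLL_2(F(NLL_1(F(z))))
        for nll in self.nlls:
            z = nll(self.block(z))
        return z

x = torch.randn(8, 196, 192)            # (batch, tokens, dim)
y = RecursiveBlock(dim=192)(x)
print(y.shape)                          # torch.Size([8, 196, 192])
```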
2. Sliced Group Self-Attention and Computational Efficiency
Recursive loops increase computational cost due to the multiple self-attention passes. SReT mitigates this by employing "sliced" group self-attention: the token sequence is partitioned into non-overlapping groups and attention is computed within each group. Let $\Omega = \mathcal{O}(n^2 d)$ denote the vanilla self-attention cost for sequence length $n$ and embedding dimension $d$; with $G$ groups of length $n/G$, the group attention cost is
$$G \cdot \mathcal{O}\!\left(\left(\tfrac{n}{G}\right)^{2} d\right) = \mathcal{O}\!\left(\tfrac{n^{2} d}{G}\right) = \tfrac{\Omega}{G}.$$
When the number of groups equals the number of recursions, the total FLOPs stay on par with a single pass of vanilla self-attention; with more groups than recursions, the cost drops further. Overall this yields 10–30% fewer FLOPs with minimal performance loss. Cross-group information mixing is achieved by introducing a permutation and its inverse permutation after the group computations.
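A minimal sketch of the grouping idea, assuming a standard PyTorch multi-head attention module; the function name `sliced_group_attention` and the use of a random permutation for cross-group mixing are illustrative choices, not SReT's exact implementation.

```python
import torch
import torch.nn as nn

def sliced_group_attention(x: torch.Tensor, attn: nn.MultiheadAttention,
                           num_groups: int) -> torch.Tensor:
    """Sliced group self-attention sketch: attention is computed only within
    each of `num_groups` non-overlapping token groups, cutting the quadratic
    term of attention by roughly a factor of num_groups."""
    B, N, D = x.shape
    assert N % num_groups == 0, "sequence length must divide evenly into groups"
    g = N // num_groups
    # fold the groups into the batch dimension: (B * num_groups, g, D)
    xg = x.reshape(B * num_groups, g, D)
    out, _ = attn(xg, xg, xg, need_weights=False)
    return out.reshape(B, N, D)

attn = nn.MultiheadAttention(embed_dim=192, num_heads=4, batch_first=True)
x = torch.randn(8, 196, 192)
y = sliced_group_attention(x, attn, num_groups=4)   # 196 tokens -> 4 groups of 49

# cross-group mixing: permute tokens before the next group-attention pass,
# then restore the original order with the inverse permutation
perm = torch.randperm(196)
y_mixed = y[:, perm, :]
inv_perm = torch.argsort(perm)
y_restored = y_mixed[:, inv_perm, :]
```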
3. Slicing-Sorting Attention Mechanism
Sliceformer (Yuan et al., 2023) replaces traditional multi-head attention (MHA), whose cost scales as $\mathcal{O}(N^2 D)$ and which relies on softmax normalization, with a slicing-sorting operation, eliminating the "query-key-value" structure. The process is:
- Linearly project the input $X \in \mathbb{R}^{N \times D}$ to $V = X W_V \in \mathbb{R}^{N \times D}$.
- Each column of $V$ is treated as a "slice."
- Sort each slice along the token dimension, generating permutation matrices $P_d$ so that
  $$\mathrm{SliceSort}(V) = \mathrm{Concat}_{d=1}^{D}\!\big(P_d\, v_{:,d}\big),$$
  where $v_{:,d}$ denotes the $d$-th column (slice) of $V$.
Permutation matrices are sparse, full rank, and doubly stochastic, so implicit attention maps are well-structured and numerically stable.
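A minimal sketch of the slicing-sorting operation under these definitions; the class name `SliceSort` and the single projection $W_V$ follow the description above, while normalization and residual wiring are omitted for brevity.

```python
import torch
import torch.nn as nn

class SliceSort(nn.Module):
    """Slicing-sorting sketch: project the input to V = X @ W_V and sort each
    channel (slice) of V independently along the token axis. The sort applies
    an implicit permutation matrix P_d per column, replacing the softmax
    attention map."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.w_v(x)                        # (B, N, D)
        v_sorted, _ = torch.sort(v, dim=1)     # ascending sort over tokens, per channel
        return v_sorted

x = torch.randn(8, 196, 192)
out = SliceSort(dim=192)(x)                    # same shape as the input
```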
Variants include:
- Max-Exchange: swaps the maximal value to the front of each slice.
- Order-Interleave: uses layer- and channel-dependent ascending/descending sorting, alternating the sort direction across slices and layers for attention diversity (sketched below).
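One plausible reading of Order-Interleave is sketched below: the sort direction alternates with the parity of the layer and channel indices. The parity rule is an illustrative assumption, not necessarily the exact schedule used in the paper.

```python
import torch

def order_interleave_sort(v: torch.Tensor, layer_idx: int) -> torch.Tensor:
    """Illustrative Order-Interleave: sort each slice (channel) ascending or
    descending depending on the parity of (layer index + channel index)."""
    B, N, D = v.shape
    out = torch.empty_like(v)
    for d in range(D):
        descending = (layer_idx + d) % 2 == 1
        out[:, :, d], _ = torch.sort(v[:, :, d], dim=1, descending=descending)
    return out

v = torch.randn(4, 196, 8)
mixed = order_interleave_sort(v, layer_idx=3)
```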
4. Flexible Slicing for Inference Adaptation
The Scala framework (Zhang et al., 6 Dec 2024) addresses inference flexibility by enabling a single Vision Transformer (ViT) to represent multiple sub-networks with different channel widths. Smaller ViTs are realized as sub-networks of a larger ViT through channel-wise slicing. A width ratio $r \in [s, 1.0]$ parameterizes the slicing, with $r = s$ the minimal width and $r = 1.0$ the maximal (full) width.
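Channel-wise slicing can be sketched as a forward pass over a slice of a layer's weight matrix, so one set of parameters serves every width. The helper name `sliced_forward` is illustrative, and the reverse slice shown for the smallest width anticipates the Isolated Activation mechanism described below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sliced_forward(x: torch.Tensor, layer: nn.Linear, r: float,
                   reverse: bool = False) -> torch.Tensor:
    """Forward pass through a channel-wise slice of a linear layer.
    r is the width ratio; reverse=True takes the slice from the opposite end
    of the output channels (used for the smallest sub-network). The sliced
    weights remain views of the full layer, so parameters stay shared."""
    c_out = int(layer.out_features * r)
    sl = slice(-c_out, None) if reverse else slice(0, c_out)
    w = layer.weight[sl]                               # (c_out, in_features)
    b = layer.bias[sl] if layer.bias is not None else None
    return F.linear(x, w, b)

layer = nn.Linear(384, 384)
x = torch.randn(8, 196, 384)
y_small = sliced_forward(x, layer, r=0.25, reverse=True)   # smallest width
y_half  = sliced_forward(x, layer, r=0.5)                  # intermediate width
```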
Key Scala mechanisms:
- Isolated Activation: to prevent over-activation of the smallest sub-network during sandwich-rule training, the smallest variant ($r = s$) uses reverse slicing, taking its channels from the opposite end of the channel dimension, while every other width $r > s$ takes the leading channels; this isolates the smallest sub-network's slice from the others.
- Scale Coordination: combines Progressive Knowledge Transfer (PKT), stable sampling, and noise calibration. Multiple sub-networks (smallest, largest, intermediates) are sampled each iteration, and PKT leverages the wider sub-networks as teacher assistants to distill representations stage-wise via KL-divergence losses between the narrower and wider sub-networks' soft predictions (see the sketch below).
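A hedged sketch of the stage-wise KL distillation in PKT, assuming each sampled sub-network returns class logits; the temperature, the sampled width list, and the loss summation are illustrative choices, not Scala's exact training recipe.

```python
import torch
import torch.nn.functional as F

def pkt_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the soft predictions of a narrower (student)
    sub-network and the next wider (teacher-assistant) sub-network."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Progressive Knowledge Transfer: each sampled width distills from the next
# wider one, so knowledge flows from the full network down to the smallest.
widths = [1.0, 0.75, 0.5, 0.25]                      # sampled widths, widest first
logits = {r: torch.randn(8, 1000) for r in widths}   # stand-in sub-network outputs
loss = sum(pkt_kl_loss(logits[s], logits[t].detach())
           for t, s in zip(widths[:-1], widths[1:]))
```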
5. Empirical Results and Performance Benchmarks
SReT (Shen et al., 2021):
- On ImageNet-1K, naïve recursion in DeiT-Tiny (5.7M) improves Top-1 accuracy from 72.2% to ~74.0%.
- With NLL and learnable residual, SReT reaches 76–77%.
- SReT-S and SReT-TL outperform Swin Transformer, T2T-ViT, and others at lower parameter/FLOPs.
Sliceformer (Yuan et al., 2023):
- Achieves comparable/superior classification performance on Long Range Arena (LRA) benchmarks; performs well on ListOps, text/image classification, and molecular property prediction.
- Reduces computational complexity (sorting costs $\mathcal{O}(N \log N)$ per slice, versus the quadratic token interactions of softmax attention) and peak memory cost.
- In Vision Sliceformer with ViT backbone, outperforms classic Transformer in discriminative tasks.
Scala (Zhang et al., 6 Dec 2024):
- Delivers 1.6% average improvement on ImageNet-1K over prior scalable/US-Net approaches.
- Matches Separate Training with only one training run and fewer parameters.
- Effective for slimmable deployment on resource-constrained devices.
| Model Variant | Accuracy (%) | Params (M) | FLOPs Reduction (%) |
|---|---|---|---|
| DeiT-Tiny baseline | 72.2 | 5.7 | 0 |
| SReT-Recursion | ~74.0 | 5.7 | - |
| SReT-S/NLL+LRC | 76–77 | 13–15 | 10–30 |
| Scala (sliced ViT) | +1.6 (vs US-Net) | Various | Reduced |
6. Applications and Applicability
Sliceformer approaches are widely applicable across discriminative tasks:
- Image Classification: Efficient parameter sharing and slicing-sorting yield high throughput, reduced memory, and strong accuracy (ImageNet-1K, CIFAR variants).
- Text Classification: Superior to classic Transformer baselines in accuracy and efficiency.
- Molecular Property Prediction: Reduces parameters by ~30% in Graphormer with competitive performance.
- Machine Translation: Recursive, group-attention SReT variants prove effective on WMT14 En-De and IWSLT’14 De-En.
- Dynamic Deployment: Scala’s slimmable slicing allows fast adaptation to computational constraints on edge/mobile devices, with robust accuracy at low widths.
These methods are general enough to integrate orthogonally with numerous transformer-based backbones, hybrid models (CNN-ViT), token pruning, and dense tasks (semantic segmentation, video recognition), fostering adaptable “foundation models” for computer vision.
7. Limitations and Future Directions
Current Sliceformer designs impose limitations on representation power:
- Slicing-sorting’s use of permutation matrices restricts expressive capacity compared to softmax attention, which may limit generative modeling and very large-scale applications.
- Mode collapse risk is empirically lower, but mechanisms to render slicing-sorting differentiable and fully learnable are not yet established.
- Interpolation ability at unseen width/scale ratios in slimmable models remains challenging.
Future research aims include:
- Developing differentiable, learnable slicing mechanisms for generative modeling power.
- Extending to structured data domains (e.g., point clouds, 3D meshes).
- Further theoretical analysis of permutation-based attention’s representational bounds.
- Combining adaptive slicing approaches with token pruning and dynamic resolution for enhanced flexibility and efficiency.
A plausible implication is that Sliceformer architectures could catalyze broader adoption of dynamic, efficient transformer models for vision, language, and multimodal tasks in heterogeneous deployment environments.