Selective Spatiotemporal Vision Transformer (SSViT)
- SSViT is a vision architecture that leverages selective spatiotemporal attention to process dynamic visual data with high computational efficiency.
- It integrates biologically inspired spiking neural networks with deep learning techniques for tasks like image classification and modulo video recovery.
- The design reduces complexity by employing innovative token selection and self-attention modules, achieving superior accuracy with lower memory and FLOPs.
The Selective Spatiotemporal Vision Transformer (SSViT) refers to a class of architectures designed for efficient, high-accuracy spatiotemporal modeling in vision tasks. Two principal and distinct instantiations have been developed: one tailored for spiking neural networks with biological inspiration for edge computing, and another for deep learning-driven modulo video recovery using token selection strategies. Both directions are unified by their core strategy of selectively attending to crucial spatiotemporal regions, either via spike-driven mechanisms or through data-driven token selection, thereby achieving strong performance with significant computational and memory gains.
1. Architectural Foundations
Spiking SSViT (SNN-ViT with SSSA)
The spiking variant of SSViT is built upon two components:
- Global–Local Spiking Patch Splitting (GL-SPS): Transforms raw images into multi-scale spiking feature maps, partitioning input into patches suitable for spiking processing.
- Stacked Spiking Transformer Blocks: Each block includes a Saccadic Spike Self-Attention (SSSA) module, followed by a channel-wise MLP layer.
The architecture processes image data hierarchically in a 4-stage pyramid. At each subsequent stage, the spatial resolution halves and the channel dimension typically doubles. Data flows through the structure as: Input Image → GL-SPS → SSSA-Block → MLP → next stage.
Within each stage, tokens representing spatial patches across timesteps and channels are processed by SSSA for spatiotemporal mixing, then passed through the MLP. The spiking neuron follows the Leaky Integrate-and-Fire (LIF) model:
$$U[t] = \beta\,U[t-1]\bigl(1 - S[t-1]\bigr) + I[t], \qquad S[t] = \Theta\bigl(U[t] - u_{\mathrm{th}}\bigr),$$
where $U[t]$ is the membrane potential, $I[t]$ the synaptic input, $\beta$ the decay constant, $u_{\mathrm{th}}$ the firing threshold, $S[t]$ the binary spike, and $\Theta(\cdot)$ the Heaviside step function.
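A minimal sketch of these LIF dynamics (assuming a hard reset after firing and the notation above; the actual SNN-ViT implementation may differ in reset rule and decay handling):

```python
import torch

def lif_forward(x, beta=0.5, u_th=1.0):
    """Simulate a layer of LIF neurons over T timesteps.

    x: synaptic input of shape (T, N) -- T timesteps, N neurons.
    Returns binary spikes of shape (T, N).
    """
    T, N = x.shape
    u = torch.zeros(N)            # membrane potential
    spikes = []
    for t in range(T):
        u = beta * u + x[t]       # leaky integration of synaptic input
        s = (u >= u_th).float()   # Heaviside step: spike if threshold crossed
        u = u * (1.0 - s)         # hard reset of neurons that fired
        spikes.append(s)
    return torch.stack(spikes)

# Example: random input currents for 4 timesteps and 8 neurons
out = lif_forward(torch.rand(4, 8))
print(out.shape)                  # torch.Size([4, 8]); entries are binary spikes
```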
Deep Learning SSViT for Modulo Video
In modulo video recovery, SSViT takes a window of low-bit observations produced by modulo cameras, with the goal of reconstructing the underlying high-dynamic-range (HDR) content. The pipeline consists of:
- Preprocessing: A sliding window of consecutive A-bit frames; folding masks and fold counts are extracted.
- Encoder: A shared CNN-style encoder produces a 4D feature map, which is further split into spatiotemporal tubes and projected into tokens (a tokenization sketch appears below).
- Token Selection: Intricate regions are located using a 3D Neighboring Similarity Matrix (NSM), and only the tokens with the highest NSM scores are processed by the Transformer backbone.
- Transformer: Joint space-time attention is performed over the selected tokens (from the target frame) and all tokens from supporting frames (for context).
- Decoder: Patchwise binary mask prediction for folding recovery.
Notably, all positional embeddings are omitted, justified empirically by the data-driven nature of the mask classification task.
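As an illustration of the encoder-to-token step, the following is a minimal sketch of splitting a 4D feature map into spatiotemporal tubes and projecting them to tokens; the tube sizes, feature dimensions, and function name are illustrative assumptions, not the paper's exact configuration:

```python
import torch

def tube_tokenize(feat, tube_t=2, tube_h=4, tube_w=4, dim=256):
    """Split a (T, C, H, W) feature map into non-overlapping spatiotemporal
    tubes and project each tube to a token vector (illustrative sizes)."""
    T, C, H, W = feat.shape
    assert T % tube_t == 0 and H % tube_h == 0 and W % tube_w == 0
    # Rearrange into tubes: (nT, nH, nW, tube_t * C * tube_h * tube_w)
    tubes = (feat
             .reshape(T // tube_t, tube_t, C, H // tube_h, tube_h, W // tube_w, tube_w)
             .permute(0, 3, 5, 1, 2, 4, 6)
             .reshape(T // tube_t, H // tube_h, W // tube_w, -1))
    proj = torch.nn.Linear(tubes.shape[-1], dim)   # token projection (randomly initialized here)
    tokens = proj(tubes).flatten(0, 2)             # (num_tubes, dim)
    return tokens

tokens = tube_tokenize(torch.randn(4, 64, 32, 32))
print(tokens.shape)   # torch.Size([128, 256])
```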
2. Selective Attention Mechanisms
Saccadic Spike Self-Attention (SSSA)
Vanilla dot-product self-attention fails on spike-based, binary, and sparse representations due to magnitude fluctuations. SSSA replaces it with spike distribution-based spatial relevance:
- Each $D$-dimensional spike vector is modeled as a Bernoulli process (firing rate $p_Q$ for the Query, $p_K$ for the Key).
- The cross-entropy between the Query and Key distributions serves as the relevance measure:
$$H(Q_i, K_j) = -\bigl[\,p_{Q_i}\log p_{K_j} + (1 - p_{Q_i})\log(1 - p_{K_j})\,\bigr].$$
The silent-period term is dropped, and the remaining logarithm is approximated linearly for spike rates in the typical 0.1–0.2 range.
- For all tokens, these pairwise scores form a distribution-based relevance matrix computed directly from the per-token firing rates.
- The cross-attention then aggregates the Value spikes with these relevance weights in place of dot-product similarities (a simplified sketch follows).
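A simplified sketch of this idea, computing per-token Bernoulli firing rates and their pairwise cross-entropies; the softmax normalization and function name are illustrative choices, and the published SSSA additionally uses a linearized approximation and saccadic gating not shown here:

```python
import torch

def bernoulli_relevance(q_spikes, k_spikes, v_spikes, eps=1e-6):
    """Distribution-based spike attention (simplified illustration).

    q_spikes, k_spikes, v_spikes: binary tensors of shape (N, D),
    N tokens with D channels each.
    """
    # Per-token firing rates under the Bernoulli model
    p_q = q_spikes.float().mean(dim=1).clamp(eps, 1 - eps)   # (N,)
    p_k = k_spikes.float().mean(dim=1).clamp(eps, 1 - eps)   # (N,)
    # Pairwise cross-entropy H(p_q_i, p_k_j) as a relevance score
    ce = -(p_q[:, None] * torch.log(p_k[None, :])
           + (1 - p_q[:, None]) * torch.log(1 - p_k[None, :]))  # (N, N)
    attn = torch.softmax(-ce, dim=-1)        # lower cross-entropy => higher weight
    return attn @ v_spikes.float()           # (N, D) mixed output

N, D = 16, 64
q = (torch.rand(N, D) < 0.15).int()          # sparse binary spike maps
k = (torch.rand(N, D) < 0.15).int()
v = (torch.rand(N, D) < 0.15).int()
print(bernoulli_relevance(q, k, v).shape)    # torch.Size([16, 64])
```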
Saccadic Interaction Module
Inspired by biological saccades, the attention mechanism dynamically selects spatial locations at each timestep using:
- Salient-patch scoring: each spatial patch receives a salience score derived from its spike activity at the current timestep.
- Temporal salience accumulation via a learnable lower-triangular matrix, so that each timestep aggregates only current and past salience.
- Training: the accumulated salience acts as a soft, differentiable gate over patches across all timesteps.
- Inference: gating depends only on the current salience and a dynamically adjusted threshold, preserving event-driven operation (a minimal sketch follows this list).
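A minimal sketch of causal salience accumulation and gating with a learnable lower-triangular matrix; the class name, initialization, and the sigmoid soft gate are illustrative assumptions:

```python
import torch

class SaccadicGate(torch.nn.Module):
    """Accumulate per-patch salience causally over T timesteps and gate patches."""
    def __init__(self, T, threshold=0.5):
        super().__init__()
        self.L = torch.nn.Parameter(torch.randn(T, T) * 0.1)  # learnable mixing weights
        self.threshold = threshold

    def forward(self, salience):
        # salience: (T, N) per-timestep, per-patch scores
        L_causal = torch.tril(self.L)            # keep only current and past timesteps
        accumulated = L_causal @ salience        # (T, N) causal temporal accumulation
        gate = torch.sigmoid(accumulated - self.threshold)   # soft gate for training
        return gate                              # at inference: (accumulated > threshold)

gate = SaccadicGate(T=4)(torch.rand(4, 64))
print(gate.shape)                                # torch.Size([4, 64])
```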
Token Selection via 3D NSM
For deep learning-based SSViT, regions likely to require non-trivial recovery are identified by heterogeneity in local features:
- The NSM combines the Kullback–Leibler divergence between the softmax of local features and a uniform distribution with the average cosine dissimilarity to neighboring features, so that heterogeneous regions score highly.
- Only the top-k tokens by average NSM score are chosen for full attention, focusing resources where folding ambiguity is highest (a scoring sketch follows).
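A sketch of the scoring-and-selection step described above; the equal weighting of the two terms, the use of a global rather than 3D-local neighborhood, and the function name are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def nsm_select(tokens, k):
    """Score tokens by local heterogeneity and keep the top-k.

    tokens: (N, D) token features from the encoder.
    Scores combine (i) KL divergence of each token's softmaxed features from a
    uniform distribution and (ii) average cosine dissimilarity to other tokens
    (a global stand-in for the 3D neighborhood used in the paper).
    """
    N, D = tokens.shape
    p = F.softmax(tokens, dim=-1)                              # (N, D)
    uniform = torch.full_like(p, 1.0 / D)
    kl = (p * (p.clamp_min(1e-9) / uniform).log()).sum(-1)     # KL(p || U), shape (N,)
    sim = F.cosine_similarity(tokens[:, None, :], tokens[None, :, :], dim=-1)
    dissim = (1.0 - sim).mean(dim=1)                           # (N,)
    score = kl + dissim                                        # equal weighting (assumed)
    keep = score.topk(k).indices                               # indices of selected tokens
    return tokens[keep], keep

selected, idx = nsm_select(torch.randn(100, 32), k=25)
print(selected.shape, idx.shape)   # torch.Size([25, 32]) torch.Size([25])
```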
3. Computational Efficiency and Complexity Analysis
Spiking SSViT (SNN-ViT)
- Baseline self-attention scales quadratically (or worse) with the number of tokens.
- SSSA exploits distribution-kernel factorizations, reducing the cost to scale linearly with the token count.
- SSSA-V2 further linearizes computation by compressing kernel operations and thresholding, maintaining full spatiotemporal selectivity without quadratic overhead (see the reordering sketch below).
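The sketch below shows the generic reordering identity that such kernel factorizations rely on; the actual SSSA-V2 spike-distribution kernel is not reproduced here, so plain random matrices stand in for illustration:

```python
import torch

# Generic linear-attention reordering: with a factorized (non-softmax) kernel,
# (Q @ K.T) @ V costs O(N^2 D), whereas Q @ (K.T @ V) costs O(N D^2) --
# linear in the token count N. Kernels that factorize across tokens admit
# this reordering; the specific SSSA-V2 kernel is omitted here.
N, D = 1024, 64
Q, K, V = (torch.rand(N, D) for _ in range(3))

quadratic = (Q @ K.T) @ V          # materializes an N x N attention map
linear = Q @ (K.T @ V)             # only a D x D intermediate, same result
print(torch.allclose(quadratic, linear, rtol=1e-4, atol=1e-3))   # True
```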
Modulo Video SSViT
- Token selection avoids processing all tokens: only the selected subset is passed through full Transformer attention.
- Unselected tokens are handled by warping previously predicted masks with FlowNet2-inferred optical flow, circumventing repeated attention calculations (a warping sketch follows this list).
- Feature encodings are cached, since each frame is encoded only once per iteration.
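A minimal sketch of flow-based mask warping; the flow field is assumed to be supplied externally (e.g., by FlowNet2), and nearest-neighbor sampling is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def warp_mask(mask, flow):
    """Warp a predicted folding mask to another frame with a dense flow field.

    mask: (1, 1, H, W) binary/probability mask from the source frame.
    flow: (1, 2, H, W) optical flow in pixels, mapping target coordinates
          back to source coordinates.
    """
    _, _, H, W = mask.shape
    # Base sampling grid in pixel coordinates, (x, y) order
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()              # (H, W, 2)
    grid = grid + flow[0].permute(1, 2, 0)                    # displace by the flow
    # Normalize to [-1, 1] as required by grid_sample
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(mask, grid.unsqueeze(0), mode="nearest", align_corners=True)

warped = warp_mask(torch.randint(0, 2, (1, 1, 64, 64)).float(), torch.zeros(1, 2, 64, 64))
print(warped.shape)   # torch.Size([1, 1, 64, 64])
```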
| Method/Architecture | Complexity | Memory/FLOPs Saving |
|---|---|---|
| Spiking SSViT (SSSA-V2) | Linear in token count | Full spike-driven linear scaling |
| Modulo Video SSViT (Selection) | Attention over selected tokens only | Reduced FLOPs and memory vs. full attention |
4. Experimental Performance and Benchmarking
Spiking SSViT
- Image Classification (CIFAR100, T=4):
SNN-ViT achieves 80.1% accuracy with 5.6M parameters at linear attention complexity, outperforming Spikformer (78.2%, 9.3M parameters) and Spike-driven ViTs (78.4%, 10.3M parameters).
- ImageNet-1K (T=4):
SNN-ViT-8-512 achieves 80.2% Top-1 accuracy at 35.8 mJ energy per sample, competitive with a standard ViT-12-768 (77.9%, 80.9 mJ).
- Remote-sensing Detection:
Using SNN-ViT as the backbone in YOLO-v3 improves mAP@0.5 on the SSDD dataset from 94.8% to 96.7%.
Modulo Video SSViT
- Datasets:
Tested on LiU (12 sequences, 1280×720) and HdM (up to 1920×1080).
- Metrics:
Reconstruction quality is evaluated against ground-truth HDR frames.
- Qualitative Assessment:
In high-dynamic-range scenes, SSViT recovers fold edges and fine detail that prior methods tend to lose.
5. Training and Optimization Methodologies
- Spiking SSViT:
- Supervised cross-entropy loss on final membrane potentials.
- Surrogate gradients are applied for backpropagation through the non-differentiable Heaviside spiking function (a minimal surrogate-gradient sketch follows this list).
- Regularization via weight decay together with scheduling of the spiking hyperparameters.
- Modulo Video SSViT:
- Standard cross-entropy loss is applied to binary folding mask prediction, accumulated over selected tokens at each iteration.
- Training is end-to-end with the Adam optimizer for 200k iterations on fixed-length clips.
- Outputs are temporally tone-mapped with a smoothed Reinhard operator.
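A minimal sketch of a surrogate-gradient spike function; the sigmoid-derivative surrogate shown here is one common choice, not necessarily the one used in SNN-ViT:

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a surrogate gradient for backpropagation.

    Forward: hard threshold. Backward: a smooth sigmoid-derivative surrogate.
    """
    @staticmethod
    def forward(ctx, u, u_th, alpha):
        ctx.save_for_backward(u)
        ctx.u_th, ctx.alpha = u_th, alpha
        return (u >= u_th).float()

    @staticmethod
    def backward(ctx, grad_out):
        (u,) = ctx.saved_tensors
        sig = torch.sigmoid(ctx.alpha * (u - ctx.u_th))
        surrogate = ctx.alpha * sig * (1 - sig)       # derivative of the smooth sigmoid
        return grad_out * surrogate, None, None       # no grads for u_th, alpha

u = torch.randn(8, requires_grad=True)
spikes = SpikeFn.apply(u, 1.0, 4.0)
spikes.sum().backward()
print(u.grad.shape)   # torch.Size([8])
```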
6. Ablation Studies and Analytical Insights
- Incorporating SSSA (spatial attention) alone raises CIFAR100 accuracy from the 78.21% Spikformer baseline to 79.60%; GL-SPS (patch splitting) alone yields 77.88%.
- SSSA and GL-SPS combined yield 80.10%, roughly a 1.9-point gain over the baseline, while retaining linear attention complexity.
- Micro-ablation of SSSA: replacing the distribution kernel with dot-product or the saccadic neuron with LIF both reduce performance (–0.48% and –0.76%, respectively).
- Linearized SSSA-V2 delivers comparable accuracy to quadratic SSSA-V1.
| Variant | Params | Complexity | CIFAR100 Acc. |
|---|---|---|---|
| Spikformer (Baseline) | 9.32M | | 78.21% |
| + SSSA (spatial only) | 5.52M | | 79.60% |
| + GL-SPS (patch only) | 5.81M | | 77.88% |
| + Both | 5.57M | | 80.10% |
7. Extensions, Limitations, and Future Work
SSViT’s selective spatiotemporal approaches are amenable to other modalities, such as event-based audio and radar, and to scenarios requiring efficient high-dimensional attention. The biological inspiration (the saccade) can generalize to sequence data beyond vision. Practical hardware acceleration is an active direction, especially given the linear complexity and event-driven flow of the spiking architecture.
Current limitations include:
- Reliance on multi-timestep inference (up to T = 8).
- Fixed patch grids; dynamic saccade regions are yet to be realized.
- The learnable accumulation matrix’s lower-triangular constraint may not capture more general temporal relations.
- For modulo recovery, explicit positional encoding is omitted because the task does not require absolute coordinates, but this choice could be re-examined for scene-dependent masks.
Future directions include:
- Learnable dynamic patch grids for adaptive ROI selection in saccadic attention.
- Hardware-software co-design to exploit event-driven, linear complexity models.
- Hybrid SNN-ANN stacks, where shallow SNN front-ends produce sparse, saccade-driven masks for downstream ANN modules.
A plausible implication is that selective, biologically inspired spatiotemporal attention offers a systematic path towards both energy and computational efficiency in domains where attention to sparse, informative regions is critical. The demonstrated state-of-the-art results in both spiking vision tasks and in modulo video recovery indicate broad applicability for SSViT architectures under resource constraints (Wang et al., 18 Feb 2025, Geng et al., 9 Nov 2025).